TabSTAR: A Foundation Tabular Model With Semantically Target-Aware Representations

Summary

The authors propose TabSTAR, a foundation model that brings together text and table structure. Similar to TP-BERTa, each cell is transformed into a single token that encompasses both the cell value and the column name. The authors introduce a unified classification head that allows for cross-table training without learning task-specific heads.

Approach

Architecture

TabSTAR Architecture (figure)

Verbalization

  • Serialize each cell into text, e.g. "column_name: cell_value".
  • For numerical features, apply quantile binning (10 bins) and use scale-aware text for the textual part (e.g. "Age: 40–50 (Quantile 50–60%)"). Additionally, learn a magnitude-aware embedding for each bin (see the sketch below).
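
A minimal sketch of this verbalization step in Python. The function names, the exact text template, and the returned bin index are illustrative assumptions, not the paper's code:

```python
import numpy as np

def verbalize_cell(column: str, value) -> str:
    """Serialize one cell as 'column_name: cell_value' text."""
    return f"{column}: {value}"

def verbalize_numeric(column: str, value: float, train_values: np.ndarray,
                      n_bins: int = 10) -> tuple[str, int]:
    """Quantile-bin a numeric value and return scale-aware text plus the
    bin index; the index would select a learnable magnitude-aware
    embedding downstream."""
    edges = np.quantile(train_values, np.linspace(0, 1, n_bins + 1))
    bin_idx = int(np.clip(np.searchsorted(edges, value, side="right") - 1,
                          0, n_bins - 1))
    lo, hi = edges[bin_idx], edges[bin_idx + 1]
    pct = 100 // n_bins
    text = (f"{column}: {lo:.0f}-{hi:.0f} "
            f"(Quantile {bin_idx * pct}-{(bin_idx + 1) * pct}%)")
    return text, bin_idx

# Example: with ages spread uniformly over 0-99,
# verbalize_numeric("Age", 43.0, ages) -> ("Age: 40-50 (Quantile 40-50%)", 4)
```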

Text Encoding

  • Use the e5 model to embed the tokens of each verbalized cell and apply attention.
    • The e5 encoder is only lightly fine-tuned (the top N layers are unfrozen).
  • A single-layer transformer block then fuses the token embeddings into a single cell embedding (similar to the IFA module); see the sketch below.
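
A sketch of this cell encoder under stated assumptions: the exact e5 checkpoint (`intfloat/e5-base-v2` here), the BERT-style layer layout, the learned pooling token, and all module names are illustrative, and the padding mask is omitted for brevity:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class CellEncoder(nn.Module):
    """Encode each verbalized cell text into one embedding:
    e5 token embeddings -> single-layer transformer fusion (IFA-style)."""

    def __init__(self, model_name: str = "intfloat/e5-base-v2",  # assumed checkpoint
                 n_unfrozen: int = 6):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.e5 = AutoModel.from_pretrained(model_name)
        # Lightly fine-tune: freeze everything, then unfreeze the top N layers.
        for p in self.e5.parameters():
            p.requires_grad = False
        for layer in self.e5.encoder.layer[-n_unfrozen:]:  # BERT-style layout assumed
            for p in layer.parameters():
                p.requires_grad = True
        d = self.e5.config.hidden_size
        fusion_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)
        self.pool_token = nn.Parameter(torch.randn(1, 1, d))  # learned pooling token

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, padding=True, truncation=True,
                               return_tensors="pt")
        tokens = self.e5(**batch).last_hidden_state            # (B, T, d)
        pool = self.pool_token.expand(tokens.size(0), -1, -1)  # (B, 1, d)
        fused = self.fusion(torch.cat([pool, tokens], dim=1))  # padding mask omitted
        return fused[:, 0]                                     # one vector per cell
```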

Classification

  • The sequence, where each token is a cell, is then passed through a 6-layer transformer that models the interactions between the cells.
  • The label tokens of the resulting sequence are then used as input to a linear classifier head (see the sketch below).
    • A single shared linear layer turns each label token into a logit.
    • The label tokens are appended to the input sequence after all the row information.
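
A sketch of the interaction transformer and the unified head; the module and parameter names are hypothetical, but the structure follows the description above:

```python
import torch
import torch.nn as nn

class InteractionHead(nn.Module):
    """6-layer interaction transformer plus unified classification head:
    a single shared linear layer maps each contextualized label token
    to one logit."""

    def __init__(self, d: int = 768, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.interaction = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.to_logit = nn.Linear(d, 1)  # shared across all labels and tables

    def forward(self, cell_tokens: torch.Tensor,
                label_tokens: torch.Tensor) -> torch.Tensor:
        # cell_tokens:  (B, n_cells, d)  - one token per verbalized cell
        # label_tokens: (B, n_labels, d) - verbalized labels, appended after the row
        seq = torch.cat([cell_tokens, label_tokens], dim=1)
        out = self.interaction(seq)                   # (B, n_cells + n_labels, d)
        label_out = out[:, -label_tokens.size(1):]    # contextualized label tokens
        return self.to_logit(label_out).squeeze(-1)   # (B, n_labels) logits
```

Because the logit layer is shared and the number of label tokens simply varies per table, the same head works across tables with different label sets, which is what enables cross-table training without task-specific heads.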

Experiments

Training Data

  • Manually curated 350 tabular datasets (253 classification, 97 regression).

Evaluation Data

  • AutoML Multimodal Benchmark¹
  • Benchmark from CARTE²

Baselines

  • TabPFNv2 (closed-source text version), CARTE

Findings

  • TabSTAR is strong on classification tasks
    • TabPFNv2 comes very close on datasets with < 10k samples, but on larger datasets the gap widens.
  • TabPFNv2 does better on regression than the other NN-based methods, but GBDTs are still much better.
  • Fine-tuning the encoder (e5) leads to better performance.
    • The results show that unfreezing the top 9 layers actually performs worse than unfreezing 6 or 3, which the authors attribute to a lack of data.
  • Pre-training is significantly better than training from scratch.
  • Adding both the quantile name (string) and quantile vector (learnable param) is better than using only one of them.

Footnotes

  1. https://github.com/sxjscience/automl_multimodal_benchmark

  2. arXiv