Making Pre-trained Language Models Great on Tabular Prediction
- Published on
- Name: Inwon Kang
Summary
The paper presents TP-BERTa, a fine-tuned RoBERTa-based model for tabular data classification. The authors use what they call Relative Magnitude Tokens (decision tree binning + scaling by magnitude) in combination with an Intra-Feature Attention module.
The authors propose:
- Magnitude-aware regularization to share bin embeddings across features/datasets.
- Intra-Feature Attention (IFA) to fuse a feature's tokens before the features are merged.
Approach
(Figure: overview of the approach.)
Relative Magnitude Tokens:
- Decision tree binning + multiply by magnitude (see the first sketch after this list).
- Magnitude-aware regularization to share the bin embeddings across features/datasets with a triplet loss (also sketched below).
- Triplet loss: similar to contrastive loss, but uses a triplet formulation with an anchor, a positive, and a negative sample.
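A minimal sketch of the Relative Magnitude Token idea, assuming a single numeric feature with supervision: a shallow decision tree supplies the bin edges, and each value is mapped to a shared bin embedding scaled by the value's normalized magnitude. The names `tree_bin_edges`, `relative_magnitude_tokens`, and `bin_embeddings` are illustrative, not the paper's API.

```python
import numpy as np
import torch
from sklearn.tree import DecisionTreeRegressor


def tree_bin_edges(x: np.ndarray, y: np.ndarray, max_bins: int = 8) -> np.ndarray:
    """Fit a shallow tree on one feature and use its split thresholds as bin edges."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_bins).fit(x.reshape(-1, 1), y)
    thresholds = tree.tree_.threshold
    return np.sort(thresholds[thresholds != -2.0])  # -2.0 marks leaf nodes


def relative_magnitude_tokens(x: np.ndarray, edges: np.ndarray,
                              bin_embeddings: torch.nn.Embedding) -> torch.Tensor:
    """Map each value to its bin embedding, scaled by the value's normalized magnitude."""
    bin_ids = np.searchsorted(edges, x)                    # which bin each value falls into
    mag = (x - x.min()) / (x.max() - x.min() + 1e-8)       # magnitude normalized to [0, 1]
    emb = bin_embeddings(torch.as_tensor(bin_ids, dtype=torch.long))
    return emb * torch.as_tensor(mag, dtype=torch.float32).unsqueeze(-1)


# Usage: one shared embedding table, reused for every numerical feature.
x = np.random.randn(256)
y = 3.0 * x + 0.1 * np.random.randn(256)
edges = tree_bin_edges(x, y, max_bins=8)                   # at most 7 edges -> at most 8 bins
bin_emb = torch.nn.Embedding(num_embeddings=8, embedding_dim=768)
tokens = relative_magnitude_tokens(x, edges, bin_emb)
print(tokens.shape)                                        # torch.Size([256, 768])
```

And a rough sketch of the magnitude-aware triplet regularization over that shared bin embedding table: for an anchor bin, a bin with a closer magnitude acts as the positive and a more distant bin as the negative, so embeddings of numerically close bins stay closer together. The exhaustive triplet sampling below is a simplification, not the paper's exact scheme.

```python
import torch


def magnitude_triplet_loss(bin_embeddings: torch.nn.Embedding, margin: float = 1.0) -> torch.Tensor:
    """Pull bins with nearby magnitudes together, push far-apart bins away (illustrative sampling)."""
    n = bin_embeddings.num_embeddings
    anchors, positives, negatives = [], [], []
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # j is a valid positive for anchor i only if it is strictly
                # closer in magnitude (bin index) than the negative k
                if j != i and abs(i - j) < abs(i - k):
                    anchors.append(i)
                    positives.append(j)
                    negatives.append(k)
    lookup = lambda ids: bin_embeddings(torch.tensor(ids, dtype=torch.long))
    loss_fn = torch.nn.TripletMarginLoss(margin=margin)
    return loss_fn(lookup(anchors), lookup(positives), lookup(negatives))


# Usage: add the regularizer to the task loss during (pre-)training.
bin_emb = torch.nn.Embedding(8, 768)
reg = magnitude_triplet_loss(bin_emb)
print(reg.item())
```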
Intra-Feature Attention:
- Mix the different tokens of a feature (name, value) into a [CLS] token before the features are merged. One MHSA module is shared across all features for this (see the sketch below).
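A minimal sketch of how a shared intra-feature attention module could fuse each feature's tokens into one vector, assuming each feature arrives as a short token sequence (feature-name tokens + a value token). The single-layer design, shapes, and class name are simplifications, not the paper's exact module.

```python
import torch
import torch.nn as nn


class IntraFeatureAttention(nn.Module):
    """One shared MHSA module that fuses each feature's tokens into a single [CLS] vector."""

    def __init__(self, dim: int = 768, n_heads: int = 8):
        super().__init__()
        self.cls = nn.Parameter(torch.randn(1, 1, dim))            # learned per-feature [CLS] token
        self.mhsa = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feature_tokens: torch.Tensor) -> torch.Tensor:
        # feature_tokens: (batch, n_features, n_tokens, dim), e.g. name tokens + value token
        b, f, t, d = feature_tokens.shape
        x = feature_tokens.reshape(b * f, t, d)                    # attend within each feature only
        cls = self.cls.expand(b * f, 1, d)
        x = torch.cat([cls, x], dim=1)                             # prepend [CLS] to each feature
        out, _ = self.mhsa(x, x, x)                                # the SAME module serves every feature
        return out[:, 0].reshape(b, f, d)                          # [CLS] position = fused feature vector


# Usage: 2 samples, 4 features, 3 tokens per feature (two name tokens + one value token).
ifa = IntraFeatureAttention()
fused = ifa(torch.randn(2, 4, 3, 768))
print(fused.shape)                                                 # torch.Size([2, 4, 768])
```

The per-feature fused vectors would then be concatenated into the sequence fed to the main transformer.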
Findings
- Starting from the pre-trained RoBERTa weights helps compared to training from scratch.
- IFA also improves performance.
- GBDTs are still better when most features are numerical.
- But TP-BERTa shines on datasets dominated by categorical features.
QUESTION
Why could this be? TabPFN also shows better performance on numerically dominated features.
- XGBoost requires more hyperparameter tuning; CatBoost offers better out-of-the-box performance.