Summary

The paper presents a fine-tuned RoBERTa-based model for tabular data classification. The authors use what they call Relative Magnitude Tokens (decision-tree binning + scaling by magnitude) in combination with Intra-Feature Attention.

The authors propose:

  • Magnitude-aware regularization to share bin embeddings across features/datasets.
  • Intra-Feature Attention (IFA) to mix different tokens of a feature before merging.

Approach

Overview of approach

Relative Magnitude Tokens:

  • Decision tree binning + multiply by magnitude.
  • Magnitude-aware regularization to share the bin embeddings across features/datasets with a triplet loss (sketched in code below the equation).
    • Triplet loss: similar to a contrastive loss, but uses a triplet formulation with an anchor, a positive sample, and a negative sample.
\begin{gather}
L_{reg} = \max\big(\text{dist}(f(k_1), f(k_2)) - \text{dist}(f(k_1), f(k_3)) + \text{mag}(k_1, k_2, k_3),\, 0\big), \\
s.t.~|k_1 - k_2| < |k_1 - k_3| \\
f(k) = \text{LayerNorm}(\text{Linear}(\text{Embed}(k, E))) \\
\text{mag}(k_1, k_2, k_3) = \frac{|k_1 - k_3| - |k_1 - k_2|}{n_{\text{bin}}}
\end{gather}
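A minimal sketch of these two pieces (decision-tree binning and the magnitude-aware triplet regularization), assuming Euclidean distance for dist, bin indices k in [0, n_bin), and triplets pre-filtered to satisfy |k1 - k2| < |k1 - k3|; the names tree_bin_edges, BinEmbedding, and magnitude_triplet_reg are illustrative, not taken from the paper's code:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.tree import DecisionTreeRegressor


def tree_bin_edges(x: np.ndarray, y: np.ndarray, max_bins: int = 32) -> np.ndarray:
    """Decision-tree binning: fit a small tree on one numeric feature and use its
    split thresholds as bin edges (a stand-in for the paper's binning procedure)."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_bins).fit(x.reshape(-1, 1), y)
    internal = tree.tree_.feature >= 0                 # leaves have feature == -2
    return np.sort(tree.tree_.threshold[internal])


class BinEmbedding(nn.Module):
    """f(k) = LayerNorm(Linear(Embed(k, E))): the shared bin-embedding head."""
    def __init__(self, n_bin: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(n_bin, d_model)
        self.linear = nn.Linear(d_model, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, k: torch.Tensor) -> torch.Tensor:
        return self.norm(self.linear(self.embed(k)))


def magnitude_triplet_reg(f: BinEmbedding, k1, k2, k3, n_bin: int) -> torch.Tensor:
    """L_reg = max(dist(f(k1), f(k2)) - dist(f(k1), f(k3)) + mag(k1, k2, k3), 0),
    averaged over a batch of bin-index triplets (LongTensors) with |k1 - k2| < |k1 - k3|."""
    anchor, pos, neg = f(k1), f(k2), f(k3)
    d_pos = F.pairwise_distance(anchor, pos)           # dist(f(k1), f(k2))
    d_neg = F.pairwise_distance(anchor, neg)           # dist(f(k1), f(k3))
    # Margin grows with how much farther k3 is from the anchor bin than k2 is.
    mag = (torch.abs(k1 - k3) - torch.abs(k1 - k2)).float() / n_bin
    return torch.clamp(d_pos - d_neg + mag, min=0).mean()
```

The resulting bin token would then be scaled by the value's magnitude (the "multiply by magnitude" step above) before being combined with the feature-name tokens.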

Intra-Feature Attention:

  • Mix a feature's tokens (name, value) into a CLS token before the features are merged; one MHSA module is shared across all features for this (see the sketch below).
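A rough sketch of how the shared module could look, assuming each feature arrives as a short token sequence with a CLS slot at position 0 followed by its name and value embeddings; IntraFeatureAttention and the tensor layout are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn


class IntraFeatureAttention(nn.Module):
    """Shared MHSA over each feature's own tokens: fuse name/value tokens
    into the per-feature CLS slot before features are merged."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, feat_tokens: torch.Tensor) -> torch.Tensor:
        # feat_tokens: (batch, n_features, n_tokens, d_model);
        # token 0 is the per-feature CLS slot, the rest are name/value tokens.
        b, n_feat, n_tok, d = feat_tokens.shape
        x = feat_tokens.reshape(b * n_feat, n_tok, d)  # each feature is its own sequence
        out, _ = self.attn(x, x, x)                    # the same weights are reused for every feature
        out = self.norm(out + x)                       # residual + LayerNorm
        cls = out[:, 0]                                # keep only the fused CLS token
        return cls.reshape(b, n_feat, d)               # one token per feature for the main encoder


# Usage: fuse each feature's tokens, then hand (batch, n_features, d_model) to the RoBERTa layers.
ifa = IntraFeatureAttention(d_model=768)
fused = ifa(torch.randn(4, 10, 3, 768))  # -> torch.Size([4, 10, 768])
```

Sharing one attention module across features keeps the component schema-agnostic, which presumably is what lets it carry over across datasets with different feature sets.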

Findings

  • Starting from the pre-trained RoBERTa weights helps.
  • IFA helps.
  • GBDTs are still better when most features are numerical.
  • But TP-BERTa shines when categorical features dominate.

    QUESTION

    Why could this be? TabPFN shows better performance on datasets dominated by numerical features.

  • XGBoost requires more tuning; CatBoost offers better out-of-the-box performance.

Resources