Summary

This work proposes a method that converts table rows into a sequence of token embeddings, similar to how NLP models process text. This formulation allows learned information to be reused more flexibly across tables, enabling transfer learning, pre-training/fine-tuning, incremental learning, and even zero-shot inference. The method shows superior performance compared to existing approaches and has been widely adopted by subsequent tabular-learning works.

Approach

Figure: overview of the approach.

Input transformation

  • Categorical: Column name + cell value (tokenized & embedded).
  • Continuous: Column name (tokenized & embedded) * cell value.
  • Then stack everything into a sequence (see the sketch below).
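
A minimal sketch of how I understand this featurization, in PyTorch. The tokenizer interface (a HuggingFace-style `encode`), the shared embedding table, and all names (`token_emb`, `featurize_row`, `d_model`) are my own assumptions, not the paper's actual API:

```python
# Sketch only: assumes a HuggingFace-style tokenizer and a shared embedding table.
import torch
import torch.nn as nn

d_model = 128
vocab_size = 30522                       # e.g. a BERT-style WordPiece vocabulary
token_emb = nn.Embedding(vocab_size, d_model)

def embed_text(text, tokenizer):
    """Tokenize a string (column name and/or cell value) and embed each token."""
    ids = torch.tensor(tokenizer.encode(text, add_special_tokens=False))
    return token_emb(ids)                # (num_tokens, d_model)

def featurize_row(row, schema, tokenizer):
    """Convert one table row (dict of column -> value) into a token-embedding sequence."""
    pieces = []
    for col, value in row.items():
        if schema[col] == "categorical":
            # Column name + cell value, tokenized together and embedded.
            pieces.append(embed_text(f"{col} {value}", tokenizer))
        else:  # continuous
            # Embed the column name, then scale its token embeddings by the cell value.
            pieces.append(embed_text(col, tokenizer) * float(value))
    # Stack everything into one sequence (a [CLS] embedding would be prepended in the model).
    return torch.cat(pieces, dim=0)      # (seq_len, d_model)
```

Because everything is expressed through tokenized column names, rows from tables with different schemas map into the same embedding space, which is what makes the transfer and zero-shot settings below possible.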

Transformer backbone

Pretty much the same as the vanilla transformer, but uses a GLU (gated linear unit) instead of an MLP for the FFN module. For layer $l$:

$$
\begin{gather}
\mathbf{Z}_{\text{att}}^l = \text{MHSA}(\mathbf{Z}^l) \\
\mathbf{Z}^{l+1} = \text{Linear}\big((\mathbf{g}^l \odot \mathbf{Z}_{\text{att}}^l) \oplus \text{Linear}(\mathbf{Z}^l)\big)
\end{gather}
$$

where $\mathbf{g}^l = \sigma(\mathbf{Z}_{\text{att}}^l \mathbf{w}^G) \in [0,1]^n$ is a token-wise gating vector, $\sigma$ is the sigmoid function, $\odot$ is the element-wise product, and $\oplus$ is the element-wise addition.
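
A minimal sketch of one such block in PyTorch, directly following the two equations above. Head count, dimensions, and the absence of LayerNorm/dropout are my assumptions (the equations don't specify them), and the class and variable names are mine:

```python
# Sketch only: gated FFN block as I read the equations, details assumed where unspecified.
import torch
import torch.nn as nn

class GatedTransformerBlock(nn.Module):
    def __init__(self, d_model=128, n_heads=8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_g = nn.Linear(d_model, 1, bias=False)   # token-wise gate weights w^G
        self.lin_in = nn.Linear(d_model, d_model)      # inner Linear(Z^l)
        self.lin_out = nn.Linear(d_model, d_model)     # outer Linear(...)

    def forward(self, z):                              # z: (batch, n_tokens, d_model)
        z_att, _ = self.mhsa(z, z, z)                  # Z_att^l = MHSA(Z^l)
        g = torch.sigmoid(self.w_g(z_att))             # g^l in [0,1], one scalar per token
        # (g^l ⊙ Z_att^l) ⊕ Linear(Z^l), then the outer Linear
        return self.lin_out(g * z_att + self.lin_in(z))
```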

QUESTION

The authors say the gating is meant to focus on important features by redistributing attention across tokens, but how that is actually achieved is not made very clear to me. Maybe that is just the point of the gating mechanism and I'm not getting it?

Vertical Partitioned Contrastive Learning (VPCL)

Figure: diagram of supervised and self-supervised VPCL. The top shows the self-supervised variant and the bottom the supervised variant.

Self-Supervised VPCL

Given a sample $\mathbf{x}_i = \{\mathbf{v}_i^1, \dots, \mathbf{v}_i^K\}$ with $K$ partitions, use partitions from the same sample as positives and partitions from other samples as negatives.

$$
\ell(\mathbf{X}) = - \sum_{i=1}^{B} \sum_{k=1}^{K} \sum_{k'=1}^{K} \log \frac{\exp \phi(\mathbf{v}_i^k, \mathbf{v}_i^{k'})}{\sum_{j=1}^{B} \sum_{k^{\dagger}=1}^{K} \exp \phi(\mathbf{v}_i^k, \mathbf{v}_j^{k^{\dagger}})}
$$

where $B$ is the batch size and $\phi$ is the cosine similarity of the [CLS] embeddings of the partitions.

INTUITION

We want the similarity of partitions from the same sample to be high ($\exp \phi(\mathbf{v}_i^k, \mathbf{v}_i^{k'})$ in the numerator), while keeping the similarity of partitions from different samples low ($\exp \phi(\mathbf{v}_i^k, \mathbf{v}_j^{k^{\dagger}})$ in the denominator). I think the authors forgot to include a $j \neq i$ constraint in the denominator sum.
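
A minimal sketch of this loss, assuming `z` already holds the [CLS] embeddings of every vertical partition with shape `(B, K, d)` and that there is no temperature scaling (the formula above doesn't show one); the function and variable names are illustrative:

```python
# Sketch only: self-supervised VPCL loss as written in the formula above.
import torch
import torch.nn.functional as F

def self_supervised_vpcl(z):
    """z: (B, K, d) [CLS] embeddings of the K vertical partitions of B samples."""
    B, K, d = z.shape
    flat = F.normalize(z.reshape(B * K, d), dim=-1)           # unit vectors, so dot product = cosine sim
    sim = flat @ flat.T                                        # pairwise phi, shape (B*K, B*K)
    sample_id = torch.arange(B, device=z.device).repeat_interleave(K)
    pos = (sample_id.unsqueeze(0) == sample_id.unsqueeze(1)).float()  # positives: same sample (incl. k = k')
    log_denom = torch.logsumexp(sim, dim=1, keepdim=True)      # sum over all j and k†, as the formula states
    log_prob = sim - log_denom
    return -(log_prob * pos).sum()
```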

Supervised VPCL

MOTIVATION

The authors argue that pre-training with task(dataset)-specific classifier heads and a supervised loss is inadequate for transferability, as it may bias the encoder toward the dominant tasks and classes.

To address this, the authors propose a supervised contrastive-learning scheme inspired by Khosla et al.,¹ called Vertical Partitioned Contrastive Learning (VPCL).

$$
\ell(\mathbf{X}, \mathbf{y}) = - \sum_{i=1}^{B} \sum_{j=1}^{B} \sum_{k=1}^{K} \sum_{k'=1}^{K} \mathbf{1}\{y_j = y_i\} \log \frac{\exp \phi(\mathbf{v}_i^k, \mathbf{v}_j^{k'})}{\sum_{j^{\dagger}=1}^{B} \sum_{k^{\dagger}=1}^{K} \mathbf{1}\{y_{j^{\dagger}} \neq y_i\} \exp \phi(\mathbf{v}_i^k, \mathbf{v}_{j^{\dagger}}^{k^{\dagger}})}
$$

where $\mathbf{y} = \{y_i\}_{i=1}^{B}$ are the batch labels and $\mathbf{1}\{\cdot\}$ is an indicator function (so any terms that do not meet the condition are zeroed out).

INTUITION

We want the similarity of partitions with the same label to be high ($\exp \phi(\mathbf{v}_i^k, \mathbf{v}_j^{k'})$ in the numerator), while keeping the similarity of partitions with different labels low ($\exp \phi(\mathbf{v}_i^k, \mathbf{v}_{j^{\dagger}}^{k^{\dagger}})$ in the denominator).
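
A minimal sketch of the supervised variant under the same assumptions as the self-supervised sketch (`z` of shape `(B, K, d)`, no temperature), with `y` holding the batch labels of shape `(B,)`; the denominator mask implements the $\mathbf{1}\{y_{j^{\dagger}} \neq y_i\}$ term:

```python
# Sketch only: supervised VPCL loss as written in the formula above.
import torch
import torch.nn.functional as F

def supervised_vpcl(z, y):
    """z: (B, K, d) [CLS] partition embeddings, y: (B,) integer class labels."""
    B, K, d = z.shape
    flat = F.normalize(z.reshape(B * K, d), dim=-1)
    sim = flat @ flat.T                                        # pairwise cosine similarities
    labels = y.repeat_interleave(K)                            # label of each partition
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # 1{y_j = y_i}
    # Denominator sums only over partitions whose label differs from y_i.
    log_denom = torch.logsumexp(sim.masked_fill(same, float("-inf")), dim=1, keepdim=True)
    log_prob = sim - log_denom
    return -(log_prob * same.float()).sum()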

Findings

Vanilla supervised setting

  • Logistic regression is surprisingly strong.
  • Aside from TransTab, only FT-Transformer shows competitive performance.
  • TransTab is better than FT-Transformer in most cases.

Incremental columns

SETTING

The new data includes previously unseen columns. Baselines would need to either drop the old data or ignore the new columns.

  • TransTab outperforms.

Transfer learning

SETTING

Split each dataset into two sets with some overlapping columns. Baselines train and test on just one set; TransTab is pre-trained on one set and fine-tuned/tested on the other.

  • TransTab outperforms.
    • But not in all cases anymore! Why does XGB do better than TransTab on one subset? Is this splitting unintentionally doing some feature selection??

Zero-shot inference

SETTING

Split each dataset into three sets with no overlapping columns. TransTab is trained on two of the sets and tested on the other for zero-shot inference. Compare against (a) training/testing on just one set and (b) pre-training on two sets, then fine-tuning/testing on the last set.

  • Transfer shows the best performance.
  • Somehow zero-shot is better than supervised!
    • Presumably because two sets' worth of columns carries more information than a single set?

VPCL vs. Vanilla supervised pre-training vs. Vanilla self-supervised pre-training

SETTING

Compare vanilla supervised training, transferring with vanilla supervised pre-training, and VPCL (both self-supervised and supervised variants).

  • Self-supervised VPCL shows better performance than the supervised variant as the number of partitions increases.
  • Vanilla supervised transferring can sometimes even hurt performance (worse than the vanilla single-dataset setting).
    • Only on 5 datasets though.

QUESTION

Is supervised transfer the same as pre-training with a supervised loss on the other datasets? Or is the model already pre-trained (maybe with VPCL) and then fine-tuned/transferred in the vanilla supervised setting?

Resources

Footnotes

  1. Khosla, Prannay, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. "Supervised contrastive learning." Advances in neural information processing systems 33 (2020): 18661-18673.