Summary

Deep learning on tabular data faces three challenges:

  1. rotational variance -- each tabular feature carries its own meaning, so a model should not be invariant to rotations of the feature space, yet standard DNNs are.
  2. large data demand -- DNNs have a larger hypothesis space, and require more training data compared to shallow algorithms.
  3. over-smooth solution -- DNNs tend to produce overly smooth solutions, i.e., when faced with irregular decision boundaries, the learning algorithms suffer (as pointed out by Grinsztajn et al.).

Approach

Overview

Overview of approach

The approach has three main components: Semi-Permeable Attention (SPA), interpolation-based data augmentation (FEAT-mix and HID-mix), and Attentive FFNs.

Semi-Permeable Attention

The authors propose to add a mask to the attention score matrix such that the less important features do not influence more important features, but the more important features can influence the less important features.

$$z' = \text{softmax}\left(\cfrac{(z W_q) (z W_k)^T \underline{\oplus M}}{\sqrt{d}}\right) (z W_v)$$

where $\oplus$ denotes element-wise addition and the underlined $\oplus M$ is the proposed change to vanilla MHSA. $M \in \mathbb{R}^{f \times f}$ is a fixed mask, where

$$M[i,j] = \begin{cases} -\infty & I(\mathbf{f}_i) > I(\mathbf{f}_j) \\ 0 & I(\mathbf{f}_i) \leq I(\mathbf{f}_j) \end{cases}$$

where $I(\mathbf{f}_i)$ is the importance of the $i$-th feature. In other words, the mask lets less informative features use information from more informative features (the $0$ case), while the opposite direction is blocked (the $-\infty$ case).
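To make the masking rule concrete, here is a minimal NumPy sketch of a single attention head with the mask added to the scores. It is only an illustration: the function names `spa_mask` and `semi_permeable_attention`, the single-head setup, the random weights, and the example importance scores are all assumptions, not the authors' implementation.

```python
import numpy as np

def spa_mask(importance):
    """Hypothetical helper: M[i, j] = -inf if I(f_i) > I(f_j), else 0."""
    imp = np.asarray(importance, dtype=float)
    return np.where(imp[:, None] > imp[None, :], -np.inf, 0.0)

def semi_permeable_attention(z, W_q, W_k, W_v, importance):
    """Single-head attention with the mask added element-wise to the scores."""
    d = W_q.shape[1]
    scores = ((z @ W_q) @ (z @ W_k).T + spa_mask(importance)) / np.sqrt(d)
    # Subtract the row-wise max for numerical stability; the diagonal of the
    # mask is always 0, so every row has at least one finite score.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                      # exp(-inf) == 0
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ (z @ W_v), weights

rng = np.random.default_rng(0)
f, d = 3, 4                                       # 3 features, embedding size 4
z = rng.normal(size=(f, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
importance = [0.1, 0.5, 0.9]                      # made-up importance scores
out, weights = semi_permeable_attention(z, W_q, W_k, W_v, importance)
```

In this toy setup the most informative feature (row 2) can attend only to itself, while the least informative one (row 0) can attend to all three features.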

Interpolation-based data augmentation

[Figure: illustration of HID-mix and FEAT-mix]

HID-mix operates on the embedding level, while FEAT-mix operates on the feature level.

HID-mix

Given two samples $z_1^{(0)}, z_2^{(0)} \in \mathbb{R}^{f \times d}$ and their labels $y_1, y_2$, a new sample can be formed by mixing the embedding dimensions of $z_1^{(0)}$ and $z_2^{(0)}$:

$$\begin{gather} z_m^{(0)} = S_H \odot z_1^{(0)} + (\mathbb{1}-S_H) \odot z_2^{(0)} \\ y_m = \lambda_H y_1 + (1-\lambda_H) y_2 \end{gather}$$

where $S_H \in \{0,1\}^{f \times d}$ is a stack of identical binary masks $s_h$: $S_H = [s_h, s_h, \ldots, s_h]^T$, where $\sum s_h = \left\lfloor \lambda_H \cdot d \right\rfloor$ for each row vector $s_h$, and $\mathbb{1}$ is an $f \times d$ matrix of $1$s. In other words, $S_H$ selects the same $\left\lfloor \lambda_H \cdot d \right\rfloor$ embedding dimensions of every row from $z_1^{(0)}$ and takes the remaining dimensions from $z_2^{(0)}$.
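The mixing rule above can be sketched in a few lines of NumPy. This is a rough illustration under assumptions: the function name `hid_mix` is hypothetical, $\lambda_H$ is passed in as a plain argument (how it is sampled is not covered here), and the mask positions are drawn uniformly at random.

```python
import numpy as np

def hid_mix(z1, z2, y1, y2, lam, rng):
    """Mix two (f, d) embedding matrices along the embedding dimension."""
    f, d = z1.shape
    k = int(np.floor(lam * d))                 # dimensions taken from z1
    s_h = np.zeros(d)
    s_h[rng.choice(d, size=k, replace=False)] = 1.0
    S_H = np.tile(s_h, (f, 1))                 # the same mask for every feature row
    z_m = S_H * z1 + (1.0 - S_H) * z2
    y_m = lam * y1 + (1.0 - lam) * y2
    return z_m, y_m, S_H

rng = np.random.default_rng(0)
z1, z2 = np.ones((3, 8)), np.zeros((3, 8))     # toy embeddings: f = 3, d = 8
z_m, y_m, S_H = hid_mix(z1, z2, y1=1.0, y2=0.0, lam=0.5, rng=rng)
```

Because `z1` is all ones and `z2` all zeros here, `z_m` is simply the mask itself, which makes it easy to see that every row receives the same 4 of 8 dimensions from the first sample.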

INTUITION

Since each embedding element is projected from a scalar feature value, we can consider each embedding dimension as a distinct "profile" version of the input data. Thus, HID-mix regularizes the classifier to behave like a bagging predictor.

FEAT-mix

Instead of mixing the embeddings, FEAT-mix mixes the raw features. Given two samples $x_1, x_2 \in \mathbb{R}^{f}$ and their labels $y_1, y_2$, a new sample can be formed by mixing the features of $x_1$ and $x_2$:

$$\begin{gather} x_m = s_F \odot x_1 + (\mathbb{1}_F-s_F) \odot x_2 \\ y_m = \Lambda y_1 + (1-\Lambda) y_2 \end{gather}$$

where $s_F \in \{0,1\}^{f}$ is a binary mask vector with $\sum s_F = \left\lfloor \lambda_F \cdot f \right\rfloor$, $\mathbb{1}_F$ is an $f$-dimensional vector of $1$s, and $\Lambda$ is a scalar.

If we set $\Lambda = \lambda_F$, this is equivalent to CutMix [1].

To differentiate, the authors introduce the usage of feature importance in the label weighting as follows:

$$\Lambda = \cfrac{\sum_{i:\, s_F^{(i)} = 1} I(\mathbf{f}_i)}{\sum_{i=1}^{f} I(\mathbf{f}_i)}$$

where $s_F^{(i)}$ is the $i$-th element of $s_F$ and $I(\mathbf{f}_i)$ is the importance of the $i$-th feature. As in the SPA module, the authors appear to use mutual information as the importance measure.
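As a rough sketch of FEAT-mix with the importance-weighted label coefficient, the NumPy snippet below swaps a random subset of features between two samples and weights the labels by the share of total importance that came from the first sample. The function name `feat_mix`, the way `lam` is supplied, and the importance values are assumptions made for illustration only.

```python
import numpy as np

def feat_mix(x1, x2, y1, y2, lam, importance, rng):
    """Mix two length-f feature vectors; weight labels by feature importance."""
    f = x1.shape[0]
    k = int(np.floor(lam * f))                 # features taken from x1
    s_F = np.zeros(f)
    s_F[rng.choice(f, size=k, replace=False)] = 1.0
    x_m = s_F * x1 + (1.0 - s_F) * x2
    imp = np.asarray(importance, dtype=float)
    # Share of total importance carried by the features taken from x1.
    Lam = float((s_F * imp).sum() / imp.sum())
    y_m = Lam * y1 + (1.0 - Lam) * y2
    return x_m, y_m, s_F, Lam

rng = np.random.default_rng(0)
x1, x2 = np.ones(4), np.zeros(4)
importance = np.array([0.7, 0.1, 0.1, 0.1])    # made-up importance scores
x_m, y_m, s_F, Lam = feat_mix(x1, x2, 1.0, 0.0, lam=0.5,
                              importance=importance, rng=rng)

# With uniform importance the label weight falls back to floor(lam * f) / f.
_, _, _, Lam_uniform = feat_mix(x1, x2, 1.0, 0.0, lam=0.5,
                                importance=np.ones(4), rng=rng)
```

With uniform importance the label weight depends only on how many features were swapped, which matches the CutMix-style weighting up to rounding. With non-uniform importance, the label of the sample that contributed the more informative features gets the larger weight.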

INTUITION

Since each feature may contribute differently to the label, weighting the two labels by how much "usefulness" each sample contributed allows uninformative features to be filtered out.

Attentive FFNs

Finally, the authors propose to replace the 2-layer FFN module at the end of the transformer block with a 2-layer Gated Linear Unit (GLU) module:

$$z' = \text{tanh}(\text{Linear}_1(z)) \odot \text{Linear}_2(z)$$

where $\odot$ is element-wise multiplication and the first term acts as the gate.

In addition, the authors replace the linear embedding layer, $z_i = \mathbf{f}_i W_{i,1} + b_{i,1}$, with a similar GLU setup: $z_i = \text{tanh}(\mathbf{f}_i W_{i,1} + b_{i,1}) \odot (\mathbf{f}_i W_{i,2} + b_{i,2})$.
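Both gated layers can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the names `attentive_ffn` and `glu_embedding` are hypothetical, the hidden width is assumed equal to the embedding size `d`, and the embedding is assumed to apply the second bias inside the gated branch, mirroring the FFN form.

```python
import numpy as np

def linear(x, W, b):
    return x @ W + b

def attentive_ffn(z, W1, b1, W2, b2):
    """z' = tanh(Linear_1(z)) * Linear_2(z); the tanh branch is the gate."""
    return np.tanh(linear(z, W1, b1)) * linear(z, W2, b2)

def glu_embedding(x, W1, b1, W2, b2):
    """Embed f scalar features into an (f, d) matrix with a tanh gate.

    x has shape (f,); row i of each (f, d) parameter belongs to feature i.
    """
    x = x[:, None]                              # (f, 1), broadcasts over d
    return np.tanh(x * W1 + b1) * (x * W2 + b2)

rng = np.random.default_rng(0)
f, d = 3, 4
z = rng.normal(size=(f, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
b1, b2 = np.zeros(d), np.zeros(d)
z_out = attentive_ffn(z, W1, b1, W2, b2)        # shape (f, d)

x = rng.normal(size=f)                          # one sample with f scalar features
E1, E2 = rng.normal(size=(f, d)), rng.normal(size=(f, d))
c1, c2 = np.zeros((f, d)), np.zeros((f, d))
z0 = glu_embedding(x, E1, c1, E2, c2)           # shape (f, d)
```

Since the gate is bounded in $(-1, 1)$, it can only shrink or flip the sign of the linear branch, never amplify it.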

However, the motivation for this change is not clearly explained.

Findings

  • ExcelFormer works well on both small and large datasets!
    • While other models need HPO to be competitive, ExcelFormer is competitive without HPO.
  • Both data augmentation methods are effective.

Footnotes

  1. Yun, Sangdoo, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. "CutMix: Regularization strategy to train strong classifiers with localizable features." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032. 2019.