Can a Deep Learning Model be a Sure Bet for Tabular Prediction?
Summary
Deep learning on tabular data faces three challenges:
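- rotational variance -- each column carries its own meaning, so a tabular model should not be invariant to rotations (arbitrary linear mixings) of the feature space, yet standard DNNs are.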
- large data demand -- DNNs have a larger hypothesis space, and require more training data compared to shallow algorithms.
- over-smooth solution -- DNNs tend to produce overly smooth solutions, i.e., they struggle when the target function has irregular decision boundaries (as pointed out by Grinsztajn et al.).
Approach
Overview of approach
The approach has three main components: Semi-Permeable Attention, interpolation-based data augmentation (FEAT-mix and HID-mix), and attentive FFNs.
Semi-Permeable Attention
The authors propose Semi-Permeable Attention (SPA), which adds a mask to the attention score matrix so that less important features cannot influence more important ones, while more important features can still influence less important ones.
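\begin{gather} z' = \mathrm{softmax}\left(\frac{(z W_q)(z W_k)^\top \oplus M}{\sqrt{d}}\right)(z W_v) \end{gather}
where $\oplus$ denotes element-wise addition and $M$ is the proposed change to vanilla MHSA. $M \in \mathbb{R}^{f \times f}$ is a fixed mask, where
\begin{gather} M[i, j] = \begin{cases} -\infty & \text{if } I(f_i) > I(f_j) \\ 0 & \text{otherwise} \end{cases} \end{gather}
where $I(f_i)$ is the importance of the $i$-th feature. In other words, this term means that a less informative feature may use information from more informative features (case $0$), but the opposite direction is blocked (case $-\infty$).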
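The masking rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the importance scores are made-up numbers, the function names `semi_permeable_mask` and `masked_softmax_rows` are hypothetical, and the sketch assumes that blocked pairs receive $-\infty$ and that the mask is added to the raw scores before scaling and softmax.

```python
import math

def semi_permeable_mask(importance):
    """M[i][j] = -inf if feature i is more important than feature j, else 0."""
    f = len(importance)
    return [[-math.inf if importance[i] > importance[j] else 0.0
             for j in range(f)]
            for i in range(f)]

def masked_softmax_rows(scores, mask, d):
    """Row-wise softmax((scores + mask) / sqrt(d))."""
    out = []
    for s_row, m_row in zip(scores, mask):
        logits = [(s + m) / math.sqrt(d) for s, m in zip(s_row, m_row)]
        top = max(logits)  # the diagonal is never masked, so this is finite
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

importance = [0.9, 0.5, 0.1]            # hypothetical importance scores
mask = semi_permeable_mask(importance)
scores = [[0.0] * 3 for _ in range(3)]  # uniform raw attention scores
attn = masked_softmax_rows(scores, mask, d=4)
# attn[0] == [1.0, 0.0, 0.0]  (most important feature only sees itself)
# attn[1] == [0.5, 0.5, 0.0]  (sees itself and the more important feature)
```

Row `i` of `attn` describes which features the `i`-th feature reads from, so the least important feature (last row) attends to everything while the most important one is shielded from the rest.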
Interpolation-based data-augmentation
Illustration of HID-mix and FEAT-mix
HID-mix operates on the embedding level, while FEAT-mix operates on the feature level.
HID-mix
Given the embeddings $z_1^{(0)}, z_2^{(0)}$ of two samples and their labels $y_1, y_2$, a new sample $(z_m^{(0)}, y_m)$ can be formed by mixing the embedding dimensions of $z_1^{(0)}$ and $z_2^{(0)}$:
\begin{gather} z_m^{(0)} = S_H \odot z_1^{(0)} + (\mathbb{1}-S_H) \odot z_2^{(0)}\\ y_m = \lambda_H y_1 + (1-\lambda_H) y_2 \end{gather}
where $S_H$ is a stack of binary masks $s_h$, where $\sum s_h = \left\lfloor \lambda_H \cdot d \right\rfloor$ for each row vector $s_h$, $d$ is the embedding dimension, and $\mathbb{1}$ is a matrix of $1$s. In other words, $S_H$ keeps $\left\lfloor \lambda_H \cdot d \right\rfloor$ entries of each row from $z_1^{(0)}$ and takes the remaining entries from $z_2^{(0)}$.
INTUITION
Since each embedding element is projected from a scalar feature value, each embedding dimension can be viewed as a distinct "profile" of the input data. HID-mix thus regularizes the classifier to behave like a bagging predictor.
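The HID-mix recipe can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: `hid_mix` is a hypothetical helper, embeddings are plain lists of rows (one row per feature), the mixing ratio `lam` is passed in rather than sampled, and the same row mask is reused for every feature, which is one way to read "a stack of binary masks".

```python
import math
import random

def hid_mix(z1, z2, y1, y2, lam, rng):
    """Mix two f x d embedding matrices (lists of rows) along the embedding axis."""
    d = len(z1[0])
    k = math.floor(lam * d)              # number of dimensions taken from z1
    keep = set(rng.sample(range(d), k))
    s_h = [1 if j in keep else 0 for j in range(d)]
    # Assumption: the same row mask s_h is stacked for every feature row of S_H.
    z_m = [[a if s else b for a, b, s in zip(row1, row2, s_h)]
           for row1, row2 in zip(z1, z2)]
    y_m = lam * y1 + (1 - lam) * y2      # labels mixed with the same ratio
    return z_m, y_m, s_h

rng = random.Random(0)
z1 = [[1.0] * 4 for _ in range(3)]       # f = 3 features, d = 4 embedding dims
z2 = [[0.0] * 4 for _ in range(3)]
z_m, y_m, s_h = hid_mix(z1, z2, y1=1.0, y2=0.0, lam=0.5, rng=rng)
# Each row of z_m has exactly 2 entries from z1 and 2 from z2; y_m == 0.5
```

With `z1` all ones and `z2` all zeros, every row of `z_m` is simply a copy of the mask, which makes it easy to see which embedding dimensions came from which sample.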
FEAT-mix
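Instead of mixing the embeddings, FEAT-mix mixes the features. Given two samples $x_1, x_2$ and their labels $y_1, y_2$, a new sample $(x_m, y_m)$ can be formed by mixing the features of $x_1$ and $x_2$:
\begin{gather} x_m = s_F \odot x_1 + (\mathbb{1}-s_F) \odot x_2\\ y_m = \Lambda y_1 + (1-\Lambda) y_2 \end{gather}
where $s_F$ is a binary mask vector with $\sum s_F = \left\lfloor \lambda_F \cdot f \right\rfloor$, $\mathbb{1}$ is an $f$-dimensional vector of $1$s, and $\Lambda$ is a scalar.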
If we set $\Lambda = \lambda_F$, this is equivalent to CutMix [1].
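To differentiate FEAT-mix from CutMix, the authors use feature importance in the label weighting as follows:
\begin{gather} \Lambda = \frac{\sum_{i:\, s_F^{(i)} = 1} I(f_i)}{\sum_{i=1}^{f} I(f_i)} \end{gather}
where $s_F^{(i)}$ is the $i$-th element of $s_F$, and $I(f_i)$ is the importance of the $i$-th feature. As in the SPA module, the authors appear to use mutual information as the importance measure.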
INTUITION
Since each feature may contribute differently to the label, weighting the two labels by how much "usefulness" each sample contributed allows uninformative features to be filtered out.
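To make the importance-weighted label mixing concrete, here is a small sketch. It assumes the label weight is the share of total feature importance carried by the features taken from the first sample; `random_mask` and `feat_mix` are hypothetical helper names, and the importance values are made up (in practice they might come from a mutual-information estimate).

```python
import math
import random

def random_mask(f, lam, rng):
    """Binary mask of length f with floor(lam * f) ones at random positions."""
    k = math.floor(lam * f)
    keep = set(rng.sample(range(f), k))
    return [1 if i in keep else 0 for i in range(f)]

def feat_mix(x1, x2, y1, y2, s_f, importance):
    """Swap features according to s_f; weight labels by the importance kept from x1."""
    x_m = [a if s else b for a, b, s in zip(x1, x2, s_f)]
    # Assumed form of the label weight: importance share of the features from x1.
    weight = sum(imp for imp, s in zip(importance, s_f) if s) / sum(importance)
    y_m = weight * y1 + (1 - weight) * y2
    return x_m, y_m, weight

importance = [0.5, 0.25, 0.25]           # hypothetical importance scores
s_f = [1, 0, 0]                          # keep only the first feature of x1
x_m, y_m, weight = feat_mix([1.0, 2.0, 3.0], [10.0, 20.0, 30.0],
                            1.0, 0.0, s_f, importance)
# x_m == [1.0, 20.0, 30.0]; weight == 0.5, whereas a count-based weight would be 1/3

mask = random_mask(3, 0.4, random.Random(0))   # floor(0.4 * 3) = 1 feature from x1
```

In this example only one of three features comes from the first sample, but because that feature carries half of the total importance, the first label also receives half of the weight.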
Attentive FFNs
Finally, the authors propose replacing the 2-layer FFN module at the end of the transformer block with a 2-layer Gated Linear Unit (GLU) module:
\begin{gather} z' = \tanh\left(\mathrm{Linear}_1(z)\right) \odot \mathrm{Linear}_2(z) \end{gather}
where $\odot$ is element-wise multiplication and the first term acts as the gate.
In addition, the authors replace the linear embedding layer with a similar GLU setup, changing it from $z_i = f_i W_i + b_i$ to $z_i = \tanh(f_i W_{i,1} + b_{i,1}) \odot (f_i W_{i,2} + b_{i,2})$.
However, the motivation for this change is not clearly explained.
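One way to build intuition for the gating is to run a tiny gated unit by hand. The sketch below is only illustrative: `linear` and `gated_unit` are hypothetical helpers operating on a single token, the weights are hand-picked, and the use of `tanh` as the gate activation is an assumption of this sketch.

```python
import math

def linear(x, weight, bias):
    """y = x @ weight + bias for one vector x; weight has shape (in, out)."""
    return [sum(xi * w for xi, w in zip(x, col)) + b
            for col, b in zip(zip(*weight), bias)]

def gated_unit(x, w_gate, b_gate, w_value, b_value):
    """Element-wise product of a tanh gate branch and a linear value branch."""
    gate = [math.tanh(v) for v in linear(x, w_gate, b_gate)]
    value = linear(x, w_value, b_value)
    return [g * v for g, v in zip(gate, value)]

x = [1.0, 1.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
b_value = [0.5, 0.5]                     # value branch outputs [1.5, 1.5]

# Gate pre-activation 0 -> tanh(0) = 0 -> the value branch is fully suppressed.
closed = gated_unit(x, zeros, [0.0, 0.0], identity, b_value)
# Large gate pre-activation -> tanh saturates near 1 -> the value passes through.
opened = gated_unit(x, zeros, [20.0, 20.0], identity, b_value)
```

The gate branch decides, per output dimension, how much of the value branch survives, which is presumably why the module is described as "attentive".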
Findings
- ExcelFormer works well on both small and large datasets!
- While other models need hyperparameter optimization (HPO) to be competitive, ExcelFormer is competitive without HPO.
- Both data augmentation methods are effective.
Footnotes
1. Yun, Sangdoo, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. "CutMix: Regularization strategy to train strong classifiers with localizable features." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023-6032. 2019.