TableGPT2: A Large Multimodal Model with Tabular Data Integration
Summary
A summary of TableGPT2, a fine-tuned, Qwen2-based model for native tabular data processing.
Approach
Semantic Table Encoder
Similar to vision-language models, the authors use a Q-Former-style [1] adapter to align the representations of table columns with the language model.
Figure: how the tabular data is encoded directly into tokens.
The encoder is a stack of column-wise and row-wise attention blocks with no positional embeddings. Its output embeddings are passed through an adapter that combines them into a single embedding representation for each column.
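One consequence of dropping positional embeddings is that attention no longer depends on the order in which rows or columns are presented. The sketch below illustrates this with a deliberately simplified single-head self-attention in pure Python. The identity query/key/value projections and the helper name `self_attention` are assumptions made for illustration; this is not the paper's encoder.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity Q/K/V projections
    and no positional embeddings (illustrative simplification)."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product scores of this query against every input.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        # Weighted average of the inputs (values == inputs here).
        out.append([sum(w / z * v[i] for w, v in zip(weights, xs)) for i in range(d)])
    return out

# Toy "cell embeddings" for one column; shuffling the rows only shuffles the outputs.
cells = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
perm = [2, 0, 1]
original = self_attention(cells)
shuffled = self_attention([cells[i] for i in perm])
```

Here `shuffled[k]` matches `original[perm[k]]` up to floating-point error, which is the property that makes such an encoder insensitive to row and column order.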
Then the table is serialized into a hybrid form of both text and embeddings:
table tab_name, columns=[tab_name.col_name(<col_emb>|dtype|if_primary_key)|[values]]
where <col_emb> is replaced with the embeddings from the adapter.
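The hybrid serialization can be sketched as a small string builder. This is a minimal sketch that follows the template above; the `Column` dataclass, the `serialize_table` function, and the `max_values` cutoff are hypothetical names introduced here, and the `<col_emb>` placeholder is left as text for a later stage to swap with adapter embeddings.

```python
from dataclasses import dataclass, field
from typing import List

COL_EMB_PLACEHOLDER = "<col_emb>"  # replaced by adapter embeddings downstream

@dataclass
class Column:
    # Hypothetical container for one column's metadata and sample values.
    name: str
    dtype: str
    is_primary_key: bool = False
    values: list = field(default_factory=list)

def serialize_table(tab_name: str, columns: List[Column], max_values: int = 3) -> str:
    """Render a table as text with one embedding placeholder per column."""
    parts = []
    for col in columns:
        # Keep only a few sample values (the cutoff is an assumption).
        sample = ", ".join(str(v) for v in col.values[:max_values])
        parts.append(
            f"{tab_name}.{col.name}"
            f"({COL_EMB_PLACEHOLDER}|{col.dtype}|{col.is_primary_key})|[{sample}]"
        )
    return f"table {tab_name}, columns=[{', '.join(parts)}]"

serialized = serialize_table(
    "orders",
    [Column("id", "int", True, [1, 2, 3, 4]), Column("city", "str", False, ["Oslo"])],
)
```

Keeping the placeholder as a literal token means the text part can be tokenized normally, with each `<col_emb>` position later filled by that column's embeddings.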
Encoder Pre-Training
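The encoder is pre-trained with a contrastive loss on the column embeddings. A table is split into two snapshots $S_1$ and $S_2$, and the encoder learns to pull together matching columns from $S_1$ and $S_2$ while pushing apart the rest. The loss takes the standard InfoNCE form:
$$\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(c, c^{+}) / \tau\right)}{\sum_{c' \in \mathcal{C}} \exp\left(\mathrm{sim}(c, c') / \tau\right)}$$
where $\tau$ is the temperature, $(c, c^{+})$ is the positive pair (embeddings of the same column in the two snapshots), $\mathcal{C}$ is the pool of column embeddings from both $S_1$ and $S_2$, and $\mathrm{sim}$ is a similarity function such as cosine similarity.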
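To make the contrastive objective concrete, here is a minimal pure-Python sketch of an InfoNCE-style loss over column embeddings from two snapshots. Several details are assumptions rather than the paper's specification: cosine similarity as the similarity function, the default temperature of 0.1, excluding each anchor from its own candidate pool, and the function names `cosine` and `column_contrastive_loss`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def column_contrastive_loss(snap_a, snap_b, temperature=0.1):
    """Mean InfoNCE-style loss; snap_a[i] and snap_b[i] embed the same column."""
    pool = [("a", i, e) for i, e in enumerate(snap_a)]
    pool += [("b", i, e) for i, e in enumerate(snap_b)]
    losses = []
    for side, i, anchor in pool:
        # The positive is the same column index in the other snapshot.
        positive_key = ("b" if side == "a" else "a", i)
        logits, pos_logit = [], None
        for other_side, j, emb in pool:
            if (other_side, j) == (side, i):
                continue  # assumption: the anchor is not its own candidate
            logit = cosine(anchor, emb) / temperature
            logits.append(logit)
            if (other_side, j) == positive_key:
                pos_logit = logit
        # log-sum-exp over all candidates, computed stably.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - pos_logit)
    return sum(losses) / len(losses)

snapshot_a = [[1.0, 0.0], [0.0, 1.0]]
aligned_b = [[0.9, 0.1], [0.1, 0.9]]      # same columns, slightly perturbed
misaligned_b = [[0.1, 0.9], [0.9, 0.1]]   # columns swapped
```

With these toy inputs, `column_contrastive_loss(snapshot_a, aligned_b)` is much smaller than the loss for `misaligned_b`, which is the behavior the pre-training objective rewards.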
Encoder/Adapter Joint Training
To align the encoder's embeddings to the language model, both the encoder and the adapter are trained jointly on the following tasks:
Synthetically created tasks:
- column prediction
- cell prediction
Existing datasets (FeTaQA, WikiTableQuestions, ToTTo) are modified to create more tasks:
- question generation
- table titling
- row summarization
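The summary only names the synthetic tasks, so the following is a guess at how column prediction and cell prediction examples could be built from a raw table: hide one header or one cell and keep the hidden value as the target. The `<MASK>` token, the dictionary layout, and both function names are hypothetical.

```python
import random

MASK = "<MASK>"  # hypothetical mask token

def make_cell_prediction_example(header, rows, rng):
    """Hide one randomly chosen cell; the model should recover its value."""
    r = rng.randrange(len(rows))
    c = rng.randrange(len(header))
    masked = [list(row) for row in rows]  # copy so the source table is untouched
    target = masked[r][c]
    masked[r][c] = MASK
    return {"task": "cell_prediction", "header": list(header),
            "rows": masked, "target": target}

def make_column_prediction_example(header, rows, rng):
    """Hide one randomly chosen column name; the model should recover it."""
    c = rng.randrange(len(header))
    hidden = list(header)
    target = hidden[c]
    hidden[c] = MASK
    return {"task": "column_prediction", "header": hidden,
            "rows": [list(row) for row in rows], "target": target}

rng = random.Random(0)  # seeded for reproducibility
header = ["city", "population"]
rows = [["Oslo", 700000], ["Bergen", 290000]]
cell_example = make_cell_prediction_example(header, rows, rng)
column_example = make_column_prediction_example(header, rows, rng)
```

Because the targets come from the table itself, examples like these need no human labeling, which is presumably why such tasks are attractive for aligning the encoder and adapter.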
Data Collection
The authors collect two types of data for training TableGPT2.
- Text data for continual pre-training of the Qwen2 base model.
- Tabular data for supervised fine-tuning of the table encoder/adapter and language model.
Type | Description |
---|---|
Database | Multi-table setting; large and numerical |
Web Page | Simple tables with contextual text data |
Excel | Structured data from government, finance, etc. |
Academic Task | Data used in research, often suitable for TableQA or NL2SQL |
Special Format | Specific formats such as invoices and bills |
Pre-test Task | Forecasting, prediction projects, etc. |
Types of tables gathered
Using these tables, the authors prompt larger models (GPT-4o, LLaMA, etc.) to generate queries of broadly two types:
- Single-turn
- Multi-turn: the user may ask the model to improve its previous response or ask additional questions.
Once the queries are generated and vetted by human annotators, the larger models are again used to generate the answers to these queries. The authors employ a strategy they call synthesize and refine to generate higher-quality responses using these larger LLMs.
Agent Framework
Figure: core pipeline of an agent framework using TableGPT2.
The authors propose an agent framework built around TableGPT2. The tabular agent can generate code to parse and retrieve tables in order to complete its task.
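The generate-then-execute idea can be sketched as a small loop. Everything here is hypothetical scaffolding: `fake_model` stands in for a call to TableGPT2, the table is a plain list of dicts, and the retry-with-error-feedback step is an assumption rather than something the summary describes. The `exec` call runs unsandboxed and is only suitable for a toy example.

```python
def fake_model(question, table_schema):
    """Hypothetical stand-in for the model: returns Python code as a string."""
    return "result = sum(row['amount'] for row in table)"

def run_table_agent(question, table, generate_code=fake_model, max_retries=2):
    """Ask the model for code, run it against the table, retry on failure."""
    schema = sorted(table[0].keys()) if table else []
    feedback = ""
    for _ in range(max_retries + 1):
        code = generate_code(question + feedback, schema)
        namespace = {"table": table}
        try:
            # WARNING: unsandboxed execution; a real system would isolate this.
            exec(code, namespace)
            return namespace["result"]
        except Exception as err:
            # Feed the error back so the next attempt can correct itself.
            feedback = f"\nPrevious attempt failed with: {err!r}"
    raise RuntimeError("agent could not produce runnable code")

sales = [{"amount": 10}, {"amount": 5}]
total = run_table_agent("What is the total amount?", sales)
```

Passing the generator in as a parameter keeps the loop independent of any particular model API, so the stub can be swapped for a real inference call.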
Benchmarking
Existing Benchmarks
- Table Understanding
  - TURL: table interpretation (e.g., column type annotation, relation extraction, entity linking) and table augmentation (e.g., row population)
- Table Question Answering (TableQA)
  - WikiTableQuestions, WikiSQL: answering natural language questions based on tabular data
  - FeTaQA: free-form answers
  - HybridQA: multi-hop questions over a mix of tables and text
- Table Fact Verification
- Table-to-Text Generation (Table2Text)
- Natural Language to SQL (NL2SQL)
- Holistic
  - TableBench: assesses reasoning capabilities over tables
RealTabBench
The authors argue that most existing benchmarks are too simple and fail to capture real-life scenarios. To remedy this, they propose a new tabular benchmark for language models with the following assets:
- 360 real-life tables from business intelligence scenarios.
- 6,000 queries.
Human reviewers and LLMs are combined to produce the final metric.
Findings
- TableGPT2 outperforms other models on RealTabBench.
- It also retains most of its general-task (non-table) capabilities.
The authors exclude TableLlama [2] because it loses general-task capabilities.
Resources
Footnotes
1. Li, Junnan, Dongxu Li, Silvio Savarese, and Steven Hoi. "BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models." In International Conference on Machine Learning, pp. 19730-19742. PMLR, 2023.
2. Zhang, Tianshu, Xiang Yue, Yifei Li, and Huan Sun. "TableLlama: Towards open large generalist models for tables." arXiv preprint arXiv:2311.09206 (2023).