Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their proficiency in interpreting and utilizing structured data remains limited. Our investigation reveals a notable deficiency in LLMs’ ability to process structured data: for example, ChatGPT lags behind state-of-the-art (SoTA) models by an average of 35%. To augment the Structured Knowledge Grounding (SKG) capabilities of LLMs, we have constructed a comprehensive instruction-tuning dataset comprising 1.1 million examples. Utilizing this dataset, we train a series of models, referred to as StructLM, based on the CodeLlama architecture, ranging from 7B to 34B parameters. Our StructLM series surpasses task-specific models on 14 of 18 evaluated datasets and establishes new SoTA results on 7 SKG tasks. Furthermore, StructLM demonstrates exceptional generalization across 6 novel SKG tasks. Contrary to expectations, we observe that scaling model size offers marginal benefits, with StructLM-34B showing only slight improvements over StructLM-7B. This suggests that structured knowledge grounding remains a challenging task and requires more innovative design to push to a new level.
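Since StructLM is a decoder-only model in the CodeLlama family, it can be queried like any causal LM. Below is a minimal inference sketch; the hub id `TIGER-Lab/StructLM-7B`, the table linearization, and the prompt layout are assumptions for illustration, not the official released format.

```python
# Minimal inference sketch (hub id and prompt format are assumptions;
# consult the released model card for the official ones).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TIGER-Lab/StructLM-7B"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Tables are passed as linearized text; this layout is illustrative only.
table = "col : rank | player | points row 1 : 1 | A. Smith | 34 row 2 : 2 | B. Jones | 29"
prompt = (
    "Use the table below to answer the question.\n\n"
    f"{table}\n\n"
    "Question: Which player scored the most points?\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```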
Dataset | Input (avg) | Output (avg) | Train Count | Train Input (max) | Train Output (max) | Train #Trunc. | Test Count | Test Input (max) | Test Output (max) | Test #Trunc.
---|---|---|---|---|---|---|---|---|---|---
Held In | | | | | | | | | |
TabMWP | 207.8 | 4.5 | 23059 | 709 | 33 | 0 | 7686 | 703 | 31 | 0
ToTTo | 251.8 | 31.0 | 120761 | 2040 | 155 | 467 | 7700 | 2048 | 119 | 31
GrailQA | 281.0 | 44.1 | 44337 | 884 | 134 | 0 | 6463 | 546 | 123 | 0
SQL2Text | 122.3 | 18.1 | 5600 | 337 | 61 | 0 | 1034 | 245 | 38 | 0
MMQA | 656.2 | 7.7 | 15688 | 2047 | 146 | 234 | 1501 | 2048 | 94 | 11
Spider | 266.6 | 36.0 | 7000 | 1369 | 226 | 0 | 1034 | 453 | 146 | 0
KVRet | 573.4 | 17.1 | 6288 | 1217 | 161 | 0 | 807 | 1147 | 82 | 0
HybridQA | 700.4 | 6.8 | 62682 | 2047 | 91 | 200 | 3466 | 2048 | 79 | 6
SParC | 276.3 | 32.6 | 12059 | 1417 | 226 | 0 | 1625 | 467 | 146 | 0
CompWebQ | 1350.3 | 11.9 | 27639 | 2047 | 321 | 321 | 2816 | 2048 | 256 | 8
TabFact | 660.1 | 4.6 | 92283 | 2045 | 5 | 2 | 12779 | 1687 | 4 | 0
WikiTQ | 831.8 | 5.8 | 11321 | 2028 | 273 | 0 | 4344 | 2048 | 148 | 10
WikiSQL | 689.2 | 7.1 | 56355 | 2047 | 518 | 16 | 15878 | 2048 | 244 | 1
FeTaQA | 653.2 | 38.8 | 7326 | 1853 | 158 | 0 | 2003 | 1548 | 114 | 0
FEVEROUS | 799.3 | 3.4 | 40669 | 2047 | 5 | 2052 | 4285 | 2048 | 4 | 195
MultiWOZ | 777.2 | 154.5 | 56668 | 1656 | 196 | 0 | 7368 | 1344 | 185 | 0
DART | 133.7 | 30.3 | 62659 | 406 | 258 | 0 | 5097 | 261 | 109 | 0
Logic2Text | 166.1 | 26.9 | 8566 | 358 | 67 | 0 | 1092 | 347 | 60 | 0
MTOP | 961.0 | 34.4 | 15667 | 1002 | 215 | 0 | 4386 | 990 | 113 | 0
SlimOrca | 278.9 | 152.4 | 512069 | 2047 | 1808 | 0 | - | - | - | -
Held Out | | | | | | | | | |
BIRD | 439.8 | 63.3 | 9428 | 1992 | 347 | 99 | 1534 | 1214 | 386 | 0
CoSQL | 287.4 | 34.9 | 9502 | 1640 | 226 | 0 | 1300 | 535 | 190 | 0
SQA | 656.9 | 34.9 | 12275 | 1812 | 1012 | 2 | 3011 | 1725 | 769 | 0
Infotabs | 276.9 | 3.7 | 16538 | 1009 | 5 | 0 | 5400 | 1105 | 4 | 0
WikiTableText | 149.6 | 27.4 | 10000 | 313 | 97 | 0 | 2000 | 226 | 89 | 0
Finqa | 1230.3 | 21.0 | 6251 | 2040 | 72 | 186 | 1147 | 2048 | 61 | 25
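Statistics like those in the table above can be reproduced by tokenizing each example's input and output and counting how many inputs exceed the context window. The sketch below assumes CodeLlama's tokenizer and a 2048-token input cap (inferred from the Input (max) and #Trunc. columns); the actual preprocessing pipeline may differ.

```python
# Sketch of per-split length statistics, assuming lengths are measured in
# CodeLlama tokenizer tokens and inputs are truncated at a 2048-token cap
# (inferred from the table's max-length and #Trunc. columns).
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 2048
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

def length_stats(examples):
    """examples: list of {"input": str, "output": str} pairs for one split."""
    in_lens = [len(tokenizer(ex["input"]).input_ids) for ex in examples]
    out_lens = [len(tokenizer(ex["output"]).input_ids) for ex in examples]
    return {
        "count": len(examples),
        "input_avg": round(sum(in_lens) / len(in_lens), 1),
        "output_avg": round(sum(out_lens) / len(out_lens), 1),
        # over-long inputs are clipped, so the reported max never exceeds the cap
        "input_max": min(max(in_lens), MAX_INPUT_TOKENS),
        "output_max": max(out_lens),
        "num_truncated": sum(n > MAX_INPUT_TOKENS for n in in_lens),
    }
```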
Purpose | Train | Eval | FT | Result
---|---|---|---|---
Schema task transfer | Spider, SParC, Logic2Text | Logic2Text | 89.47 | 89.93
KG task transfer | CompWebQ, WebQSP, GrailQA, DART | DART | 60.28 | 60.34
Table task transfer | | | 75.46 | 80.81
Summ. data type transfer | ToTTo, DART | DART | 60.28 | 61.42
QA data type transfer | CompWebQ, WikiSQL | WikiSQL | 85.49 | 86.36
Dataset | Metric | SoTA | ChatGPT | Base-Mistral | Base | ST | UL2 | TabLLaMA | USKG | StructLM-7B-Mistral | StructLM-7B | StructLM-13B | StructLM-34B | Δ
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Size | - | - | - | 7B | 7B | 7B×18 | 20B | 7B | 3B×18 | 7B | 7B | 13B | 34B | -
Held In | | | | | | | | | | | | | |
ToTTo | BLEU | 49.9 | 20.7 | 17.9 | 17.5 | 48.8 | - | 20.7 | 49.0 | 49.8 | 49.4 | 49.3 | 50.2 | +0.3 |
GrailQA | EM | 77.1 | 9.3 | 1.5 | 1.0 | 77.0 | - | - | 70.1 | 81.2 | 80.4 | 79.2 | 82.2 | +5.1 |
SQL2Text | BLEC | 94.8 | 88.6 | 90.7 | 82.9 | 95.2 | - | - | 94.8 | 95.2 | 93.8 | 88.5 | 92.6 | +0.4
MMQA | F1 | 85.3 | 59.6 | 41.5 | 30.7 | 81.5 | - | - | 85.3 | 85.5 | 85.2 | 86.0 | 88.1 | +2.8 |
Spider | EM | 80.5 | 43.8 | 31.0 | 5.2 | 67.3 | - | - | 71.8 | 72.4 | 72.4 | 74.1 | 74.6 | -5.9 |
KVRet | All Micro | 67.9 | 52.9 | 34.4 | 39.5 | 70.9 | - | 48.7 | 67.9 | 72.2 | 72.6 | 69.5 | 69.3 | +4.7 |
HybridQA | Acc | 68.4 | 23.7 | 12.9 | 2.3 | 58.4 | 61.0 | 27.6 | 59.4 | 62.6 | 59.2 | 59.1 | 61.1 | -5.8 |
SParC | EM | 68.2 | 32.2 | 23.7 | 3.2 | 62.3 | - | - | 61.5 | 63.3 | 61.9 | 64.9 | 63.4 | -3.3 |
CompWebQ | Acc | 76.8 | 48.9 | 30.9 | 3.1 | 75.6 | 75.9 | - | 73.3 | 79.9 | 78.3 | 80.4 | 81.9 | +5.1 |
TabFact | Acc | 93.0 | 62.4 | 25.7 | 0.0 | 79.6 | 87.1 | 82.5 | 83.7 | 84.6 | 80.8 | 84.7 | 86.6 | -6.4 |
WikiTQ | All Ex | 65.9 | 24.8 | 6.7 | 0.2 | 45.7 | 54.6 | 31.6 | 49.3 | 56.8 | 50.1 | 53.4 | 55.7 | -9.1 |
WikiSQL | All Ex | 93.0 | 31.5 | 21.5 | 0.4 | 86.5 | 87.3 | 41.6 | 86.0 | 87.0 | 88.7 | 87.2 | 87.6 | -4.3 |
FeTaQA | BLEU | 39.0 | 7.4 | 13.7 | 5.6 | 33.8 | 35.8 | 39.0 | 33.4 | 37.5 | 36.0 | 35.6 | 37.5 | -1.5 |
FEVEROUS | Acc | 85.6 | 57.8 | 73.2 | 58.4 | 78.1 | 85.6 | 72.3 | 82.4 | 85.9 | 84.4 | 85.0 | 85.7 | +0.3 |
MultiWOZ | Joint Acc | 60.6 | 8.9 | 0.3 | 0.0 | 53.0 | - | - | 55.4 | 55.4 | 54.5 | 53.0 | 53.8 | -5.2 |
DART | BLEU | 52.0 | 59.0 | 47.4 | 54.6 | 60.3 | 50.4 | - | 46.7 | 63.2 | 62.2 | 61.4 | 61.8 | +11.2 |
Logic2Text | BLEC | 95.3 | 78.5 | 81.5 | 59.1 | 89.5 | - | - | 91.4 | 89.5 | 88.9 | 90.1 | 89.1 | -5.2
MTOP | EM | 87.5 | 1.4 | 0.8 | 0.0 | 77.4 | 87.5 | - | 86.8 | 75.8 | 81.2 | 81.6 | 82.1 | -5.4 |
Average | | 74.9 | 39.5 | 30.85 | 20.2 | 68.2 | - | - | 69.3 | 72.1 | 71.1 | 71.3 | 72.6 | -1.2
Held Out | | | | | | | | | | | | | |
BIRD | Acc | 36.6 | 21.8 | 11.5 | 0.0 | 24.4 | 1.0 | 4.2 | × | 22.8 | 22.3 | 22.8 | 24.7 | +2.9 |
CoSQL | EM | 58.3 | 33.7 | 26.5 | 0.2 | 52.4 | 5.1 | 5.4 | × | 52.8 | 49.8 | 52.2 | 55.0 | +21.3 |
SQA | Acc | 70.5 | 18.7 | 7.4 | 2.3 | 60.4 | 70.1 | 2.0 | × | 42.6 | 49.7 | 36.1 | 44.2 | +31.0
Infotabs | Acc | 75.6 | 46.9 | 49.1 | 40.2 | 68.7 | 70.3 | 35.5 | × | 47.2 | 55.3 | 58.1 | 61.8 | -8.5 |
WikiTableText | BLEU | 33.7 | 3.8 | 3.9 | 5.7 | 39.8 | 19.4 | 10.2 | × | 17.1 | 8.3 | 9.3 | 8.8 | -2.3 |
Finqa | Acc | 71.1 | 31.4 | 0.7 | 1.7 | 79.7 | 5.9 | 18.6 | × | 29.5 | 27.3 | 25.6 | 36.2 | +4.8 |
Average | | 57.6 | 26.1 | 16.5 | 8.4 | 54.2 | 28.6 | 12.6 | × | 35.3 | 35.5 | 34.0 | 38.4 | +8.2
Overall evaluation results of our models against other baselines. Cells with "-" in the held-in section mean that the model was not trained on that dataset, so results are not comparable. USKG models are overfit to the held-in dataset labels and thus cannot generalize comparably; their held-out cells are marked "×". Cells in the held-out section marked "*" are held-in results for that baseline. SoTA denotes the state-of-the-art score on each task, copied from the original papers for reference. ST refers to the single-task fine-tuning result of CodeLlama-Instruct-7B on each dataset. Base refers to the 1-shot performance of CodeLlama-Instruct-7B; Base-Mistral refers to the same for Mistral-7B-Instruct-v0.2. Δ refers to the difference between the best StructLM variant and the best known comparable result. All StructLM held-out results are 0-shot.
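To make the Δ column concrete: it is the gap between the best StructLM variant and the strongest comparable prior result, where the comparable set is row-dependent (held-in rows compare against the published SoTA; held-out rows exclude fine-tuned and held-in-marked baselines). A small sketch, with the comparable set passed in explicitly since it varies per row:

```python
# Sketch of the Δ column: best StructLM variant minus the strongest
# comparable prior result. The comparable set differs per row, so it is
# an explicit argument rather than derived automatically.
def delta(structlm_scores, comparable_baselines):
    return round(max(structlm_scores) - max(comparable_baselines), 1)

# Held-in example (ToTTo): best StructLM 50.2 vs. SoTA 49.9 -> 0.3
print(delta([49.8, 49.4, 49.3, 50.2], [49.9]))
# Held-out example (CoSQL): best StructLM 55.0 vs. ChatGPT 33.7 -> 21.3
# (SoTA and single-task fine-tuned results are excluded as not comparable.)
print(delta([52.8, 49.8, 52.2, 55.0], [33.7, 26.5, 0.2, 5.1, 5.4]))
```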
@misc{zhuang2024structlm,
title={StructLM: Towards Building Generalist Models for Structured Knowledge Grounding},
author={Alex Zhuang and Ge Zhang and Tianyu Zheng and Xinrun Du and Junjie Wang and Weiming Ren and Stephen W. Huang and Jie Fu and Xiang Yue and Wenhu Chen},
year={2024},
eprint={2402.16671},
archivePrefix={arXiv},
primaryClass={cs.CL}
}