StructLM: Towards Building Generalist Models for Structured Knowledge Grounding

1University of Waterloo, 2Multimodal Art Projection Research Community, 3Waseda University, 4HKUST, 5Ohio State University, 6harmony.ai, 7Vector Institute
*Alex Zhuang and Ge Zhang are core contributors to this project.

Abstract

Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their proficiency in interpreting and utilizing structured data remains limited. Our investigation reveals a notable deficiency in LLMs' ability to process structured data; e.g., ChatGPT lags behind the state-of-the-art (SoTA) model by an average of 35%. To augment the Structured Knowledge Grounding (SKG) capabilities in LLMs, we have developed a comprehensive instruction-tuning dataset comprising 1.1 million examples. Utilizing this dataset, we train a series of models, referred to as StructLM, based on the CodeLlama architecture, ranging from 7B to 34B parameters. Our StructLM series surpasses task-specific models on 14 out of 18 evaluated datasets and establishes new SoTA results on 7 SKG tasks. Furthermore, StructLM demonstrates exceptional generalization across 6 novel SKG tasks. Contrary to expectations, we observe that scaling model size offers only marginal benefits, with StructLM-34B showing slight improvements over StructLM-7B. This suggests that structured knowledge grounding remains a challenging task and requires more innovative design to push it to a new level.

Overview of StructLM

This figure illustrates the prompting structure of StructLM, highlighting its capability to process various forms of structured data beyond linearized data tables, including linearized database schemas and knowledge graphs. StructLM is also assessed on held-out tasks that are similar to groups of held-in tasks but differ in ways the model must overcome.
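To make the prompting structure concrete, the sketch below shows one way a table could be linearized and wrapped into an instruction-style prompt. The linearize_table and build_prompt helpers, the "col :" / "row i :" delimiters, and the [INST] tags are illustrative assumptions for this page, not the exact template used to train StructLM.

# Illustrative sketch of instruction-style prompting over a linearized table.
# The layout and helper names are assumptions, not the released StructLM template.

def linearize_table(header, rows):
    # Flatten a table into a single string of column names and row values.
    lines = ["col : " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        lines.append(f"row {i} : " + " | ".join(str(value) for value in row))
    return " ".join(lines)

def build_prompt(instruction, structured_input, question):
    # Combine the task instruction, the linearized structure, and the query.
    return f"[INST] {instruction}\n\n{structured_input}\n\nQuestion: {question} [/INST]"

prompt = build_prompt(
    instruction="Use the table below to answer the question.",
    structured_input=linearize_table(
        header=["Player", "Team", "Points"],
        rows=[["A. Smith", "Hawks", 31], ["B. Jones", "Bulls", 27]],
    ),
    question="Which player scored the most points?",
)
print(prompt)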

Our Datasets

Dataset | Input (avg) | Output (avg) | Train Count | Train Input (max) | Train Output (max) | Train #Trunc. | Test Count | Test Input (max) | Test Output (max) | Test #Trunc.
Held In
TabMWP 207.8 4.5 23059 709 33 0 7686 703 31 0
ToTTo 251.8 31.0 120761 2040 155 467 7700 2048 119 31
GrailQA 281.0 44.1 44337 884 134 0 6463 546 123 0
SQL2Text 122.3 18.1 5600 337 61 0 1034 245 38 0
MMQA 656.2 7.7 15688 2047 146 234 1501 2048 94 11
Spider 266.6 36.0 7000 1369 226 0 1034 453 146 0
KVRet 573.4 17.1 6288 1217 161 0 807 1147 82 0
HybridQA 700.4 6.8 62682 2047 91 200 3466 2048 79 6
SParC 276.3 32.6 12059 1417 226 0 1625 467 146 0
CompWebQ 1350.3 11.9 27639 2047 321 321 2816 2048 256 8
TabFact 660.1 4.6 92283 2045 5 2 12779 1687 4 0
WikiTQ 831.8 5.8 11321 2028 273 0 4344 2048 148 10
WikiSQL 689.2 7.1 56355 2047 518 16 15878 2048 244 1
FeTaQA 653.2 38.8 7326 1853 158 0 2003 1548 114 0
FEVEROUS 799.3 3.4 40669 2047 5 2052 4285 2048 4 195
MultiWOZ 777.2 154.5 56668 1656 196 0 7368 1344 185 0
DART 133.7 30.3 62659 406 258 0 5097 261 109 0
Logic2Text 166.1 26.9 8566 358 67 0 1092 347 60 0
MTOP 961.0 34.4 15667 1002 215 0 4386 990 113 0
SlimOrca 278.9 152.4 512069 2047 1808 0 - - - -
Held Out
BIRD 439.8 63.3 9428 1992 347 99 1534 1214 386 0
CoSQL 287.4 34.9 9502 1640 226 0 1300 535 190 0
SQA 656.9 34.9 12275 1812 1012 2 3011 1725 769 0
Infotabs 276.9 3.7 16538 1009 5 0 5400 1105 4 0
WikiTableText 149.6 27.4 10000 313 97 0 2000 226 89 0
FinQA 1230.3 21.0 6251 2040 72 186 1147 2048 61 25

Token sequence length statistics for each dataset in our train and test sets. Input and output statistics are reported in tokens. We also report the number of examples that were truncated (#Trunc.) in each split.
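These statistics can be reproduced approximately with a single tokenizer pass over each split. The sketch below is a rough illustration, assuming the CodeLlama-Instruct tokenizer and a 2,048-token input budget (consistent with the maxima in the table); the actual preprocessing scripts may differ.

# Rough sketch of how the token-length statistics above could be computed.
# Assumes the CodeLlama-Instruct tokenizer and a 2,048-token input budget;
# the exact preprocessing used for StructLM may differ.
from statistics import mean
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 2048

def length_stats(inputs, outputs, tokenizer):
    # Token lengths for every input/output string in a split.
    in_lens = [len(tokenizer(text)["input_ids"]) for text in inputs]
    out_lens = [len(tokenizer(text)["input_ids"]) for text in outputs]
    return {
        "count": len(in_lens),
        "input_avg": round(mean(in_lens), 1),
        "input_max": max(in_lens),
        "output_avg": round(mean(out_lens), 1),
        "output_max": max(out_lens),
        "num_truncated": sum(length > MAX_INPUT_TOKENS for length in in_lens),
    }

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
print(length_stats(
    inputs=["col : Player | Points row 1 : A. Smith | 31"],
    outputs=["A. Smith"],
    tokenizer=tokenizer,
))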

Our Results

StructLM can ground its responses to human queries in both structured and unstructured knowledge. The previous SoTA results were attained by many different task-specific models, such as TAPEX, USKG, TableLLaMA, and BINDER-Codex. StructLM, a single model, surpasses these previous SoTA results on seven out of eighteen SKG tasks.

Effect of different pretraining curricula on SKG finetuning performance in relevant task groupings.

Purpose | Train | Eval | FT | Result
Schema task transfer | Spider, SParC, Logic2Text | Logic2Text | 89.47 | 89.93
KT task transfer | CompWebQ, WebQSP, GrailQA, DART | DART | 60.28 | 60.34
Table task transfer | FeTaQA, HybridQA, WikiTQ, TabMWP, ToTTo, MMQA, WikiSQL, KVRet, TabFact, FEVEROUS, Infotabs | TabFact, FEVEROUS, Infotabs | 75.46 | 80.81
Summ. data type transfer | ToTTo, DART | DART | 60.28 | 61.42
QA data type transfer | CompWebQ, WikiSQL | WikiSQL | 85.49 | 86.36
Cross-task and cross-data-type transfer results. FT is the average of single-task fine-tuning performance over the datasets in the Eval column.

Effect of general instruction-following data on held-out SKG dataset performance. Performance is measured as the average over evaluation metrics across all tasks within the held-in or held-out groups. Note that the dip in held-in performance is mild compared to the gains on held-out tasks.
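As a rough illustration of this mixing strategy, the sketch below blends a general instruction-following corpus (SlimOrca, listed in the dataset table above) into an SKG instruction-tuning mixture. The dataset paths, field names, and 1:1 sampling ratio are assumptions for illustration; the released training recipe may construct the mixture differently.

# Sketch of blending general instruction-following data into the SKG mixture.
# Paths, field names, and the sampling ratio are illustrative assumptions.
from datasets import load_dataset, concatenate_datasets

# SKG examples assumed to be pre-flattened to a single "text" field;
# "skg_mixture.jsonl" is a placeholder path.
skg = load_dataset("json", data_files="skg_mixture.jsonl", split="train")

# General instruction-following data, as listed in the dataset table above.
general = load_dataset("Open-Orca/SlimOrca", split="train")

def flatten_conversation(example):
    # Collapse the multi-turn record into one "text" field so both sources
    # share a schema before concatenation (illustrative formatting).
    turns = [f'{turn["from"]}: {turn["value"]}' for turn in example["conversations"]]
    return {"text": "\n".join(turns)}

general = general.map(flatten_conversation, remove_columns=general.column_names)

# Subsample the general data so it does not overwhelm the SKG portion.
general = general.shuffle(seed=0).select(range(min(len(general), len(skg))))

mixture = concatenate_datasets([skg, general]).shuffle(seed=0)
print(f"{len(skg)} SKG + {len(general)} general = {len(mixture)} training examples")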

The evaluation results of our model against other baselines

Dataset | Metric | SoTA | ChatGPT | Base-Mistral (7B) | Base (7B) | ST (7B×18) | UL2 (20B) | TabLLaMA (7B) | USKG (3B×18) | StructLM-7B-Mistral (Ours) | StructLM-7B (Ours) | StructLM-13B (Ours) | StructLM-34B (Ours) | Δ
Held In
ToTTo BLEU 49.9 20.7 17.9 17.5 48.8 - 20.7 49.0 49.8 49.4 49.3 50.2 +0.3
GrailQA EM 77.1 9.3 1.5 1.0 77.0 - - 70.1 81.2 80.4 79.2 82.2 +5.1
SQL2Text Blec 94.8 88.6 90.7 82.9 95.2 - - 94.8 95.2 93.8 88.5 92.6 +0.4
MMQA F1 85.3 59.6 41.5 30.7 81.5 - - 85.3 85.5 85.2 86.0 88.1 +2.8
Spider EM 80.5 43.8 31.0 5.2 67.3 - - 71.8 72.4 72.4 74.1 74.6 -5.9
KVRet All Micro 67.9 52.9 34.4 39.5 70.9 - 48.7 67.9 72.2 72.6 69.5 69.3 +4.7
HybridQA Acc 68.4 23.7 12.9 2.3 58.4 61.0 27.6 59.4 62.6 59.2 59.1 61.1 -5.8
SParC EM 68.2 32.2 23.7 3.2 62.3 - - 61.5 63.3 61.9 64.9 63.4 -3.3
CompWebQ Acc 76.8 48.9 30.9 3.1 75.6 75.9 - 73.3 79.9 78.3 80.4 81.9 +5.1
TabFact Acc 93.0 62.4 25.7 0.0 79.6 87.1 82.5 83.7 84.6 80.8 84.7 86.6 -6.4
WikiTQ All Ex 65.9 24.8 6.7 0.2 45.7 54.6 31.6 49.3 56.8 50.1 53.4 55.7 -9.1
WikiSQL All Ex 93.0 31.5 21.5 0.4 86.5 87.3 41.6 86.0 87.0 88.7 87.2 87.6 -4.3
FeTaQA BLEU 39.0 7.4 13.7 5.6 33.8 35.8 39.0 33.4 37.5 36.0 35.6 37.5 -1.5
FEVEROUS Acc 85.6 57.8 73.2 58.4 78.1 85.6 72.3 82.4 85.9 84.4 85.0 85.7 +0.3
MultiWOZ Joint Acc 60.6 8.9 0.3 0.0 53.0 - - 55.4 55.4 54.5 53.0 53.8 -5.2
DART BLEU 52.0 59.0 47.4 54.6 60.3 50.4 - 46.7 63.2 62.2 61.4 61.8 +11.2
Logic2Text Blec 95.3 78.5 81.5 59.1 89.5 - - 91.4 89.5 88.9 90.1 89.1 -5.2
MTOP EM 87.5 1.4 0.8 0.0 77.4 87.5 - 86.8 75.8 81.2 81.6 82.1 -5.4
Average 74.9 39.5 30.85 20.2 68.2 - - 69.3 72.1 71.1 71.3 72.6 -1.2
Held Out
BIRD Acc 36.6 21.8 11.5 0.0 24.4 1.0 4.2 × 22.8 22.3 22.8 24.7 +2.9
CoSQL EM 58.3 33.7 26.5 0.2 52.4 5.1 5.4 × 52.8 49.8 52.2 55.0 +21.3
SQA Acc 70.5 18.7 7.4 2.3 60.4 70.1 2.0 × 42.6 49.7 36.1 44.2 +31
Infotabs Acc 75.6 46.9 49.1 40.2 68.7 70.3 35.5 × 47.2 55.3 58.1 61.8 -8.5
WikiTableText BLEU 33.7 3.8 3.9 5.7 39.8 19.4 10.2 × 17.1 8.3 9.3 8.8 -2.3
FinQA Acc 71.1 31.4 0.7 1.7 79.7 5.9 18.6 × 29.5 27.3 25.6 36.2 +4.8
Average 57.6 26.1 16.5 8.4 54.2 28.6 12.6 × 35.3 35.5 34.0 38.4 +8.2

The overall evaluation results of our model against other baselines. Cells with "-" in the held-in section mean that the model did not train on that dataset, so the results are not comparable. USKG models are overfit to the held-in dataset labels and thus cannot generalize comparably; their held-out cells are marked "×". Cells in the held-out section with "*" are held-in results for that baseline. SoTA results are copied from the original papers for reference and denote the state-of-the-art score on each task. ST refers to the single-task fine-tuning result of CodeLlama-Instruct-7B on each dataset. Base refers to the 1-shot performance of CodeLlama-Instruct-7B; Base-Mistral refers to the same for Mistral-7B-Instruct-v0.2. Δ refers to the difference between StructLM and the best known result. All StructLM held-out results are 0-shot.
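For reference, held-out evaluation uses the same instruction-style prompting shown earlier, in a 0-shot setting. The sketch below assumes the 7B checkpoint is published on Hugging Face under an ID such as TIGER-Lab/StructLM-7B and that a Llama-style [INST] prompt works; check the project repository for the exact model names and prompt template.

# 0-shot inference sketch. The model ID, prompt template, and generation
# settings are assumptions for illustration; see the project repository for
# the released checkpoints and exact prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TIGER-Lab/StructLM-7B"  # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

prompt = (
    "[INST] Use the table below to answer the question.\n\n"
    "col : Player | Team | Points "
    "row 1 : A. Smith | Hawks | 31 "
    "row 2 : B. Jones | Bulls | 27\n\n"
    "Question: Which player scored the most points? [/INST]"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
answer = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)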

BibTeX

Please cite our paper if you use our code, data, models, or results:

@misc{zhuang2024structlm,
    title={StructLM: Towards Building Generalist Models for Structured Knowledge Grounding}, 
    author={Alex Zhuang and Ge Zhang and Tianyu Zheng and Xinrun Du and Junjie Wang and Weiming Ren and Stephen W. Huang and Jie Fu and Xiang Yue and Wenhu Chen},
    year={2024},
    eprint={2402.16671},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}