Structured data sources, such as tables, graphs, and databases, are ubiquitous knowledge sources. Despite the demonstrated capabilities of large language models (LLMs) on plain text, their proficiency in interpreting and utilizing structured data remains limited. Our investigation reveals a notable deficiency in LLMs’ ability to process structured data: for example, ChatGPT lags behind state-of-the-art (SoTA) models by an average of 35%. To augment the Structured Knowledge Grounding (SKG) capabilities of LLMs, we have constructed a comprehensive instruction-tuning dataset comprising 1.1 million examples. Utilizing this dataset, we train a series of models, referred to as StructLM, based on the CodeLlama architecture, ranging from 7B to 34B parameters. Our StructLM series surpasses task-specific models on 14 of 18 evaluated datasets and establishes new SoTA results on 7 SKG tasks. Furthermore, StructLM demonstrates exceptional generalization across 6 novel SKG tasks. Contrary to expectations, we observe that scaling model size offers marginal benefits, with StructLM-34B showing only slight improvements over StructLM-7B. This suggests that structured knowledge grounding remains a challenging task and requires more innovative design to push to a new level.
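Since StructLM is a decoder-only model in the CodeLlama family, it can be queried like any causal LM. Below is a minimal inference sketch; the hub id `TIGER-Lab/StructLM-7B`, the table linearization, and the prompt layout are assumptions for illustration, not the official released format.

```python
# Minimal inference sketch (hub id and prompt format are assumptions;
# consult the released model card for the official ones).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TIGER-Lab/StructLM-7B"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Tables are passed as linearized text; this layout is illustrative only.
table = "col : rank | player | points row 1 : 1 | A. Smith | 34 row 2 : 2 | B. Jones | 29"
prompt = (
    "Use the table below to answer the question.\n\n"
    f"{table}\n\n"
    "Question: Which player scored the most points?\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```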
Dataset | Input (avg) | Output (avg) | Train Count | Train Input (max) | Train Output (max) | Train #Trunc. | Test Count | Test Input (max) | Test Output (max) | Test #Trunc.
---|---|---|---|---|---|---|---|---|---|---
Held In | | | | | | | | | |
TabMWP | 207.8 | 4.5 | 23059 | 709 | 33 | 0 | 7686 | 703 | 31 | 0
ToTTo | 251.8 | 31.0 | 120761 | 2040 | 155 | 467 | 7700 | 2048 | 119 | 31
GrailQA | 281.0 | 44.1 | 44337 | 884 | 134 | 0 | 6463 | 546 | 123 | 0
SQL2Text | 122.3 | 18.1 | 5600 | 337 | 61 | 0 | 1034 | 245 | 38 | 0
MMQA | 656.2 | 7.7 | 15688 | 2047 | 146 | 234 | 1501 | 2048 | 94 | 11
Spider | 266.6 | 36.0 | 7000 | 1369 | 226 | 0 | 1034 | 453 | 146 | 0
KVRet | 573.4 | 17.1 | 6288 | 1217 | 161 | 0 | 807 | 1147 | 82 | 0
HybridQA | 700.4 | 6.8 | 62682 | 2047 | 91 | 200 | 3466 | 2048 | 79 | 6
SParC | 276.3 | 32.6 | 12059 | 1417 | 226 | 0 | 1625 | 467 | 146 | 0
CompWebQ | 1350.3 | 11.9 | 27639 | 2047 | 321 | 321 | 2816 | 2048 | 256 | 8
TabFact | 660.1 | 4.6 | 92283 | 2045 | 5 | 2 | 12779 | 1687 | 4 | 0
WikiTQ | 831.8 | 5.8 | 11321 | 2028 | 273 | 0 | 4344 | 2048 | 148 | 10
WikiSQL | 689.2 | 7.1 | 56355 | 2047 | 518 | 16 | 15878 | 2048 | 244 | 1
FeTaQA | 653.2 | 38.8 | 7326 | 1853 | 158 | 0 | 2003 | 1548 | 114 | 0
FEVEROUS | 799.3 | 3.4 | 40669 | 2047 | 5 | 2052 | 4285 | 2048 | 4 | 195
MultiWOZ | 777.2 | 154.5 | 56668 | 1656 | 196 | 0 | 7368 | 1344 | 185 | 0
DART | 133.7 | 30.3 | 62659 | 406 | 258 | 0 | 5097 | 261 | 109 | 0
Logic2Text | 166.1 | 26.9 | 8566 | 358 | 67 | 0 | 1092 | 347 | 60 | 0
MTOP | 961.0 | 34.4 | 15667 | 1002 | 215 | 0 | 4386 | 990 | 113 | 0
SlimOrca | 278.9 | 152.4 | 512069 | 2047 | 1808 | 0 | - | - | - | -
Held Out | | | | | | | | | |
BIRD | 439.8 | 63.3 | 9428 | 1992 | 347 | 99 | 1534 | 1214 | 386 | 0
CoSQL | 287.4 | 34.9 | 9502 | 1640 | 226 | 0 | 1300 | 535 | 190 | 0
SQA | 656.9 | 34.9 | 12275 | 1812 | 1012 | 2 | 3011 | 1725 | 769 | 0
Infotabs | 276.9 | 3.7 | 16538 | 1009 | 5 | 0 | 5400 | 1105 | 4 | 0
WikiTableText | 149.6 | 27.4 | 10000 | 313 | 97 | 0 | 2000 | 226 | 89 | 0
Finqa | 1230.3 | 21.0 | 6251 | 2040 | 72 | 186 | 1147 | 2048 | 61 | 25
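Statistics like those in the table above can be reproduced by tokenizing each example's input and output and counting how many inputs exceed the context window. The sketch below assumes CodeLlama's tokenizer and a 2048-token input cap (inferred from the Input (max) and #Trunc. columns); the actual preprocessing pipeline may differ.

```python
# Sketch of per-split length statistics, assuming lengths are measured in
# CodeLlama tokenizer tokens and inputs are truncated at a 2048-token cap
# (inferred from the table's max-length and #Trunc. columns).
from transformers import AutoTokenizer

MAX_INPUT_TOKENS = 2048
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

def length_stats(examples):
    """examples: list of {"input": str, "output": str} pairs for one split."""
    in_lens = [len(tokenizer(ex["input"]).input_ids) for ex in examples]
    out_lens = [len(tokenizer(ex["output"]).input_ids) for ex in examples]
    return {
        "count": len(examples),
        "input_avg": round(sum(in_lens) / len(in_lens), 1),
        "output_avg": round(sum(out_lens) / len(out_lens), 1),
        # over-long inputs are clipped, so the reported max never exceeds the cap
        "input_max": min(max(in_lens), MAX_INPUT_TOKENS),
        "output_max": max(out_lens),
        "num_truncated": sum(n > MAX_INPUT_TOKENS for n in in_lens),
    }
```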
Purpose | Train | Eval | FT | Result
---|---|---|---|---
Schema task transfer | Spider, SParC, Logic2Text | Logic2Text | 89.47 | 89.93
KG task transfer | CompWebQ, WebQSP, GrailQA, DART | DART | 60.28 | 60.34
Table task transfer | | | 75.46 | 80.81
Summ. data type transfer | ToTTo, DART | DART | 60.28 | 61.42
QA data type transfer | CompWebQ, WikiSQL | WikiSQL | 85.49 | 86.36
Dataset | Metric | SoTA | ChatGPT | Base-Mistral | Base | ST | UL2 | TabLLaMA | USKG | StructLM-7B-Mistral | StructLM-7B | StructLM-13B | StructLM-34B | Δ
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Size | - | - | - | 7B | 7B | 7B×18 | 20B | 7B | 3B×18 | 7B | 7B | 13B | 34B | -
Held In | | | | | | | | | | | | | |
ToTTo | BLEU | 49.9 | 20.7 | 17.9 | 17.5 | 48.8 | - | 20.7 | 49.0 | 49.8 | 49.4 | 49.3 | 50.2 | +0.3 |
GrailQA | EM | 77.1 | 9.3 | 1.5 | 1.0 | 77.0 | - | - | 70.1 | 81.2 | 80.4 | 79.2 | 82.2 | +5.1 |
SQL2Text | BLEC | 94.8 | 88.6 | 90.7 | 82.9 | 95.2 | - | - | 94.8 | 95.2 | 93.8 | 88.5 | 92.6 | +0.4
MMQA | F1 | 85.3 | 59.6 | 41.5 | 30.7 | 81.5 | - | - | 85.3 | 85.5 | 85.2 | 86.0 | 88.1 | +2.8 |
Spider | EM | 80.5 | 43.8 | 31.0 | 5.2 | 67.3 | - | - | 71.8 | 72.4 | 72.4 | 74.1 | 74.6 | -5.9 |
KVRet | All Micro | 67.9 | 52.9 | 34.4 | 39.5 | 70.9 | - | 48.7 | 67.9 | 72.2 | 72.6 | 69.5 | 69.3 | +4.7 |
HybridQA | Acc | 68.4 | 23.7 | 12.9 | 2.3 | 58.4 | 61.0 | 27.6 | 59.4 | 62.6 | 59.2 | 59.1 | 61.1 | -5.8 |
SParC | EM | 68.2 | 32.2 | 23.7 | 3.2 | 62.3 | - | - | 61.5 | 63.3 | 61.9 | 64.9 | 63.4 | -3.3 |
CompWebQ | Acc | 76.8 | 48.9 | 30.9 | 3.1 | 75.6 | 75.9 | - | 73.3 | 79.9 | 78.3 | 80.4 | 81.9 | +5.1 |
TabFact | Acc | 93.0 | 62.4 | 25.7 | 0.0 | 79.6 | 87.1 | 82.5 | 83.7 | 84.6 | 80.8 | 84.7 | 86.6 | -6.4 |
WikiTQ | All Ex | 65.9 | 24.8 | 6.7 | 0.2 | 45.7 | 54.6 | 31.6 | 49.3 | 56.8 | 50.1 | 53.4 | 55.7 | -9.1 |
WikiSQL | All Ex | 93.0 | 31.5 | 21.5 | 0.4 | 86.5 | 87.3 | 41.6 | 86.0 | 87.0 | 88.7 | 87.2 | 87.6 | -4.3 |
FeTaQA | BLEU | 39.0 | 7.4 | 13.7 | 5.6 | 33.8 | 35.8 | 39.0 | 33.4 | 37.5 | 36.0 | 35.6 | 37.5 | -1.5 |
FEVEROUS | Acc | 85.6 | 57.8 | 73.2 | 58.4 | 78.1 | 85.6 | 72.3 | 82.4 | 85.9 | 84.4 | 85.0 | 85.7 | +0.3 |
MultiWOZ | Joint Acc | 60.6 | 8.9 | 0.3 | 0.0 | 53.0 | - | - | 55.4 | 55.4 | 54.5 | 53.0 | 53.8 | -5.2 |
DART | BLEU | 52.0 | 59.0 | 47.4 | 54.6 | 60.3 | 50.4 | - | 46.7 | 63.2 | 62.2 | 61.4 | 61.8 | +11.2 |
Logic2Text | BLEC | 95.3 | 78.5 | 81.5 | 59.1 | 89.5 | - | - | 91.4 | 89.5 | 88.9 | 90.1 | 89.1 | -5.2
MTOP | EM | 87.5 | 1.4 | 0.8 | 0.0 | 77.4 | 87.5 | - | 86.8 | 75.8 | 81.2 | 81.6 | 82.1 | -5.4 |
Average | | 74.9 | 39.5 | 30.85 | 20.2 | 68.2 | - | - | 69.3 | 72.1 | 71.1 | 71.3 | 72.6 | -1.2
Held Out | | | | | | | | | | | | | |
BIRD | Acc | 36.6 | 21.8 | 11.5 | 0.0 | 24.4 | 1.0 | 4.2 | × | 22.8 | 22.3 | 22.8 | 24.7 | +2.9 |
CoSQL | EM | 58.3 | 33.7 | 26.5 | 0.2 | 52.4 | 5.1 | 5.4 | × | 52.8 | 49.8 | 52.2 | 55.0 | +21.3 |
SQA | Acc | 70.5 | 18.7 | 7.4 | 2.3 | 60.4 | 70.1 | 2.0 | × | 42.6 | 49.7 | 36.1 | 44.2 | +31.0
Infotabs | Acc | 75.6 | 46.9 | 49.1 | 40.2 | 68.7 | 70.3 | 35.5 | × | 47.2 | 55.3 | 58.1 | 61.8 | -8.5 |
WikiTableText | BLEU | 33.7 | 3.8 | 3.9 | 5.7 | 39.8 | 19.4 | 10.2 | × | 17.1 | 8.3 | 9.3 | 8.8 | -2.3 |
Finqa | Acc | 71.1 | 31.4 | 0.7 | 1.7 | 79.7 | 5.9 | 18.6 | × | 29.5 | 27.3 | 25.6 | 36.2 | +4.8 |
Average | | 57.6 | 26.1 | 16.5 | 8.4 | 54.2 | 28.6 | 12.6 | × | 35.3 | 35.5 | 34.0 | 38.4 | +8.2
Overall evaluation results of our models against other baselines. Cells with "-" in the held-in section mean that the model was not trained on that dataset, so results are not comparable. USKG models are overfit to the held-in dataset labels and thus cannot generalize comparably; their held-out cells are marked "×". Cells in the held-out section marked "*" are held-in results for that baseline. SoTA denotes the state-of-the-art score on each task, copied from the original papers for reference. ST refers to the single-task fine-tuning result of CodeLlama-Instruct-7B on each dataset. Base refers to the 1-shot performance of CodeLlama-Instruct-7B; Base-Mistral refers to the same for Mistral-7B-Instruct-v0.2. Δ refers to the difference between the best StructLM variant and the best known comparable result. All StructLM held-out results are 0-shot.
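To make the Δ column concrete: it is the gap between the best StructLM variant and the strongest comparable prior result, where the comparable set is row-dependent (held-in rows compare against the published SoTA; held-out rows exclude fine-tuned and held-in-marked baselines). A small sketch, with the comparable set passed in explicitly since it varies per row:

```python
# Sketch of the Δ column: best StructLM variant minus the strongest
# comparable prior result. The comparable set differs per row, so it is
# an explicit argument rather than derived automatically.
def delta(structlm_scores, comparable_baselines):
    return round(max(structlm_scores) - max(comparable_baselines), 1)

# Held-in example (ToTTo): best StructLM 50.2 vs. SoTA 49.9 -> 0.3
print(delta([49.8, 49.4, 49.3, 50.2], [49.9]))
# Held-out example (CoSQL): best StructLM 55.0 vs. ChatGPT 33.7 -> 21.3
# (SoTA and single-task fine-tuned results are excluded as not comparable.)
print(delta([52.8, 49.8, 52.2, 55.0], [33.7, 26.5, 0.2, 5.1, 5.4]))
```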
@misc{zhuang2024structlm,
title={StructLM: Towards Building Generalist Models for Structured Knowledge Grounding},
author={Alex Zhuang and Ge Zhang and Tianyu Zheng and Xinrun Du and Junjie Wang and Weiming Ren and Stephen W. Huang and Jie Fu and Xiang Yue and Wenhu Chen},
year={2024},
eprint={2402.16671},
archivePrefix={arXiv},
primaryClass={cs.CL}
}