TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

University of Waterloo, Tsinghua University, IN.AI, Allen Institute for AI
*Dongfu Jiang and Yishan Li led the project and contributed equally.

dongfu.jiang@uwaterloo.ca, liyisha19@mails.tsinghua.edu.cn, wenhuchen@uwaterloo.ca

Abstract

We present 🐯TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instructions to provide error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2 and trained on our meticulously curated instruction-tuning dataset 📏MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form of (instruction, input, system output → error analysis). We collected the system outputs from a large variety of models to cover different types of errors. To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets, almost approaching the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass the best existing reference-based metrics. To further qualitatively assess the rationales generated by our metric, we conducted a human evaluation of the generated explanations and found that the explanations are 70.8% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.


Figure 1: The upper part shows the input and output format of TIGERScore. The lower part shows the Kendall correlation of different metrics with respect to human ratings, from which we can see that TIGERScore achieves SoTA correlation on most generation tasks.

  1. 🤔 Existing automatic metrics are lagging behind and suffer from issues such as: 1) dependency on references, 2) limitation to specific domains, and 3) lack of attribution. In contrast, 🐯TIGERScore is designed to be driven by natural language instructions and to provide detailed error analysis that pinpoints the mistakes in the generated text.
  2. 🔧 Specifically, TIGERScore takes as input a triplet (instruction, input context, hypothesis output), where the hypothesis output may contain errors. TIGERScore then evaluates the hypothesis output and lists the errors it finds, each consisting of an (error location, aspect, explanation, penalty score). The sum of the penalty scores is taken as the overall rating of the output (see the sketch after this list).
  3. 👏 In summary, TIGERScore is built upon three design criteria: 1) driven by instructions, 2) reference-free, and 3) highly explainable. We believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.
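To make the scoring scheme above concrete, the sketch below models one identified error as a small record and computes the overall rating as the sum of the penalty scores, as described in item 2. The field names and the example penalties are illustrative assumptions, not TIGERScore's exact output schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Error:
    """One identified error (illustrative schema; field names are assumptions)."""
    location: str      # offending span in the hypothesis output
    aspect: str        # evaluation aspect, e.g. "Accuracy" or "Fluency"
    explanation: str   # natural-language justification for the error
    penalty: float     # score reduction assigned to this error (negative)

def overall_rating(errors: List[Error]) -> float:
    """Overall rating = sum of penalty scores; 0.0 means no errors were found."""
    return sum(e.penalty for e in errors)

# Hypothetical example: two errors found in a translation hypothesis.
errors = [
    Error("The dog", "Accuracy", "Mistranslation: the source says 'cat'.", -4.0),
    Error("sat sat", "Fluency", "Repeated word.", -1.0),
]
print(overall_rating(errors))  # -5.0
```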

Our Dataset: MetricInstruct


Figure 2: Overall pipeline of constructing MetricInstruct through the two-channel collection. The main difference is that the real-world channel collects outputs from real-world models, while the synthetic channel asks GPT-4 to synthesize an erroneous output based on the provided reference.

We present the MetricInstruct dataset, which is employed to fine-tune TIGERScore. The three underlying criteria for dataset construction are:

  1. 🥗Dataset diversity: we choose 23 distinct datasets as source contexts to cover a sufficiently wide range of generation tasks.
  2. 🎯Error coverage: we take system outputs generated by 50+ text generation systems to cover all types of errors and guarantee a balanced distribution.
  3. 🔎Quality assurance: to ensure MetricInstruct gathers in-depth error analysis, we source the error analyses by prompting OpenAI GPT models and then filter them with different heuristics to eliminate low-quality ones (an illustrative training instance is sketched after this list).
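For illustration, a single MetricInstruct training instance pairs an (instruction, input, system output) triplet with a target error analysis, matching the 42K quadruples described above. The field names and content below are illustrative assumptions rather than the dataset's exact schema.

```python
# Hypothetical MetricInstruct-style training instance; field names and content
# are illustrative assumptions, not the dataset's exact schema.
example = {
    "task": "translation",
    "instruction": "Translate the following German sentence into English.",
    "input_context": "Die Katze saß auf der Matte.",
    "system_output": "The dog sat on the mat.",
    # Target text the model is fine-tuned to generate: a structured error analysis.
    "error_analysis": (
        "Error 1:\n"
        "Location: 'The dog'\n"
        "Aspect: Accuracy\n"
        "Explanation: 'Katze' means 'cat', not 'dog'.\n"
        "Score reduction: 4"
    ),
}
print(example["error_analysis"])
```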

Overall Results (Kendall Correlation)

| Tasks → | Summarization | Translation | Data2Text | Long-form QA | MathQA | Instruction-Following | Story-Gen | Average |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5-turbo (few-shot) | 30.45 | 32.30 | 30.38 | 20.91 | 58.57 | 17.73 | 3.26 | 27.65 |
| GPT-4 (zero-shot) | 29.32 | 35.38 | 32.26 | 35.85 | 46.63 | 49.50 | 25.69 | 36.38 |
| BLEU | 8.71 | 14.50 | 23.13 | 7.73 | 17.25 | 35.92 | -0.89 | 15.19 |
| ROUGE-2f | 10.67 | 13.19 | 24.74 | 11.73 | 18.07 | 34.59 | 1.78 | 16.40 |
| InstructScore | 20.86 | 40.44 | 30.21 | 15.64 | -3.87 | 13.87 | 13.50 | 18.66 |
| GPTScore-ref | 10.80 | 18.74 | 27.47 | 22.13 | 14.86 | 25.40 | 12.78 | 18.88 |
| BARTScore-cnn (hypo-ref) | 10.00 | 21.06 | 27.04 | 20.67 | 19.07 | 24.70 | 18.58 | 20.16 |
| BARTScore-para (hypo-ref) | 10.41 | 24.90 | 28.42 | 20.24 | 14.10 | 26.13 | 12.11 | 19.47 |
| BERTScore | 17.39 | 31.57 | 30.74 | 17.70 | 9.41 | 35.61 | 2.00 | 20.63 |
| BLEURT | 12.69 | 36.12 | 34.48 | 21.11 | 2.88 | 27.94 | 19.18 | 22.34 |
| UniEval (summ) | 35.89 | 16.08 | 28.56 | 29.32 | 16.15 | 11.93 | 31.22 | 24.17 |
| COMET-22 | 25.01 | 42.79 | 23.43 | 24.66 | -4.52 | 36.17 | 27.52 | 25.01 |
| BARTScore-para (src-hypo) | 29.12 | 7.01 | 22.32 | 18.80 | -2.21 | 4.26 | 14.15 | 13.35 |
| BARTScore-cnn (src-hypo) | 26.63 | 9.40 | 23.69 | 28.93 | 1.23 | 19.09 | 23.29 | 18.89 |
| Llama-2-13b-chat (0-shot) | 25.22 | 11.79 | 23.45 | 15.96 | 1.08 | 19.50 | 21.52 | 16.93 |
| COMETKiwi | 11.87 | 36.37 | 19.08 | 12.23 | -9.38 | 26.46 | 12.78 | 15.63 |
| GPTScore-src | 28.20 | 6.50 | 19.81 | 27.64 | 11.64 | 20.04 | 16.36 | 18.60 |
| TIGERScore-7B (ours) | 28.79 | 33.65 | 32.44 | 33.93 | 19.98 | 38.13 | 29.72 | 30.95 |
| TIGERScore-13B (ours) | 31.29 | 36.50 | 36.43 | 33.17 | 21.58 | 41.84 | 35.33 | 33.73 |
| Δ (ours - best reference-free) | +2 | +0 | +13 | +4 | +10 | +15 | +14 | +15 |
| Δ (ours - best reference-based) | -4 | -6 | +2 | +4 | +2 | +5 | +4 | +8 |
Table 1: Kendall correlation of all baseline metrics and TIGERScore on the evaluation datasets shown in Table 3. For each task, the metric with the highest correlation with human ratings is highlighted in bold.

  1. Our results highlight the significant advantage of TIGERScore over other reference-free metrics: it surpasses all of them in Kendall correlation.
  2. Compared with reference-based baselines, TIGERScore generally outperforms most of them. However, it scores lower than some task-specific metrics, such as UniEval (summ) on summarization and COMET-22 on translation. We consider these discrepancies acceptable because those metrics are reference-based and each is fine-tuned specifically for a single task.
  3. Furthermore, TIGERScore achieves a significantly higher overall correlation than the cheap API substitute GPT-3.5-turbo (few-shot) and Llama-2-13b-chat (0-shot). TIGERScore-13B achieves correlation comparable to GPT-4 (zero-shot), and even higher on some tasks such as summarization, translation, data2text, and story generation (a minimal sketch of the correlation computation follows this list).
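For reference, the numbers in Table 1 are Kendall correlations between a metric's scores and human ratings over aligned examples. Below is a minimal sketch of that computation using `scipy.stats.kendalltau` on made-up scores; the paper's exact aggregation (e.g., per-dataset grouping and averaging) is not reproduced here.

```python
from scipy.stats import kendalltau

# Made-up scores for five hypothesis outputs of one task.
metric_scores = [-5.0, -1.0, 0.0, -3.0, -0.5]   # e.g. overall ratings from a metric
human_ratings = [1, 4, 5, 2, 4]                  # human quality ratings for the same outputs

tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```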

Human Evaluation

To assess the reasonableness of the error analyses provided by TIGERScore, we randomly selected 50 error analyses from each evaluation dataset.

These error analyses were then evaluated by human experts, who rated them along the following perspectives:

  1. Reasonableness: the human experts directly pointed out which parts of the error analyses are problematic, examining whether an analysis contains hallucinations or illogical reasoning.
  2. Comprehensiveness: the human experts carefully reviewed the source, the output, and the error analyses to determine whether there are any additional errors unnoticed by TIGERScore, and assigned a score on a scale of 1 to 4 reflecting how few errors were overlooked in the original analysis conducted by TIGERScore.
  3. Effectiveness: the revision suggestions in the error analyses were rated by the human experts on a scale of 1 to 5 to determine their appropriateness and effectiveness in improving the output quality.
  4. Overall: the human experts further assigned an overall score on a scale of 1 to 5 based on the reasonableness, comprehensiveness, and effectiveness of the error analysis.


| Task ↓ | Explanation correct | Explanation wrong | Overlooked errors 1 | Overlooked errors 2 | Overlooked errors 3 | Overlooked errors 4 | Revision helpfulness 1 | Revision helpfulness 2 | Revision helpfulness 3 | Revision helpfulness 4 | Revision helpfulness 5 | Overall rating 1 | Overall rating 2 | Overall rating 3 | Overall rating 4 | Overall rating 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Summ | 70 | 35 | 2 | 17 | 15 | 16 | 6 | 4 | 19 | 7 | 14 | 3 | 10 | 17 | 7 | 13 |
| Trans | 54 | 25 | 3 | 8 | 17 | 22 | 2 | 7 | 17 | 6 | 18 | 3 | 6 | 15 | 9 | 17 |
| D2T | 19 | 21 | 1 | 8 | 10 | 31 | 11 | 8 | 9 | 3 | 19 | 11 | 10 | 4 | 7 | 18 |
| LF-QA | 42 | 19 | 4 | 10 | 11 | 25 | 5 | 8 | 14 | 7 | 16 | 6 | 8 | 10 | 6 | 20 |
| MathQA | 39 | 26 | 5 | 12 | 12 | 21 | 5 | 7 | 19 | 5 | 14 | 4 | 9 | 10 | 13 | 14 |
| Instruct | 5 | 9 | 5 | 5 | 8 | 32 | 21 | 3 | 5 | 2 | 19 | 9 | 4 | 3 | 7 | 27 |
| Story-Gen | 66 | 29 | 7 | 16 | 13 | 14 | 7 | 6 | 16 | 10 | 11 | 7 | 12 | 11 | 9 | 11 |
| Total | 295 | 164 | 27 | 76 | 86 | 161 | 57 | 43 | 99 | 40 | 111 | 43 | 59 | 70 | 58 | 120 |

Table 4: Human evaluation results. The explanation correct/wrong question is asked per error in the error analyses; the other ratings (overlooked errors, helpfulness of revision suggestions, overall rating) are given per sample. Higher ratings indicate better performance. The most-voted rating of each task for each human evaluation aspect is bolded.

Two-Channel Data Collection


Figure 3: Investigation of the influence of Real-World & Synthetic mix training on the 13B model.

To investigate the significance of our two-channel data in contributing to the strong performance of TIGERScore, we conducted a series of experiments with the following setups:

  1. TIGERScore (MetricInstruct-Real-World), abbreviated as ReW, trained only on the real-world channel, to assess the quality of our real-world data and how well it represents real-world errors.
  2. TIGERScore (MetricInstruct-Synthetic), abbreviated as Syn, trained only on the synthetic channel, to evaluate the impact of our synthetic data as a complement to the real-world data.
  3. TIGERScore (MetricInstruct-Mix), abbreviated as Mix, our official model, trained on both channels to assess the combined effect of using both types of data during training.

Notably, the Mix model outperforms both ReW and Syn on 5 of the 7 tasks, and it achieves an average correlation approximately 20% higher than ReW and 31% higher than Syn. Although the correlation decreases slightly on the D2T and Story-Gen tasks when mixing data, the absolute correlation scores for these tasks remain around 40, which is considerably high, so we regard this as an acceptable compromise for overall model performance. These findings demonstrate the complementary nature of our two-channel data, which leads to enhanced performance when combined.

Reference

Please kindly cite our paper if you use our code, data, models or results:

@article{Jiang2023TIGERScoreTB,
    title={TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks},
    author={Dongfu Jiang and Yishan Li and Ge Zhang and Wenhao Huang and Bill Yuchen Lin and Wenhu Chen},
    journal={ArXiv},
    year={2023},
    volume={abs/2310.00752},
    url={https://api.semanticscholar.org/CorpusID:263334281}
}