TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks

University of Waterloo, Tsinghua University, IN.AI, Allen Institute for AI
*Dongfu Jiang and Yishan Li led the project and contributed equally.

dongfu.jiang@uwaterloo.ca, liyisha19@mails.tsinghua.edu.cn, wenhuchen@uwaterloo.ca

Abstract

We present 🐯TIGERScore, a Trained metric that follows Instruction Guidance to perform Explainable and Reference-free evaluation over a wide spectrum of text generation tasks. Unlike other automatic evaluation methods that only provide arcane scores, TIGERScore is guided by natural language instructions to provide error analysis that pinpoints the mistakes in the generated text. Our metric is based on LLaMA-2 and trained on our meticulously curated instruction-tuning dataset 📏MetricInstruct, which covers 6 text generation tasks and 23 text generation datasets. The dataset consists of 42K quadruples in the form of (instruction, input, system output → error analysis). We collected the system outputs from a large variety of models to cover different types of errors. To quantitatively assess our metric, we evaluate its correlation with human ratings on 5 held-in datasets and 2 held-out datasets, and show that TIGERScore achieves the open-source SoTA correlation with human ratings across these datasets, almost approaching the GPT-4 evaluator. As a reference-free metric, its correlation can even surpass the best existing reference-based metrics. To further qualitatively assess the rationales generated by our metric, we conducted a human evaluation of the generated explanations and found that the explanations are 70.8% accurate. Through these experimental results, we believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.


Figure 1: The upper part shows the input and output format of TIGERScore. The lower part shows the Kendall correlation of different metrics with respect to human ratings, from which we can see that TIGERScore achieves SoTA correlation on most generation tasks.

  1. 🤔 Existing automatic metrics are lagging behind and suffer from issues such as: 1) dependency on references, 2) limitation to specific domains, and 3) lack of attribution. In contrast, 🐯TIGERScore is designed to be driven by natural language instructions and to provide detailed error analysis that pinpoints the mistakes in the generated text.
  2. 🔧 Specifically, TIGERScore takes as input a triplet (instruction, input context, hypothesis output), where the hypothesis output may contain errors. TIGERScore then evaluates the hypothesis output and lists the errors it finds, each consisting of an (error location, aspect, explanation, penalty score). The sum of the penalty scores is taken as the overall rating of the output (see the sketch after this list).
  3. 👏 In summary, TIGERScore is built upon three design criteria: 1) driven by instructions, 2) reference-free, and 3) highly explainable. We believe TIGERScore demonstrates the possibility of building universal explainable metrics to evaluate any text generation task.
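To make the scoring scheme above concrete, the sketch below models one identified error as a small record and computes the overall rating as the sum of the penalty scores, as described in item 2. The field names and the example penalties are illustrative assumptions, not TIGERScore's exact output schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Error:
    """One identified error (illustrative schema; field names are assumptions)."""
    location: str      # offending span in the hypothesis output
    aspect: str        # evaluation aspect, e.g. "Accuracy" or "Fluency"
    explanation: str   # natural-language justification for the error
    penalty: float     # score reduction assigned to this error (negative)

def overall_rating(errors: List[Error]) -> float:
    """Overall rating = sum of penalty scores; 0.0 means no errors were found."""
    return sum(e.penalty for e in errors)

# Hypothetical example: two errors found in a translation hypothesis.
errors = [
    Error("The dog", "Accuracy", "Mistranslation: the source says 'cat'.", -4.0),
    Error("sat sat", "Fluency", "Repeated word.", -1.0),
]
print(overall_rating(errors))  # -5.0
```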

Our Dataset: MetricInstruct


Figure 2: Overall pipeline of constructing MetricInstruct through the two-channel collection. The main difference is that the real-world channel collects outputs from real-world models, while the synthetic channel asks GPT-4 to synthesize an erroneous output based on the provided reference.

We present the MetricInstruct dataset, which is employed to fine-tune TIGERScore. The three underlying criteria for dataset construction are:

  1. 🥗Dataset diversity: we choose 23 distinct datasets as source contexts to cover a sufficiently wide range of generation tasks.
  2. 🎯Error coverage: we take system outputs generated by 50+ text generation systems to cover all types of errors and guarantee a balanced distribution.
  3. 🔎Quality assurance: to ensure MetricInstruct gathers in-depth error analysis, we source the error analyses by prompting OpenAI GPT models and then filter them with different heuristics to eliminate low-quality ones (an illustrative training instance is sketched after this list).
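For illustration, a single MetricInstruct training instance pairs an (instruction, input, system output) triplet with a target error analysis, matching the 42K quadruples described above. The field names and content below are illustrative assumptions rather than the dataset's exact schema.

```python
# Hypothetical MetricInstruct-style training instance; field names and content
# are illustrative assumptions, not the dataset's exact schema.
example = {
    "task": "translation",
    "instruction": "Translate the following German sentence into English.",
    "input_context": "Die Katze saß auf der Matte.",
    "system_output": "The dog sat on the mat.",
    # Target text the model is fine-tuned to generate: a structured error analysis.
    "error_analysis": (
        "Error 1:\n"
        "Location: 'The dog'\n"
        "Aspect: Accuracy\n"
        "Explanation: 'Katze' means 'cat', not 'dog'.\n"
        "Score reduction: 4"
    ),
}
print(example["error_analysis"])
```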

Overall Results (Kendall Correlation)

| Tasks → | Summarization | Translation | Data2Text | Long-form QA | MathQA | Instruction-Following | Story-Gen | Average |
|---|---|---|---|---|---|---|---|---|
| GPT-3.5-turbo (few-shot) | 30.45 | 32.30 | 30.38 | 20.91 | 58.57 | 17.73 | 3.26 | 27.65 |
| GPT-4 (zero-shot) | 29.32 | 35.38 | 32.26 | 35.85 | 46.63 | 49.50 | 25.69 | 36.38 |
| BLEU | 8.71 | 14.50 | 23.13 | 7.73 | 17.25 | 35.92 | -0.89 | 15.19 |
| ROUGE-2f | 10.67 | 13.19 | 24.74 | 11.73 | 18.07 | 34.59 | 1.78 | 16.40 |
| InstructScore | 20.86 | 40.44 | 30.21 | 15.64 | -3.87 | 13.87 | 13.50 | 18.66 |
| GPTScore-ref | 10.80 | 18.74 | 27.47 | 22.13 | 14.86 | 25.40 | 12.78 | 18.88 |
| BARTScore-cnn (hypo-ref) | 10.00 | 21.06 | 27.04 | 20.67 | 19.07 | 24.70 | 18.58 | 20.16 |
| BARTScore-para (hypo-ref) | 10.41 | 24.90 | 28.42 | 20.24 | 14.10 | 26.13 | 12.11 | 19.47 |
| BERTScore | 17.39 | 31.57 | 30.74 | 17.70 | 9.41 | 35.61 | 2.00 | 20.63 |
| BLEURT | 12.69 | 36.12 | 34.48 | 21.11 | 2.88 | 27.94 | 19.18 | 22.34 |
| UniEval (summ) | 35.89 | 16.08 | 28.56 | 29.32 | 16.15 | 11.93 | 31.22 | 24.17 |
| COMET-22 | 25.01 | 42.79 | 23.43 | 24.66 | -4.52 | 36.17 | 27.52 | 25.01 |
| BARTScore-para (src-hypo) | 29.12 | 7.01 | 22.32 | 18.80 | -2.21 | 4.26 | 14.15 | 13.35 |
| BARTScore-cnn (src-hypo) | 26.63 | 9.40 | 23.69 | 28.93 | 1.23 | 19.09 | 23.29 | 18.89 |
| Llama-2-13b-chat (0-shot) | 25.22 | 11.79 | 23.45 | 15.96 | 1.08 | 19.50 | 21.52 | 16.93 |
| COMETKiwi | 11.87 | 36.37 | 19.08 | 12.23 | -9.38 | 26.46 | 12.78 | 15.63 |
| GPTScore-src | 28.20 | 6.50 | 19.81 | 27.64 | 11.64 | 20.04 | 16.36 | 18.60 |
| TIGERScore-7B (ours) | 28.79 | 33.65 | 32.44 | 33.93 | 19.98 | 38.13 | 29.72 | 30.95 |
| TIGERScore-13B (ours) | 31.29 | 36.50 | 36.43 | 33.17 | 21.58 | 41.84 | 35.33 | 33.73 |
| Δ (ours - best reference-free) | +2 | +0 | +13 | +4 | +10 | +15 | +14 | +15 |
| Δ (ours - best reference-based) | -4 | -6 | +2 | +4 | +2 | +5 | +4 | +8 |
Table 1: Kendall correlation of all baseline metrics and TIGERScore on the evaluation datasets shown in Table 3. For each task, the metric with the highest correlation with human ratings is highlighted in bold.

  1. Our results highlight the significant advantage of TIGERScore over other reference-free metrics: it surpasses all of them in Kendall correlation.
  2. Compared with reference-based baselines, TIGERScore generally outperforms most of them. However, it scores lower than some task-specific metrics, such as UniEval (summ) on summarization and COMET-22 on translation. We consider these discrepancies acceptable because those metrics are reference-based and each is fine-tuned specifically for a single task.
  3. Furthermore, TIGERScore achieves a significantly higher overall correlation than the cheap API substitute GPT-3.5-turbo (few-shot) and Llama-2-13b-chat (0-shot). TIGERScore-13B achieves correlation comparable to GPT-4 (zero-shot), and even higher on some tasks such as summarization, translation, data2text, and story generation (a minimal sketch of the correlation computation follows this list).
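For reference, the numbers in Table 1 are Kendall correlations between a metric's scores and human ratings over aligned examples. Below is a minimal sketch of that computation using `scipy.stats.kendalltau` on made-up scores; the paper's exact aggregation (e.g., per-dataset grouping and averaging) is not reproduced here.

```python
from scipy.stats import kendalltau

# Made-up scores for five hypothesis outputs of one task.
metric_scores = [-5.0, -1.0, 0.0, -3.0, -0.5]   # e.g. overall ratings from a metric
human_ratings = [1, 4, 5, 2, 4]                  # human quality ratings for the same outputs

tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")
```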

Human Evaluation

To assess the reasonableness of the error analyses provided by TIGERScore, we randomly selected 50 error analyses from each evaluation dataset.

These error analyses were then evaluated by human experts, who rated them along the following perspectives:

  1. Reasonableness: the human experts directly pointed out which parts of the error analyses are problematic, examining whether an analysis contains hallucinations or illogical reasoning.
  2. Comprehensiveness: the human experts carefully reviewed the source, the output, and the error analyses to determine whether there are any additional errors unnoticed by TIGERScore, and assigned a score on a scale of 1 to 4 reflecting how few errors were overlooked in the original analysis conducted by TIGERScore.
  3. Effectiveness: the revision suggestions in the error analyses were rated by the human experts on a scale of 1 to 5 to determine their appropriateness and effectiveness in improving the output quality.
  4. Overall: the human experts further assigned an overall score on a scale of 1 to 5 based on the reasonableness, comprehensiveness, and effectiveness of the error analysis.


| Task ↓ | Explanation correct | Explanation wrong | Overlooked errors 1 | Overlooked errors 2 | Overlooked errors 3 | Overlooked errors 4 | Revision helpfulness 1 | Revision helpfulness 2 | Revision helpfulness 3 | Revision helpfulness 4 | Revision helpfulness 5 | Overall rating 1 | Overall rating 2 | Overall rating 3 | Overall rating 4 | Overall rating 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Summ | 70 | 35 | 2 | 17 | 15 | 16 | 6 | 4 | 19 | 7 | 14 | 3 | 10 | 17 | 7 | 13 |
| Trans | 54 | 25 | 3 | 8 | 17 | 22 | 2 | 7 | 17 | 6 | 18 | 3 | 6 | 15 | 9 | 17 |
| D2T | 19 | 21 | 1 | 8 | 10 | 31 | 11 | 8 | 9 | 3 | 19 | 11 | 10 | 4 | 7 | 18 |
| LF-QA | 42 | 19 | 4 | 10 | 11 | 25 | 5 | 8 | 14 | 7 | 16 | 6 | 8 | 10 | 6 | 20 |
| MathQA | 39 | 26 | 5 | 12 | 12 | 21 | 5 | 7 | 19 | 5 | 14 | 4 | 9 | 10 | 13 | 14 |
| Instruct | 5 | 9 | 5 | 5 | 8 | 32 | 21 | 3 | 5 | 2 | 19 | 9 | 4 | 3 | 7 | 27 |
| Story-Gen | 66 | 29 | 7 | 16 | 13 | 14 | 7 | 6 | 16 | 10 | 11 | 7 | 12 | 11 | 9 | 11 |
| Total | 295 | 164 | 27 | 76 | 86 | 161 | 57 | 43 | 99 | 40 | 111 | 43 | 59 | 70 | 58 | 120 |

Table 4: Human evaluation results. The explanation correct/wrong question is asked per error in the error analyses; the other ratings (overlooked errors, helpfulness of revision suggestions, overall rating) are given per sample. Higher ratings indicate better performance. The most-voted rating of each task for each human evaluation aspect is bolded.

Two-Channel Data Collection


Figure 3: Investigation of the influence of Real-World & Synthetic mix training on the 13B model.

To investigate the significance of our two-channel data in contributing to the strong performance of TIGERScore, we conducted a series of experiments with the following setups:

  1. TIGERScore (MetricInstruct-Real-World), abbreviated as ReW, trained only on the real-world channel, to assess the quality of our real-world data and how well it represents real-world errors.
  2. TIGERScore (MetricInstruct-Synthetic), abbreviated as Syn, trained only on the synthetic channel, to evaluate the impact of our synthetic data as a complement to the real-world data.
  3. TIGERScore (MetricInstruct-Mix), abbreviated as Mix, our official model, trained on both channels to assess the combined effect of using both types of data during training.

Notably, the Mix model outperforms both ReW and Syn on 5 of the 7 tasks, and it achieves an average correlation approximately 20% higher than ReW and 31% higher than Syn. Although the correlation decreases slightly on the D2T and Story-Gen tasks when mixing data, the absolute correlation scores for these tasks remain around 40, which is considerably high, so we regard this as an acceptable compromise for overall model performance. These findings demonstrate the complementary nature of our two-channel data, which leads to enhanced performance when combined.

Reference

Please kindly cite our paper if you use our code, data, models or results:

@article{Jiang2023TIGERScoreTB,
    title={TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks},
    author={Dongfu Jiang and Yishan Li and Ge Zhang and Wenhao Huang and Bill Yuchen Lin and Wenhu Chen},
    journal={ArXiv},
    year={2023},
    volume={abs/2310.00752},
    url={https://api.semanticscholar.org/CorpusID:263334281}
}