ImagenWorld: Stress-Testing Image Generation Models with Explainable Human Evaluation on Open-ended Real-World Tasks

Samin Mahdizadeh Sani*¹, Max Ku*¹♠, Nima Jamali¹, Matina Mahdizadeh Sani¹, Paria Khoshtab, Wei-Chieh Sun², Parnian Fazel³, Zhi Rui Tam⁴, Thomas Chong, Edisy Kin Wai Chan⁵, Donald Wai Tong Tsang⁶, Chiao-Wei Hsu, Ting Wai Lam⁷, Ho Yin Sam Ng⁸, Chiafeng Chu, Chak-Wing Mak⁹, Keming Wuᵃ, Hiu Tung Wong, Yik Chun Ho⁷, Chi Ruan¹, Zhuofeng Liᵇ, I-Sheng Fangᶜ, Shih-Ying Yehᵈ, Ho Kei Chengᵉ, Ping Nie, Wenhu Chen¹♠
¹University of Waterloo, ²University of Washington, ³Imperial College, ⁴National Taiwan University, ⁵University of Southampton, ⁶University of British Columbia, ⁷Chinese University of Hong Kong, ⁸Pennsylvania State University, ⁹Peking University, ᵃTsinghua University, ᵇTexas A&M University, ᶜAcademia Sinica, ᵈNTHU, ᵉUniversity of Illinois Urbana-Champaign, Vector Institute, Independent
samin.mahdizadeh@gmail.com, m3ku@uwaterloo.ca, wenhuchen@uwaterloo.ca

TL;DR

We build a benchmark of 3.6K condition sets spanning 6 tasks × 6 domains, with 20K explainable human annotations, that stress-tests image generation and editing, shows where models break (notably local edits and text-heavy content), benchmarks VLM-as-judge baselines against human ratings, and identifies key failure modes.

Abstract

Advances in diffusion, autoregressive, and hybrid models have enabled high-quality image synthesis for tasks such as text-to-image generation, editing, and reference-guided composition. Yet existing benchmarks remain limited: they either focus on isolated tasks, cover only narrow domains, or provide opaque scores without explaining failure modes. We introduce ImagenWorld, a benchmark of 3.6K condition sets spanning six core tasks (generation and editing, with single or multiple references) and six topical domains (artworks, photorealistic images, information graphics, textual graphics, computer graphics, and screenshots). The benchmark is supported by 20K fine-grained human annotations and an explainable evaluation schema that tags localized object-level and segment-level errors, complementing automated VLM-based metrics. Our large-scale evaluation of 14 models yields several insights: (1) models typically struggle more on editing tasks than on generation tasks, especially local edits; (2) models excel in artistic and photorealistic settings but struggle with symbolic and text-heavy domains such as screenshots and information graphics; (3) closed-source systems lead overall, while targeted data curation (e.g., Qwen-Image) narrows the gap in text-heavy cases; (4) modern VLM-based metrics achieve Kendall accuracies of up to 0.79, approximating human rankings, but fall short of fine-grained, explainable error attribution. ImagenWorld provides both a rigorous benchmark and a diagnostic tool for advancing robust image generation.
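
For context on the Kendall-style agreement figure reported above, the sketch below shows one common way to compute pairwise ranking accuracy between an automatic judge's scores and human scores. It is a minimal illustration with toy data; the tie handling and variable names are assumptions, not the paper's exact protocol.

```python
# Illustrative sketch (not the paper's evaluation code): pairwise ranking
# accuracy between an automatic judge and human ratings. A pair of images
# counts as concordant when the judge orders it the same way humans do.
from itertools import combinations

def pairwise_ranking_accuracy(judge_scores, human_scores):
    """Fraction of image pairs ordered the same way by judge and humans.

    Pairs tied under the human scores are skipped, which is one common
    convention; the paper's exact tie handling may differ.
    """
    concordant, total = 0, 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # skip human ties
        judge_diff = judge_scores[i] - judge_scores[j]
        if judge_diff * human_diff > 0:
            concordant += 1
        total += 1
    return concordant / total if total else float("nan")

# Toy example with made-up scores for four generated images.
human = [4.5, 2.0, 3.5, 1.0]
judge = [4.0, 2.5, 3.0, 1.5]
print(pairwise_ranking_accuracy(judge, human))  # 1.0 on this toy data
```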


Overview

We introduce ImagenWorld, a large-scale, human-centric benchmark designed to stress-test image generation models in real-world scenarios. Unlike prior evaluations that focus on isolated tasks or narrow domains, ImagenWorld is organized into six domains: Artworks, Photorealistic Images, Information Graphics, Textual Graphics, Computer Graphics, and Screenshots, and six tasks: Text-to-Image Generation (TIG), Single-Reference Image Generation (SRIG), Multi-Reference Image Generation (MRIG), Text-to-Image Editing (TIE), Single-Reference Image Editing (SRIE), and Multi-Reference Image Editing (MRIE). The benchmark includes 3.6K condition sets and 20K fine-grained human annotations, providing a comprehensive testbed for generative models. To support explainable evaluation, ImagenWorld applies object- and segment-level extraction to generated outputs, identifying entities such as objects and fine-grained regions. This structured decomposition enables human annotators to provide not only scalar ratings but also detailed tags of object-level and segment-level failures.
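
To make the annotation structure concrete, here is a minimal sketch of how a condition set and one explainable annotation record might be represented. All class and field names below are illustrative assumptions, not ImagenWorld's actual schema.

```python
# Illustrative sketch of one condition set plus its explainable annotation.
# Field names are assumptions for clarity, not ImagenWorld's actual schema.
from dataclasses import dataclass, field

@dataclass
class ConditionSet:
    task: str                 # e.g. "MRIE" (Multi-Reference Image Editing)
    domain: str               # e.g. "Screenshots"
    prompt: str               # textual instruction
    reference_images: list[str] = field(default_factory=list)  # paths/URLs

@dataclass
class Annotation:
    condition_id: str
    model: str
    scores: dict[str, float]          # scalar ratings per evaluation metric
    object_errors: list[str]          # e.g. ["missing: yellow warning sign"]
    segment_errors: list[dict] = field(default_factory=list)
    # each segment error: {"region": [x, y, w, h], "issue": "garbled text"}

example = Annotation(
    condition_id="mrie-screenshots-0001",
    model="some-editing-model",
    scores={"instruction_following": 2.0, "visual_quality": 3.5},
    object_errors=["wrong color: crewmates are red/green instead of pink/yellow"],
    segment_errors=[{"region": [120, 40, 60, 30], "issue": "distorted sign"}],
)
```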


Dataset Preview

We build a diverse benchmark covering 6 topics × 6 tasks = 36 areas and annotate over 20K images in total, each with object-level and segment-level annotations that localize errors.


Figure 1: Illustrative samples from our dataset across six tasks: Text-to-Image Generation (TIG), Single-Reference Image Generation (SRIG), Multi-Reference Image Generation (MRIG), Text-to-Image Editing (TIE), Single-Reference Image Editing (SRIE), and Multi-Reference Image Editing (MRIE). For each task, we show both a successful generation (green) and a failure case (red).


Figure 2: Examples include object-level issues, where expected objects are missing or distorted, and segment-level issues, where annotations highlight specific regions with visual inconsistencies that affect evaluation scores.

Overall Results

  • Across ImagenWorld, models struggle more with editing tasks (TIE/SRIE/MRIE) than with generation tasks (TIG/SRIG/MRIG). Performance peaks on Artworks and Photorealistic Images, whereas Information Graphics and Screenshots are the most challenging due to their symbolic content and text-heavy, structured layouts.

Figure 3: Mean human evaluation scores across our four metrics by topic (left) and task (right).


Figure 4: Overall human rating by task and topic for the four unified models that support all six tasks.

  • Models struggle to execute localized edits reliably: AR–diffusion hybrids often overwrite the input with an entirely new image, while diffusion editors fail in the opposite way, frequently doing nothing and returning the input unchanged (a rough heuristic for flagging both behaviors is sketched after Figure 5).

Figure 5: Percentage of cases where the model generates a completely new image or simply returns the input in image editing tasks.
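
As a concrete illustration of how the two behaviors in Figure 5 could be flagged automatically, the sketch below compares each edited output against its input with a crude global similarity score. This is an assumed heuristic with placeholder thresholds (copy_thresh, new_thresh), not the procedure used in the paper.

```python
# Rough heuristic (illustrative only, with assumed thresholds) for flagging
# the two failure modes in Figure 5: the output is a near-copy of the input,
# or it shares almost nothing with the input (i.e. a brand-new image).
import numpy as np
from PIL import Image

def edit_behavior(input_path: str, output_path: str,
                  copy_thresh: float = 0.99, new_thresh: float = 0.20) -> str:
    a = np.asarray(Image.open(input_path).convert("L").resize((256, 256)), dtype=np.float64)
    b = np.asarray(Image.open(output_path).convert("L").resize((256, 256)), dtype=np.float64)
    # Normalized cross-correlation as a crude global similarity measure.
    a, b = a - a.mean(), b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    sim = float((a * b).sum() / denom) if denom > 0 else 0.0
    if sim >= copy_thresh:
        return "returned input unchanged"
    if sim <= new_thresh:
        return "generated a completely new image"
    return "performed some edit"

# Usage sketch (paths are placeholders):
# print(edit_behavior("input.png", "model_output.png"))
```

In practice, a perceptual metric such as SSIM or LPIPS would be more robust than raw correlation, and any thresholds would need to be calibrated against human labels.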

Common Failure Modes

1. Failing to Precisely Follow Instructions

Prompt:

Edit image 1. Replace the top-left crate with the yellow warning sign from image 3. Place the pink crewmate (from the center of image 2) and the yellow crewmate (from the bottom right of image 2) standing side-by-side on the central doorway in image 1. Ensure all new elements are integrated with correct perspective, lighting, and scale.


Figure 6: Instruction-following problem: The model placed red and green crewmates instead of pink and yellow, and the yellow sign’s position does not match the request.

2. Numerical Inconsistencies


Figure 7: Examples of numerical inconsistencies.

3. Segment and Labeling Issues


Figure 8: Examples of labeling issues.

4. Generating a New Image Instead of Editing


Figure 9: Examples of generating a new image when the task is editing.

5. Plot and Chart Errors


Figure 10: Examples of plot and diagram issues.

6. Unreadable Text


Figure 11: Examples of text issues.

Citation

Please kindly cite our paper if you use our code, data, models or results:

@inproceedings{}