Our contributions are threefold:
(1) We construct and release EditReward-Data, a large-scale (200K) preference dataset for image editing, distinguished by its high-quality manual annotations and diversity of sources.
(2) We train and release EditReward, a VLM-based reward model trained on EditReward-Data that demonstrates superior alignment with human preferences.
(3) We propose EditReward-Bench, a new benchmark featuring a more challenging multi-way preference ranking task that provides a more robust evaluation of reward models.
We evaluate our approach on a suite of three established public benchmarks and on our newly proposed EditReward-Bench, which together provide a more comprehensive assessment of image editing quality.
To demonstrate EditReward's practical utility as a data supervisor, we conducted a data curation experiment designed to improve a state-of-the-art editing model. We employed our reward model to score the ~46,000 examples in the ShareGPT-4o-Image dataset, from which we selected a high-quality subset of the top 20,000 samples. This curated dataset was then used to fine-tune the powerful Step1X-Edit model.
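Below is a minimal sketch of this curation pipeline. The `editreward` import and the `load_model` / `score` interface are hypothetical placeholders (the released API may differ); only the score-then-select-top-20K procedure follows the description above.

```python
import json
import heapq

# Hypothetical reward-model interface; the actual EditReward API may differ.
from editreward import load_model  # assumption: placeholder import


def curate_top_k(dataset_path: str, output_path: str, k: int = 20_000) -> None:
    """Score every (instruction, source, edited) example and keep the top-k."""
    model = load_model("EditReward")  # hypothetical loader
    scored = []

    with open(dataset_path) as f:
        for line in f:  # assumes one JSON example per line
            ex = json.loads(line)
            s = model.score(  # hypothetical scoring call
                instruction=ex["instruction"],
                source_image=ex["source_image"],
                edited_image=ex["edited_image"],
            )
            scored.append((s, ex))

    # Keep the k highest-scoring examples as the fine-tuning subset.
    top = heapq.nlargest(k, scored, key=lambda t: t[0])
    with open(output_path, "w") as f:
        for s, ex in top:
            ex["reward_score"] = s
            f.write(json.dumps(ex) + "\n")


if __name__ == "__main__":
    # File names are illustrative only.
    curate_top_k("sharegpt4o_image.jsonl", "curated_top20k.jsonl")
```

The resulting curated subset is what would then be used to fine-tune Step1X-Edit.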
@misc{wu2025editrewardhumanalignedrewardmodel,
title={EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing},
author={Keming Wu and Sicong Jiang and Max Ku and Ping Nie and Minghao Liu and Wenhu Chen},
year={2025},
eprint={2509.26346},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.26346},
}