Executable software engineering data is valuable for training SWE agents, but scaling it remains difficult for two reasons: only a small fraction of real repository changes yield verifiable, high-signal task instances, and naively building repository-specific environments quickly becomes the dominant systems cost. We present SWE-Next, an execution-grounded framework for scalable SWE task and trajectory collection. On the data side, SWE-Next mines real merged pull requests, executes candidate base/merged commit pairs, and retains only those that produce strict test improvements without regressions, yielding self-verifying instances. It also applies strict submission gating so that collected trajectories remain evidence-driven rather than speculative. On the systems side, SWE-Next introduces reusable repo-quarter profiles, which share a single environment across temporally nearby commits while keeping each task run isolated and reproducible. Using only 30 hours and 639 GB of environment storage, SWE-Next processes 3,971 seed repositories and 102,582 candidate commit pairs mined from real merged PRs to construct a dataset of 2,308 self-verifying instances. Experiments show that SWE-Next improves downstream pass@1 with fewer or comparable training trajectories, indicating that its gains come not from a stronger trajectory generator, but from higher-signal execution-grounded supervision and more efficient data collection.
An execution-grounded framework that turns real merged-PR commits into self-verifying SWE tasks and pairs them with high-signal trajectory collection defaults such as strict submission gating.
A reusable environment mechanism that amortizes build and storage cost across temporally nearby commits, substantially reducing resource requirements and accelerating large-scale executable SWE data collection.
SWE-Next reduces environment cost by reusing one shared runtime per repository-quarter instead of rebuilding a full Docker image for every commit pair. For each repository, commits are mapped to a coarse-grained repo-quarter profile, and we build a reusable environment that contains only the stable dependency layer, including system packages, Python, and cached dependencies, while keeping repository source code outside the image. At execution time, the commit-specific snapshot is mounted read-only and copied into a writable workspace, so runs stay isolated while still supporting normal testing and diff-based editing. If a shared quarter environment fails for a rare case, SWE-Next falls back to a per-commit environment; in our final retained instances, this happened only once (0.04%).
This amortizes the expensive build-and-validation cost, reducing storage from terabytes to hundreds of gigabytes and cutting build time from days to hours.
With the help of repo-quarter profiles, SWE-Next substantially reduces both storage cost and end-to-end collection time compared with prior executable SWE pipelines at comparable repository scale.
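At its core, a repo-quarter profile is a coarse bucketing of commits by repository and calendar quarter, so that all commits in the same bucket share one prebuilt environment. The mapping can be sketched as follows (a minimal illustration; the function name and key format are assumptions, not from SWE-Next):

```python
from datetime import datetime

def quarter_profile(repo: str, commit_date: datetime) -> str:
    """Map a commit to its repo-quarter key, e.g. 'scipy/2024Q3'.

    All commits of the same repository that fall in the same calendar
    quarter resolve to the same key and therefore reuse one shared
    environment (system packages, Python, cached dependencies).
    """
    quarter = (commit_date.month - 1) // 3 + 1
    return f"{repo}/{commit_date.year}Q{quarter}"

# Two commits from August and September 2024 land in the same bucket,
# so the expensive environment build is paid only once for both.
print(quarter_profile("scipy", datetime(2024, 8, 5)))   # scipy/2024Q3
print(quarter_profile("scipy", datetime(2024, 9, 20)))  # scipy/2024Q3
```

Because only the commit-specific source snapshot varies within a quarter, the shared image can hold the stable dependency layer while each run mounts its snapshot read-only into a writable workspace.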
SWE-Next filters candidate data in three stages. We first keep repositories with stable Python test setups, then discard oversized or irrelevant commits before execution, and finally run the same test command on both base and merged commits. Only NEW_COMMIT_BETTER pairs that improve at least one test without regressions are retained as self-verifying instances.
Keep active Python repositories with usable test suites and reproducible container setups.
Drop large, doc-only, and weak-signal commits before spending execution budget.
Execute both commits and keep only strict test improvements with no regressions.
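The strict-improvement criterion behind NEW_COMMIT_BETTER amounts to comparing per-test outcomes on the base and merged commits: a pair is kept only if at least one test flips from fail to pass and no test flips from pass to fail. A minimal sketch of this check (the dict-based outcome representation and function name are illustrative assumptions):

```python
def is_new_commit_better(base: dict[str, str], merged: dict[str, str]) -> bool:
    """Return True iff the merged commit strictly improves on the base:
    at least one test flips fail -> pass, and none flips pass -> fail."""
    improved = any(
        base.get(test) == "fail" and outcome == "pass"
        for test, outcome in merged.items()
    )
    regressed = any(
        outcome == "pass" and merged.get(test) == "fail"
        for test, outcome in base.items()
    )
    return improved and not regressed

# A pair that fixes one test without breaking another is retained;
# a pair that trades one failure for another is discarded.
base   = {"test_parse": "fail", "test_io": "pass"}
fixed  = {"test_parse": "pass", "test_io": "pass"}
traded = {"test_parse": "pass", "test_io": "fail"}
print(is_new_commit_better(base, fixed))   # True
print(is_new_commit_better(base, traded))  # False
```

Because the verdict comes from executing the repository's own test suite on both commits, each retained instance carries its verification signal with it, with no manually written oracle.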
The repositories span a wide range of Python domains, from web frameworks and data science libraries to developer tooling and scientific computing.
SWE-Next achieves higher pass@1 than prior trajectory-collection pipelines, while remaining competitive with substantially larger models.
Most failed runs are still substantive debugging attempts rather than infrastructure breakdowns. Residual environment or harness failures are rare, while the dominant bottleneck is patch quality: agents often reach the right subsystem and produce plausible edits, but still miss a semantic detail or edge case required to satisfy the full target test set.
Among failed runs, 86.0% execute at least one test command, 67.2% run a reproduction script, and 86.8% edit repository files. This indicates that failures are usually meaningful attempts rather than empty rollouts, and that the main challenge remains reliable bug fixing rather than simply getting the environment to run.
@misc{liang2026swenextscalablerealworldsoftware,
title={SWE-Next: Scalable Real-World Software Engineering Tasks for Agents},
author={Jiarong Liang and Zhiheng Lyu and Zijie Liu and Xiangchao Chen and Ping Nie and Kai Zou and Wenhu Chen},
year={2026},
eprint={2603.20691},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2603.20691},
}