Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4.1. Iterative self-debug brings further gains, particularly in symbolic or compiler-dependent languages, raising the overall execution pass rate to 82.4% at the 32B scale.
VisCode-Multi-679K is a supervised instruction-tuning dataset for visualization code generation and feedback-driven correction across twelve programming languages. It integrates two complementary sources of supervision: (1) a large collection of executable visualization code extracted from open-source repositories, spanning diverse chart types, libraries, and real-world usage patterns; each sample is validated for runtime execution and paired with its rendered output, providing reliable supervision for multi-language code generation; and (2) over 66K multi-turn dialogues from the Code-Feedback dataset, which provide feedback-based supervision for code correction. These interactions are critical for modeling realistic correction behaviors in iterative workflows. The full pipeline consists of code filtering, runtime validation, and structured instruction generation.
Data construction pipeline for VisCode-Multi-679K. We extract and filter visualization code blocks from open-source repositories and synthetic corpora across twelve programming languages, validate their executability and plot rendering via Jupyter-based runtime checks, and generate structured instructions paired with rendered plots. We integrate multi-turn correction data from Code-Feedback during instruction construction to support iterative refinement.
To build a large corpus of multi-language executable visualization code, we source data from three complementary open datasets:
the-stack-v2,
svg-diagrams,
and
CoSyn-400K.
These sources cover both natural and synthetic visualization usage across twelve languages and diverse rendering styles.
The full pipeline includes four stages: library-based filtering, code block extraction, runtime validation, and instruction generation.
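As a rough illustration of the library-based filtering stage, the sketch below scans candidate source files for plotting-library markers. The per-language marker lists and the helper are assumptions for illustration, not the exact filters used in the pipeline.

```python
import re

# Illustrative per-language plotting-library markers; these lists are an
# assumption, not the exact filters used in the actual pipeline.
VIS_MARKERS = {
    "python": [r"import matplotlib", r"import plotly", r"import seaborn"],
    "r": [r"library\(ggplot2\)", r"\bplot\("],
    "javascript": [r"\bd3\.", r"new Chart\(", r"Plotly\.newPlot"],
    "latex": [r"\\begin\{tikzpicture\}", r"\\begin\{axis\}"],
}

def looks_like_visualization(source: str, language: str) -> bool:
    """Return True if the file references any known plotting library."""
    patterns = VIS_MARKERS.get(language.lower(), [])
    return any(re.search(p, source) for p in patterns)

# Example: keep only candidates that reference a plotting library.
candidates = [
    ("python", "import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])"),
    ("python", "print('no plotting here')"),
]
kept = [src for lang, src in candidates if looks_like_visualization(src, lang)]
print(len(kept))  # -> 1
```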
From the-stack-v2 and its high-quality subsets
stack-edu
and the-stack-v2-train-smol-ids,
we identify approximately 5.3M visualization-related code candidates across multiple languages, including Python, JavaScript, C++, TypeScript, HTML, and R.
Since many examples are embedded in broader program contexts, we use GPT-4.1-mini to extract self-contained plotting blocks.
For missing data definitions, mock inputs are inserted to ensure each block executes independently.
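A minimal sketch of this extraction step, assuming the OpenAI Python client; the prompt wording below is illustrative rather than the actual prompt used in the pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the actual extraction prompt is not reproduced here.
EXTRACTION_PROMPT = (
    "You are given a source file that contains plotting code embedded in a "
    "larger program. Extract a single self-contained script that reproduces "
    "only the visualization. If required data is not defined in the file, "
    "insert small mock inputs so the script runs on its own. Return only code."
)

def extract_plotting_block(source_code: str, language: str) -> str:
    """Ask GPT-4.1-mini for a standalone, runnable plotting script."""
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": f"Language: {language}\n\n{source_code}"},
        ],
    )
    return response.choices[0].message.content
```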
For svg-diagrams, we filter 182K SVG-based diagram samples and retain about 79K valid visualization blocks.
For CoSyn-400K, we select 408K structured visualization snippets across eight languages (Python, HTML, LaTeX, SVG, Asymptote, Mermaid, LilyPond, and Vega-Lite)
and reconstruct runnable scripts where needed by inserting minimal data or plotting calls.
In total, roughly 900K candidate blocks are collected before validation.
We verify executability in isolated Jupyter environments with dedicated kernels for each language
(C++, JavaScript, R, etc.).
All blocks are executed using nbconvert with allow_errors=False under strict timeout control.
We terminate hanging or looping executions via simulated keyboard interrupts and discard monochrome or invalid outputs.
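For the Python case, this validation step can be sketched with nbformat and nbconvert's ExecutePreprocessor; per-language kernels (e.g., xeus-cling for C++, IRkernel for R) would be selected via kernel_name. The monochrome check below is a simplified stand-in for the pipeline's output filter.

```python
import base64
import io

import nbformat
from nbconvert.preprocessors import CellExecutionError, ExecutePreprocessor
from PIL import Image

def validate_block(code: str, kernel_name: str = "python3", timeout: int = 60):
    """Execute one candidate block in a fresh kernel; return the rendered image,
    or None if it errors, times out, or produces no usable (non-monochrome) plot."""
    nb = nbformat.v4.new_notebook(cells=[nbformat.v4.new_code_cell(code)])
    runner = ExecutePreprocessor(timeout=timeout, kernel_name=kernel_name,
                                 allow_errors=False)
    try:
        runner.preprocess(nb, {"metadata": {"path": "."}})
    except (CellExecutionError, TimeoutError):
        return None

    for output in nb.cells[0].get("outputs", []):
        png = output.get("data", {}).get("image/png")
        if not png:
            continue
        image = Image.open(io.BytesIO(base64.b64decode(png))).convert("RGB")
        colors = image.getcolors(maxcolors=256)
        if colors is None or len(colors) > 1:  # None means more than 256 colors
            return image  # keep only non-monochrome renders
    return None
```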
The final validated set includes 245K plotting scripts from the-stack-v2,
43K from svg-diagrams, and 322K from CoSyn-400K,
each paired with its rendered visualization output.
To produce consistent and interpretable supervision signals, we use GPT-4.1 to generate natural language instructions for each validated code-image pair.
Each instruction consists of five structured components:
(1) setup description (language and libraries),
(2) data or visual description,
(3) a small data preview (if applicable),
(4) high-level output description, and
(5) style description.
This unified template ensures that every example includes both structural and semantic context for visualization code generation, enabling instruction tuning across diverse programming languages and visualization styles.
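To make the template concrete, the sketch below assembles the five components into a single instruction string. The field names and example values are hypothetical; only the five-part structure follows the template described above.

```python
# Hypothetical field names and values; only the five-part structure follows
# the instruction template described above.
instruction_fields = {
    "setup": "Use Python with matplotlib to create the plot.",
    "data_description": "A line chart of monthly revenue for two product lines.",
    "data_preview": "month,product_a,product_b\nJan,120,95\nFeb,135,110",
    "output_description": "A single figure with two labeled lines and a legend.",
    "style": "White background, one solid and one dashed line, titled axes.",
}

def assemble_instruction(fields: dict) -> str:
    """Join the five structured components into one natural language prompt."""
    order = ["setup", "data_description", "data_preview",
             "output_description", "style"]
    return "\n\n".join(fields[key] for key in order if fields.get(key))

print(assemble_instruction(instruction_fields))
```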
To support iterative refinement and self-correction, we integrate over 66K multi-turn dialogues from the Code-Feedback dataset. These dialogues span languages such as Python, HTML, JavaScript, and R, and contain user instructions, model-generated code, and follow-up turns with execution feedback or correction requests.
Although not restricted to visualization, these interactions provide critical training signals for models to
interpret runtime feedback and revise faulty code.
The dialogues are combined with single-turn samples from the-stack-v2, svg-diagrams, and CoSyn-400K,
allowing models to learn both initial generation and multi-turn correction strategies.
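A minimal sketch of how single-turn samples and Code-Feedback dialogues might be normalized into one chat-style training format; the field names are assumptions about the released schema, not its actual structure.

```python
def to_training_messages(sample: dict) -> list:
    """Normalize one example into a chat-message list for supervised tuning.
    Single-turn samples become a two-message exchange; multi-turn Code-Feedback
    dialogues keep their full turn sequence so the model sees execution feedback
    followed by a corrected response. Field names here are illustrative."""
    if "turns" in sample:  # multi-turn correction dialogue
        return [{"role": t["role"], "content": t["content"]} for t in sample["turns"]]
    return [  # single-turn generation sample
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["code"]},
    ]

# Example usage with a hypothetical single-turn record.
example = {"instruction": "Plot a bar chart of counts by category.",
           "code": "import matplotlib.pyplot as plt\n..."}
print(to_training_messages(example)[0]["role"])  # -> "user"
```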
VisPlotBench is a standardized benchmark designed to evaluate visualization coding agents across multiple programming languages. It covers eight visualization languages and includes 888 diverse visualization tasks. Each task pairs a natural language instruction with its corresponding rendered visual and is annotated with both a Visual Category and a Subtype, spanning a total of 13 categories. This design enables fine-grained analysis of model capabilities in understanding, generating, and correcting visualization code across symbolic, declarative, and procedural paradigms.
Overview of VisPlotBench. The benchmark covers eight visualization languages and contains 888 diverse visualization tasks, each combining a natural language instruction and a rendered visual. Tasks are annotated with a Visual category and a Subtype, spanning 13 categories in total.
Existing visualization benchmarks are narrow in scope: most cover a single language, few chart families, and no iterative debugging. VisPlotBench fills these gaps with 888 tasks across eight languages and 13 Visual categories. The taxonomy spans common families such as Bars, Lines, and Scatter, while adding rarely represented ones like Hierarchies, Music, and Networks & Flows. Each task combines a natural language instruction, executable code, and a rendered output, enabling execution-grounded evaluation. With its execute–render–score protocol and multi-round self-debug loop, VisPlotBench provides the first systematic benchmark for assessing visualization coding agents across languages and task types.
The table above positions VisPlotBench among representative benchmarks across four dimensions:
language coverage, visual categories, self-debug support, and dataset size.
Earlier resources remain narrow—focusing on Python or Vega-Lite,
with limited chart types and no iterative debugging.
VisCoder introduced self-debugging for PandasPlotBench, while VisPlotBench generalizes this to eight languages,
expands coverage to 13 categories (including Hierarchies, Music, and Networks & Flows),
and standardizes evaluation for systematic cross-language assessment.
We curate a high-quality pool of visualization tasks from multiple open datasets and repositories, ensuring broad coverage across both general-purpose and domain-specific visualization frameworks. Each example is verified for executability, correct rendering, and consistent natural language pairing.
Sources include the-stack-v2, svg-diagrams, and CoSyn-400K,
spanning languages such as Python, JavaScript, LaTeX,
Asymptote, Vega-Lite, and LilyPond.
During curation, invalid, monochrome, or non-visual outputs are filtered, and missing inputs are synthetically reconstructed to ensure each sample executes independently.
Each task in VisPlotBench combines a runnable visualization script with a natural language prompt that describes its intended output. Tasks are categorized into 13 major visual types, such as statistical plots, geometric diagrams, music scores, network graphs, and typographic layouts, each subdivided into finer-grained subtypes.
The benchmark emphasizes diversity and interpretability: prompts include sufficient context to test an agent’s understanding of syntax, semantics, and rendering logic, rather than mere text-to-code matching. By covering both declarative and procedural paradigms, VisPlotBench evaluates how well models generalize across visualization styles and language conventions.
Evaluation follows a unified execute–render–score protocol. Each model-generated code snippet is executed in a sandboxed environment specific to its target language, rendered into an image, and compared against the reference visualization.
Quantitative metrics include execution success rate, structural similarity between generated and reference visuals, and semantic alignment scores derived from visual encoders. This consistent evaluation pipeline ensures comparability across languages, promoting fair benchmarking of multi-language visualization agents.
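A rough sketch of one pass of the execute-render-score protocol, assuming a hypothetical language-specific sandbox runner (`run_in_sandbox`) and using SSIM as a stand-in for the structural metric; the semantic score from a visual encoder is omitted here.

```python
import numpy as np
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def score_pair(generated_path: str, reference_path: str) -> dict:
    """Compare a rendered output against the reference image.
    SSIM on resized grayscale images stands in for the structural metric."""
    size = (256, 256)
    gen = np.asarray(Image.open(generated_path).convert("L").resize(size))
    ref = np.asarray(Image.open(reference_path).convert("L").resize(size))
    return {"structural": ssim(gen, ref, data_range=255)}

def evaluate_task(task: dict, run_in_sandbox) -> dict:
    """One execute-render-score pass for a single benchmark task.
    `run_in_sandbox` is a hypothetical callable that executes the generated
    code in its language-specific environment and returns the path of the
    rendered image, or None on failure."""
    rendered = run_in_sandbox(task["generated_code"], task["language"])
    if rendered is None:
        return {"exec_pass": False}
    return {"exec_pass": True, **score_pair(rendered, task["reference_image"])}
```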
We evaluate both proprietary and open-source models on VisPlotBench to compare execution reliability across parameter scales, programming languages, and evaluation modes. Proprietary references include GPT-4.1 and its lighter variant GPT-4.1-mini, while open-source baselines include DeepSeek-Coder, DeepSeek-CoderV2, Qwen2.5-Coder, and VisCoder. Our VisCoder2 models are trained on VisCode-Multi-679K using Qwen2.5-Coder backbones at 3B, 7B, 14B, and 32B scales.
Overall execution pass rate (%) of selected models on the VisPlotBench benchmark. The best-performing model in each scale is shown in bold, and the second best is underlined.
GPT-4.1 achieves 63.4% overall, the highest among reference models, and GPT-4.1-mini follows closely. Both perform strongly on standardized declarative or markup languages such as Vega-Lite, SVG, and HTML, all above 84%. In contrast, instruction-tuned open-source models remain far behind. At the 7B scale, Qwen2.5-Coder reaches only 51.2% overall, with fewer than 30% on LaTeX and just 5.5% on LilyPond. Previous VisCoder variants improve Python performance but fail to generalize across languages. These results underline the substantial gap between proprietary and open-source models.
Performance differs sharply across visualization languages. Vega-Lite and HTML are close to saturation for most models, while Python shows steady gains with scale. By contrast, symbolic and compiler-dependent languages remain the most difficult. Even GPT-4.1 achieves less than 45% on LilyPond and under 25% on Asymptote, and open-source baselines fall much lower. This uneven landscape highlights that progress on symbolic grammars is the key bottleneck for reliable multi-language visualization.
Across all scales, VisCoder2 consistently outperforms size-matched open-source baselines. At 32B, it improves overall execution pass rate by approximately 15 points compared with Qwen2.5-Coder and reaches parity with GPT-4.1. The only consistent shortfall is on SVG, where VisCoder2 trails the strongest baseline by over 10 points. Overall, VisCoder2 is the first open-source model to match proprietary reliability on executable visualization tasks.
Iterative correction consistently improves execution reliability across model families and scales. Proprietary models benefit strongly, and VisCoder2 follows the same trend: at larger scales, overall execution rises by nearly ten points when self-debugging is enabled. The effect is especially pronounced for symbolic and compiler-dependent languages such as LilyPond, LaTeX, and Asymptote, where fragile syntax or compilation errors dominate. Self-debugging repairs these shallow but frequent failures, turning previously intractable cases into valid outputs. This demonstrates that feedback-driven refinement is not a marginal improvement but a critical mechanism for tackling the hardest visualization languages.
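A minimal sketch of the multi-round self-debug loop, assuming `generate` wraps the model (prompt to code) and `execute` runs the code and returns a success flag plus an error message; the feedback wording is illustrative, and the three-round cap matches the qualitative examples reported later.

```python
def self_debug(generate, execute, instruction: str, max_rounds: int = 3):
    """Generate code, then feed execution errors back to the model for up to
    `max_rounds` correction attempts. Returns the final code and a success flag."""
    code = generate(instruction)
    ok, error = execute(code)
    rounds = 0
    while not ok and rounds < max_rounds:
        feedback = (f"{instruction}\n\nThe previous attempt failed with:\n"
                    f"{error}\n\nPlease return a corrected version of the code.")
        code = generate(feedback)
        ok, error = execute(code)
        rounds += 1
    return code, ok
```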
We analyze Task Score and Visual Score on three representative languages that highlight different behaviors:
LaTeX illustrates execution–semantics mismatch, LilyPond shows the largest gains on symbolic grammars,
and SVG exposes model–library sensitivity where semantic and perceptual signals diverge.
Results for all languages and scales are provided in the appendix.
Performance of selected languages on the VisPlotBench benchmark. For each model, we report (1) execution pass rate (Exec Pass), (2) mean visual and task scores (Mean), and (3) the proportion of samples scoring at least 75 (Good). The best-performing model in each scale is shown in bold, and the second best is underlined.
Models often capture the intended structure of a figure but fail to compile reliably. For example, GPT-4.1 improves from 31.3% to 66.1% execution pass rate with Self-Debug, while task scores remain around 50 even when execution fails. VisCoder2 raises execution and task scores compared with baselines, but compilation errors remain frequent. This pattern indicates that semantic alignment does not always translate into successful rendering.
VisCoder2 delivers the clearest advantage on symbolic languages. At 7B, Qwen2.5-Coder executes only 5.5% of tasks, while VisCoder2 reaches 69.1% and further improves with Self-Debug. The proportion of examples with task scores above 75 also increases by more than tenfold. These results show that targeted coverage of symbolic grammars in VisCode-Multi-679K translates directly into reliable generation and semantic adherence.
Execution success is high across most models, yet visual scores lag behind task scores. For instance, GPT-4.1 with Self-Debug achieves 95.4% execution and a task score near 90, but the average visual score is below 50. VisCoder2 performs competitively but trails Qwen2.5 on execution at larger scales (81.5% versus 93.9% at 32B). These discrepancies suggest that evaluation on SVG is strongly influenced by library-specific rendering details rather than semantic understanding alone.
To examine the error recovery behavior of VisCoder-7B, we analyze how execution error counts transition before and after self-debugging. The table below summarizes four representative error types, grouped by plotting library. Each entry shows the count before and after debugging (e.g., 15 → 2).
Execution error transitions for VisCoder-7B across four representative error types.
Values show changes from the initial to post-debugging state. Structural issues (e.g., AttributeError)
are often resolved, while semantic failures (e.g., KeyError) persist.
To better understand failure modes across languages, we analyze execution errors before and after self-debug.
Many language-specific exceptions, such as FunctionSignatureError in Asymptote or MarkupError in LilyPond,
were merged into four broader categories for clarity: Structural Errors (syntax or parsing),
Type & Interface Errors (invalid calls or arguments), Semantic / Data Errors (mismatched variables or values),
and Runtime / Environment Errors (renderer or package issues).
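This grouping can be expressed as a simple lookup from raw exception names to the four categories; the specific assignments below are illustrative assumptions, and the full mapping used in the analysis covers many more language-specific errors.

```python
# Illustrative assignments only; the actual mapping covers many more
# language-specific exceptions than shown here.
ERROR_CATEGORIES = {
    "SyntaxError": "Structural",
    "MarkupError": "Structural",                   # seen in LilyPond
    "TypeError": "Type & Interface",
    "FunctionSignatureError": "Type & Interface",  # seen in Asymptote
    "NameError": "Semantic / Data",
    "KeyError": "Semantic / Data",
    "ImportError": "Runtime / Environment",
}

def categorize(error_name: str) -> str:
    """Map a raw exception name to one of the four broader categories.
    Unknown names default to Runtime / Environment in this sketch."""
    return ERROR_CATEGORIES.get(error_name, "Runtime / Environment")

print(categorize("MarkupError"))  # -> "Structural"
```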
Representative results for VisCoder2-32B are shown below, demonstrating error transitions from initial failure to final self-debug round.
Self-debug effectively reduces shallow errors such as missing tokens or invalid arguments across multiple languages. For example, Python interface errors fall from 13 to 3, and structural errors in LilyPond decrease from 14 to 10. Mermaid and Asymptote show the same trend, with syntax and function signature errors shrinking after correction (Asymptote structural errors drop from 9 to 3). These cases benefit from explicit diagnostic traces, making them relatively easy to fix through iterative feedback.
Errors involving semantics or execution environments remain difficult to resolve. In LaTeX, undefined variables decrease only slightly (28 to 23), and Asymptote variable mismatches improve only marginally (15 to 11). Renderer failures such as Vega-Lite rendering errors (2 to 2) and HTML request failures (3 to 2) often persist across all rounds. These errors require deeper reasoning over symbolic grammars and runtime contexts, which current self-debug protocols cannot fully capture. Symbolic languages and renderer-sensitive environments therefore remain the dominant bottlenecks, pointing to the need for grammar-aware training objectives and more robust runtime integration.
Python – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Python – Self-Debug Recovery: The initial code raises a ValueError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
Python – Self-Debug Failed: The initial code raises an AttributeError and still fails after three rounds of self-debug.
Vega-Lite – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Vega-Lite – Self-Debug Recovery: The initial code raises a TypeError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
Vega-Lite – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
LilyPond – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
LilyPond – Self-Debug Recovery: The initial code raises a SyntaxError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
LilyPond – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
Mermaid – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Mermaid – Self-Debug Recovery: The initial code raises a SyntaxError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
Mermaid – Self-Debug Failed: The initial code raises an AttributeError and still fails after three rounds of self-debug.
SVG – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
SVG – Self-Debug Recovery: The initial code raises an ExpatError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
SVG – Self-Debug Failed: The initial code raises a ParseError and still fails after three rounds of self-debug.
LaTeX – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
LaTeX – Self-Debug Recovery: The initial code raises a NameError and is resolved in the second round of self-debug, resulting in a corrected plot that matches the intended semantics.
LaTeX – Self-Debug Failed: The initial code raises a NameError and still fails after three rounds of self-debug.
Asymptote – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Asymptote – Self-Debug Recovery: The initial code raises a NameError and is resolved in the third round of self-debug, resulting in a corrected plot that matches the intended semantics.
Asymptote – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
HTML – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
HTML – Self-Debug Recovery: The initial code raises a ImportError and is resolved in the first round of self-debug, resulting in a corrected plot that matches the intended semantics.
HTML – Self-Debug Failed: The initial code raises a TypeError and still fails after three rounds of self-debug.
@misc{ni2025viscoder2buildingmultilanguagevisualization,
title={VisCoder2: Building Multi-Language Visualization Coding Agents},
author={Yuansheng Ni and Songcheng Cai and Xiangchao Chen and Jiarong Liang and Zhiheng Lyu and Jiaqi Deng and Kai Zou and Ping Nie and Fei Yuan and Xiang Yue and Wenhu Chen},
year={2025},
eprint={2510.23642},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2510.23642},
}
@article{ni2025viscoder,
title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
journal={arXiv preprint arXiv:2506.03930},
year={2025}
}