VisCoder2

Building Multi-Language Visualization Coding Agents

Yuansheng Ni1*†, Songcheng Cai1*, Xiangchao Chen1*,
Jiarong Liang1, Zhiheng Lyu1, Jiaqi Deng3, Ping Nie5, Kai Zou4, Fei Yuan5, Xiang Yue2, Wenhu Chen1†

1 University of Waterloo, 2 Carnegie Mellon University, 3 Korea Advanced Institute of Science & Technology, 4 Netmind.ai, 5 Independent

* Equal contributions. † Corresponding to: yuansheng.ni@uwaterloo.ca, wenhuchen@uwaterloo.ca

Overview of VisCoder2. The VisCoder2 project comprises three components: 1) VisCode-Multi-679K: a dataset of executable code–visualization pairs with multi-round correction dialogues across 12 programming languages; 2) VisPlotBench: a benchmark spanning 8 languages with natural language instructions, executable code, and rendered outputs; 3) VisCoder2: a family of visualization coding agents that iteratively execute, render, and self-debug, approaching the performance of proprietary models.

Abstract

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable execution, and lack of iterative correction mechanisms. Progress has been constrained by narrow datasets and benchmarks that emphasize single-round generation and single-language tasks. To address these challenges, we introduce three complementary resources for advancing visualization coding agents. VisCode-Multi-679K is a large-scale, supervised dataset containing 679K validated and executable visualization samples with multi-turn correction dialogues across 12 programming languages. VisPlotBench is a benchmark for systematic evaluation, featuring executable tasks, rendered outputs, and protocols for both initial generation and multi-round self-debug. Finally, we present VisCoder2, a family of multi-language visualization models trained on VisCode-Multi-679K. Experiments show that VisCoder2 significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4.1. Iterative self-debug yields further gains, particularly in symbolic or compiler-dependent languages, reaching an 82.4% overall execution pass rate at the 32B scale.

VisCode-Multi-679K

Overview

VisCode-Multi-679K is a supervised instruction-tuning dataset for visualization code generation and feedback-driven correction across twelve programming languages. It integrates two complementary sources of supervision: (1) a large collection of executable visualization code extracted from open-source repositories, spanning diverse chart types, libraries, and real-world usage patterns; each sample is validated for runtime execution and paired with its rendered output, ensuring reliable supervision for multi-language code generation; and (2) over 66K multi-turn dialogues from the Code-Feedback dataset, which provide feedback-based supervision for code correction. These interactions are critical for modeling realistic correction behaviors in iterative workflows. The full pipeline consists of code filtering, runtime validation, and structured instruction generation.

Data Pipeline

Data construction pipeline for VisCode-Multi-679K. We extract and filter visualization code blocks from open-source repositories and synthetic corpora across twelve programming languages, validate their executability and plot rendering via Jupyter-based runtime checks, and generate structured instructions paired with rendered plots. We integrate multi-turn correction data from Code-Feedback during instruction construction to support iterative refinement.
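To make the runtime-validation step concrete, the sketch below shows what such a check might look like for the Python subset: run the snippet headlessly and keep it only if it exits cleanly and renders a non-empty image. The actual pipeline uses Jupyter-based checks across twelve languages, so the helper name and the matplotlib wrapper here are illustrative assumptions rather than the released tooling.

import os
import subprocess
import tempfile

def passes_runtime_check(code: str, timeout: int = 60) -> bool:
    """Run a Python visualization snippet headlessly and verify that it
    executes cleanly and renders a non-empty image (illustrative only)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "snippet.py")
        out_png = os.path.join(tmp, "plot.png")
        # Force a non-interactive backend, then save whatever figure the
        # snippet produced so the render can be inspected afterwards.
        wrapper = (
            "import matplotlib\n"
            "matplotlib.use('Agg')\n"
            + code + "\n"
            "import matplotlib.pyplot as plt\n"
            f"plt.savefig({out_png!r})\n"
        )
        with open(script, "w") as f:
            f.write(wrapper)
        try:
            result = subprocess.run(["python", script],
                                    capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or not os.path.exists(out_png):
            return False
        return os.path.getsize(out_png) > 0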

Code Extraction from Public Repositories

To build a large corpus of multi-language executable visualization code, we source data from three complementary open datasets: the-stack-v2, svg-diagrams, and CoSyn-400K. These sources cover both natural and synthetic visualization usage across twelve languages and diverse rendering styles. The full pipeline includes four stages: library-based filtering, code block extraction, runtime validation, and instruction generation.
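As a rough illustration of the library-based filtering stage, the sketch below keeps only code blocks that reference a known plotting library. The allow-list and the helper name are assumptions for illustration; the filters actually used to build VisCode-Multi-679K are broader and language-specific.

import re

# Illustrative allow-list; the real filters cover all twelve languages.
PLOT_LIBS = {
    "python": [r"\bimport\s+matplotlib", r"\bimport\s+seaborn", r"\bimport\s+plotly"],
    "r": [r"library\(ggplot2\)", r"library\(plotly\)"],
    "javascript": [r"\bd3\.", r"vega-lite", r"plotly\.js"],
}

def is_visualization_candidate(source: str, language: str) -> bool:
    """Keep a code block only if it imports or references a known plotting library."""
    return any(re.search(p, source) for p in PLOT_LIBS.get(language, []))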


Multi-turn Feedback Integration

To support iterative refinement and self-correction, we integrate over 66K multi-turn dialogues from the Code-Feedback dataset. These dialogues span languages such as Python, HTML, JavaScript, and R, containing user instructions, model-generated code, and follow-up turns with execution feedback or correction requests.

Although not restricted to visualization, these interactions provide critical training signals for models to interpret runtime feedback and revise faulty code. The dialogues are combined with single-turn samples from the-stack-v2, svg-diagrams, and CoSyn-400K, allowing models to learn both initial generation and multi-turn correction strategies.
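For concreteness, a correction dialogue can be serialized in a standard chat format along the lines of the sketch below; the field names and message contents are made up for illustration and are not the released schema.

# Illustrative multi-turn correction sample in a standard chat format.
sample = {
    "source": "code-feedback",
    "language": "python",
    "messages": [
        {"role": "user", "content": "Plot monthly revenue as a bar chart from revenue.csv."},
        {"role": "assistant", "content": "# initial attempt (raises KeyError: 'month')\n..."},
        {"role": "user", "content": "Execution failed:\nKeyError: 'month'\nPlease fix the code."},
        {"role": "assistant", "content": "# revised attempt that inspects the columns first\n..."},
    ],
}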

VisPlotBench

Overview

VisPlotBench is a standardized benchmark designed to evaluate visualization coding agents across multiple programming languages. It covers eight visualization languages and includes 888 diverse visualization tasks. Each task pairs a natural language instruction with its corresponding rendered visual and is annotated with both a Visual Category and a Subtype, spanning a total of 13 categories. This design enables fine-grained analysis of model capabilities in understanding, generating, and correcting visualization code across symbolic, declarative, and procedural paradigms.
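Concretely, a benchmark entry can be pictured as a record like the sketch below, mirroring the components just described; the field names and the subtype label are hypothetical, not the released format.

# Hypothetical VisPlotBench task record; field names are illustrative only.
task = {
    "task_id": "svg_0042",
    "language": "svg",
    "instruction": "Draw a scatter plot of height versus weight with labeled axes.",
    "visual_category": "Scatter",
    "subtype": "Basic Scatter",
    "reference_code": "...",         # executable ground-truth program
    "reference_render": "0042.png",  # rendered output used for scoring
}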

Overview of VisPlotBench

Overview of VisPlotBench. The benchmark covers eight visualization languages and contains 888 diverse visualization tasks, each combining a natural language instruction and a rendered visual. Tasks are annotated with a Visual category and a Subtype, spanning 13 categories in total.

Existing visualization benchmarks are narrow in scope: most cover a single language, few chart families, and no iterative debugging. VisPlotBench fills these gaps with 888 tasks across eight languages and 13 Visual categories. The taxonomy spans common families such as Bars, Lines, and Scatter, while adding rarely represented ones like Hierarchies, Music, and Networks & Flows. Each task combines a natural language instruction, executable code, and a rendered output, enabling execution-grounded evaluation. With its execute–render–score protocol and multi-round self-debug loop, VisPlotBench provides the first systematic benchmark for assessing visualization coding agents across languages and task types.
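The execute–render–score protocol with multi-round self-debug can be summarized by the loop below. Here `model.generate` and `execute` are placeholder interfaces, and the round limit is an assumption rather than the benchmark's exact setting.

def generate_with_self_debug(model, task, execute, max_rounds=3):
    """Sketch of the evaluation loop: generate code, try to execute and
    render it, and on failure feed the error back for another attempt."""
    messages = [{"role": "user", "content": task["instruction"]}]
    code = model.generate(messages)                      # initial generation
    for _ in range(max_rounds):
        result = execute(code, language=task["language"])
        if result.ok:                                    # rendered successfully
            return code, result.render                   # ready for scoring
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user",
             "content": f"Execution failed:\n{result.error}\nPlease fix the code."},
        ]
        code = model.generate(messages)                  # self-debug attempt
    return code, None                                    # unresolved after max_rounds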

Comparison with existing benchmarks

The table above positions VisPlotBench among representative benchmarks across four dimensions: language coverage, visual categories, self-debug support, and dataset size. Earlier resources remain narrow—focusing on Python or Vega-Lite, with limited chart types and no iterative debugging. VisCoder introduced self-debugging for PandasPlotBench, while VisPlotBench generalizes this to eight languages, expands coverage to 13 categories (including Hierarchies, Music, and Networks & Flows), and standardizes evaluation for systematic cross-language assessment.

Experiment Results

Main Results

We evaluate both proprietary and open-source models on VisPlotBench to compare execution reliability across parameter scales, programming languages, and evaluation modes. Proprietary references include GPT-4.1 and its lighter variant GPT-4.1-mini, while open-source baselines include DeepSeek-Coder, DeepSeek-Coder-V2, Qwen2.5-Coder, and VisCoder. Our VisCoder2 models are trained on VisCode-Multi-679K using Qwen2.5-Coder backbones at 3B, 7B, 14B, and 32B scales.

main results

Overall execution pass rate (%) of selected models on the VisPlotBench benchmark. The best-performing model in each scale is shown in bold, and the second best is underlined.

Task and Visual Score Analysis

We analyze Task Score and Visual Score on three representative languages that highlight different behaviors: LaTeX illustrates execution–semantics mismatch, LilyPond shows the largest gains on symbolic grammars, and SVG exposes model–library sensitivity where semantic and perceptual signals diverge. Results for all languages and scales are provided in the appendix.

task and visual score analysis

Performance of selected languages on the VisPlotBench benchmark. For each model, we report (1) execution pass rate (Exec Pass), (2) mean visual and task scores (Mean), and (3) the proportion of samples scoring at least 75 (Good). The best-performing model in each scale is shown in bold, and the second best is underlined.
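Assuming each evaluated sample carries an execution flag and a 0-100 score, the three reported quantities can be aggregated as in the sketch below. The per-sample field names are assumptions, and how failed executions enter the Mean is a benchmark detail not restated here.

def summarize(samples):
    """Aggregate Exec Pass, Mean, and Good (score >= 75) over one language.
    `exec_ok` and `score` are assumed per-sample fields."""
    n = len(samples)
    exec_pass = 100 * sum(s["exec_ok"] for s in samples) / n
    mean_score = sum(s["score"] for s in samples) / n
    good = 100 * sum(s["score"] >= 75 for s in samples) / n
    return {"Exec Pass (%)": round(exec_pass, 1),
            "Mean": round(mean_score, 1),
            "Good (%)": round(good, 1)}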

Error Analysis

To examine the error recovery behavior of VisCoder-7B, we analyze how execution error counts transition before and after self-debugging. The table below summarizes four representative error types, grouped by plotting library. Each entry shows the count before and after debugging (e.g., 15 → 2).

Error Table

Execution error transitions for VisCoder-7B across four representative error types. Values show changes from the initial to post-debugging state. Structural issues (e.g., AttributeError) are often resolved, while semantic failures (e.g., KeyError) persist.
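The before/after counts in such a table can be tallied with a few lines of bookkeeping, as in the sketch below; the error lists in the example are synthetic and only mirror the "15 → 2" formatting of the table.

from collections import Counter

def error_transitions(before, after):
    """Format per-error-type counts as 'initial -> post-debug' strings."""
    pre, post = Counter(before), Counter(after)
    return {err: f"{pre[err]} -> {post[err]}" for err in sorted(pre)}

# Synthetic example with made-up counts, just to show the output format.
print(error_transitions(
    before=["AttributeError"] * 15 + ["KeyError"] * 6,
    after=["AttributeError"] * 2 + ["KeyError"] * 5,
))
# {'AttributeError': '15 -> 2', 'KeyError': '6 -> 5'}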

Case Study Examples

BibTeX


@misc{ni2025viscoder2buildingmultilanguagevisualization,
  title={VisCoder2: Building Multi-Language Visualization Coding Agents},
  author={Yuansheng Ni and Songcheng Cai and Xiangchao Chen and Jiarong Liang and Zhiheng Lyu and Jiaqi Deng and Kai Zou and Ping Nie and Fei Yuan and Xiang Yue and Wenhu Chen},
  year={2025},
  eprint={2510.23642},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2510.23642},
}

@article{ni2025viscoder,
  title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
  author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
  journal={arXiv preprint arXiv:2506.03930},
  year={2025}
}