Large language models (LLMs) often struggle with visualization tasks such as plotting diagrams and charts, where success depends on both code correctness and visual semantics. Existing instruction-tuning datasets lack execution-grounded supervision and offer limited support for iterative code correction, resulting in fragile and unreliable plot generation. We present VisCode-200K, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback, which enable models to revise faulty code using runtime feedback. We fine-tune Qwen2.5-Coder-Instruct on VisCode-200K to create VisCoder and evaluate it on PandasPlotBench. VisCoder significantly outperforms strong open-source baselines and approaches the performance of proprietary models such as GPT-4o-mini. We further adopt a self-debug evaluation protocol to assess iterative repair, demonstrating the benefits of feedback-driven learning for executable, visually accurate code generation.
VisCode-200K is a supervised instruction tuning dataset for Python-based visualization and feedback-driven code correction. It integrates two complementary sources of supervision: (1) executable visualization code extracted from open-source Python repositories, covering a wide range of real-world chart types, layouts, and plotting libraries, and filtered to ensure runtime validity and compatibility with standard Python environments; and (2) multi-turn Python dialogues from the Code-Feedback dataset, which provide supervision for revising faulty code in response to execution errors. These interactions are critical for modeling realistic correction behaviors in iterative workflows. The full pipeline consists of code filtering, runtime validation, and structured instruction generation.
Data construction pipeline for VisCode-200K. We extract and filter visualization code blocks from open-source Python sources, validate their executability and plot rendering via Jupyter-based runtime checks, and generate structured instructions paired with rendered plots. We integrate multi-turn correction data from Code-Feedback during instruction construction to support iterative refinement.
To build a large corpus of executable Python visualization code, we source data from two open datasets: the Python subset of Stack-Edu and the chart/table partitions of CoSyn-400K. From these corpora, we extract code that uses commonly adopted visualization libraries, including matplotlib, seaborn, and others, to ensure broad coverage of real-world plotting styles. The construction pipeline consists of four stages: library-based filtering, code block extraction, runtime validation, and instruction generation.
From the Stack-Edu dataset, we apply library-based filters to identify approximately 1.7M Python samples that invoke common visualization libraries. Since most examples embed visualization logic within broader program contexts, we use GPT-4o-mini to extract minimal, standalone plotting blocks. During this process, we inject mock data to replace missing inputs and ensure that each block can be executed in isolation. After filtering and reconstruction, we obtain roughly 1M candidate blocks. To balance the library distribution, we retain all seaborn and other non-matplotlib samples and randomly subsample a matching number of matplotlib examples, resulting in a curated subset of ~300K visualization blocks.
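As a rough illustration of the library-based filtering and balanced subsampling described above, the sketch below checks imports with a regular expression and downsamples matplotlib examples to match the remaining libraries. The function names, library list, and input format are illustrative assumptions, not the released pipeline.

```python
import random
import re

# Assumed set of target libraries; the paper names matplotlib, seaborn, and plotly explicitly.
VIZ_LIBS = ("matplotlib", "seaborn", "plotly")

def uses_viz_library(code: str) -> bool:
    """Return True if the snippet imports a commonly used visualization library."""
    pattern = r"^\s*(?:import|from)\s+(?:" + "|".join(VIZ_LIBS) + r")\b"
    return re.search(pattern, code, flags=re.MULTILINE) is not None

def balance_by_library(samples: list[dict], seed: int = 0) -> list[dict]:
    """Keep all non-matplotlib samples and subsample matplotlib to a matching count."""
    mpl = [s for s in samples if s["library"] == "matplotlib"]
    rest = [s for s in samples if s["library"] != "matplotlib"]
    random.seed(seed)
    return rest + random.sample(mpl, k=min(len(mpl), len(rest)))
```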
To verify executability, we run each code block in an isolated Jupyter environment using nbconvert with allow_errors=False. We enforce a timeout and terminate executions that hang or enter infinite loops using a simulated keyboard interrupt. Only samples that run successfully and generate a valid image file are retained. This step yields 105K validated plotting scripts from Stack-Edu and 50K from CoSyn-400K, each paired with its corresponding output image.
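The runtime-validation step can be approximated with nbconvert's programmatic API. The sketch below wraps each block in a single-cell notebook, executes it with allow_errors=False and a timeout, and keeps the sample only if an image file appears in the working directory; the output-file convention and helper name are assumptions, not the exact implementation.

```python
import os
import tempfile

import nbformat
from nbconvert.preprocessors import CellExecutionError, ExecutePreprocessor

def validate_block(code: str, timeout: int = 60) -> bool:
    """Return True if the block executes without error and writes an image file."""
    nb = nbformat.v4.new_notebook(cells=[nbformat.v4.new_code_cell(code)])
    ep = ExecutePreprocessor(timeout=timeout, allow_errors=False, kernel_name="python3")
    with tempfile.TemporaryDirectory() as workdir:
        try:
            ep.preprocess(nb, {"metadata": {"path": workdir}})
        except (CellExecutionError, TimeoutError):
            return False  # execution error, hang, or infinite loop
        # Keep the sample only if a rendered image was produced.
        return any(f.endswith((".png", ".jpg", ".svg")) for f in os.listdir(workdir))
```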
To construct meaningful instructions for visualization code generation, we use GPT-4o to synthesize instruction components based on each validated code block and its corresponding plot. Each instruction includes: (1) a setup description, (2) a data description, (3) a data block preview, (4) a high-level plot description, and (5) a style description. For Stack-Edu, mock data is extracted directly from the code. For CoSyn, we construct a compact preview using the first two rows of the table. The five components are assembled using a fixed template to form a consistent instruction format across all sources.
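A minimal sketch of how the five components might be assembled with a fixed template is shown below; the section wording and field names are assumptions rather than the exact template used for VisCode-200K.

```python
# Hypothetical fixed template for assembling the five instruction components.
INSTRUCTION_TEMPLATE = """\
## Setup
{setup}

## Data
{data_description}

Data preview:
{data_preview}

## Plot
{plot_description}

## Style
{style_description}
"""

def build_instruction(parts: dict[str, str]) -> str:
    """Fill the fixed template with the five generated components."""
    return INSTRUCTION_TEMPLATE.format(
        setup=parts["setup"],
        data_description=parts["data_description"],
        data_preview=parts["data_preview"],
        plot_description=parts["plot_description"],
        style_description=parts["style_description"],
    )
```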
To train models with self-correction capabilities, we incorporate 45K multi-turn dialogues from the Code-Feedback dataset. These dialogues include user instructions, model-generated code, and follow-up turns containing execution feedback or revision prompts.
Starting from 56K dialogues, we filter out samples with excessive length or turn count to ensure training consistency, resulting in a high-quality subset of realistic correction behaviors.
While not specific to visualization, these dialogues offer valuable supervision for teaching models to revise faulty code based on runtime signals. They are integrated into VisCode-200K alongside single-turn samples from Stack-EDU and CoSyn, enabling models to learn both initial generation and multi-turn refinement strategies.
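A simple filter of the kind described above might look as follows; the turn and length thresholds are placeholders, not the values used to build the 45K subset.

```python
def keep_dialogue(messages: list[dict], max_turns: int = 12, max_chars: int = 16_000) -> bool:
    """Keep a multi-turn dialogue only if it stays within turn and length budgets."""
    total_chars = sum(len(m["content"]) for m in messages)
    return len(messages) <= max_turns and total_chars <= max_chars
```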
We present the main experimental results on PandasPlotBench, including overall model comparisons, performance under the self-debug evaluation protocol, error type analysis, and a training data ablation study.
We evaluate VisCoder models against both proprietary and open-source language models to assess executable visualization performance across scales and libraries. The proprietary group includes GPT-4o and GPT-4o-mini. Among open-source baselines, we compare LLaMA-3.2-3B, LLaMA-3.1-8B, Qwen2.5-Instruct, and Qwen2.5-Coder-Instruct at both 3B and 7B scales. VisCoder models are trained on VisCode-200K and fine-tuned using the same instruction tuning setup.
Performance of selected models on the PandasPlotBench benchmark. For each model, we report (1) execution pass rate (Exec Pass), (2) mean visual and task scores (Mean), and (3) the proportion of samples scoring at least 75 (Good). The best-performing model in each scale is shown in bold, and the second best is underlined.
Proprietary models outperform open-source models by a wide margin across all plotting libraries. GPT-4o achieves the highest execution pass rates and the strongest judge-based scores, followed by its lightweight variant GPT-4o-mini. These results indicate more reliable execution and better semantic alignment with task instructions, especially in complex visualization settings. In contrast, open-source models like LLaMA and Qwen2.5-Instruct underperform consistently across all metrics.
Performance varies across plotting libraries. While most models perform reliably on matplotlib and seaborn, results on plotly are significantly lower, especially for open-source models: execution pass rates often drop below 35%, and task and visual scores degrade accordingly. Generated plots frequently fail to reflect the intended semantics or completeness, revealing the challenge posed by plotly's verbose syntax and less frequent API exposure in training corpora.
VisCoder significantly outperforms its Qwen2.5-Coder-Instruct baselines across all libraries. At the 3B scale, it improves execution success and semantic alignment, especially on plotly and seaborn. At 7B, VisCoder even outperforms GPT-4o-mini on these two libraries, while trailing slightly on matplotlib. These gains highlight the impact of domain-specific instruction tuning for visualization code generation.
GPT-4o demonstrates strong self-debugging capabilities, reaching near-perfect execution success with multiple attempts. VisCoder also benefits substantially under this evaluation protocol: VisCoder-7B surpasses a 90% execution pass rate on matplotlib and seaborn, with large gains in task and visual scores across correction rounds. These results show VisCoder’s ability to generalize debugging behaviors learned during training, even without plot-specific correction examples.
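The self-debug evaluation protocol can be summarized as a short loop: execute the generated script, and on failure append the runtime error as a new user turn and ask the model to repair it, for up to three rounds. The sketch below is schematic; generate and execute are placeholders for the model call and the sandboxed runner, and the feedback wording is an assumption.

```python
def self_debug(generate, execute, instruction: str, max_rounds: int = 3):
    """Run the initial generation, then up to `max_rounds` feedback-driven repairs."""
    messages = [{"role": "user", "content": instruction}]
    code = generate(messages)                      # Attempt 0: default generation
    for round_idx in range(max_rounds + 1):
        ok, error = execute(code)                  # run in an isolated environment
        if ok or round_idx == max_rounds:
            return code if ok else None            # success, or give up after the last round
        messages += [
            {"role": "assistant", "content": code},
            {"role": "user", "content": f"The code failed with:\n{error}\nPlease fix it."},
        ]
        code = generate(messages)                  # Attempts 1-3: repair using runtime feedback
```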
To analyze the dynamics of self-debugging, we track execution pass rates over multiple correction rounds by evaluating GPT-4o and GPT-4o-mini as proprietary baselines, alongside VisCoder models at 3B and 7B scales. To isolate the effects of instruction tuning, we also include untuned Qwen2.5-Coder models at matching sizes. The chart below shows execution pass rates from the initial generation (Attempt 0) through three rounds of self-debugging (Attempts 1–3), presented separately for each plotting library.
Execution pass rate across self-debug rounds (Attempts 0–3), shown separately for three plotting libraries. Attempt 0 corresponds to the default output, while Attempts 1–3 represent subsequent correction rounds. Model groups are color-coded, with solid and dashed lines distinguishing paired models. VisCoder models improve consistently across rounds, with VisCoder-7B gradually closing the gap to GPT-4o on seaborn. Y-axis ranges are scaled per subplot to match library-specific score distributions.
Execution pass rates increase steadily over self-debug rounds for most models and libraries, indicating the overall effectiveness of the protocol. The first attempt typically yields the largest improvement, with smaller gains in subsequent rounds. This pattern suggests that a simple retry mechanism informed by execution feedback can recover a substantial portion of initial failures.
Compared to their Qwen2.5-Coder baselines, VisCoder models show smaller per-round gains but consistently achieve higher final performance, indicating that VisCoder tends to generate stronger initial outputs and apply more stable corrections. VisCoder-7B is particularly strong on seaborn, approaching GPT-4o by the final round. Even the strongest model, GPT-4o, does not reach perfect execution after self-debugging: its performance on seaborn plateaus after three rounds, leaving non-trivial failure cases. In contrast, VisCoder-3B stands out among smaller models, outperforming GPT-4o-mini on seaborn and performing competitively elsewhere. Smaller models generally plateau earlier with fewer gains.
To examine the error recovery behavior of VisCoder-7B, we analyze how execution error counts transition before and after self-debugging. The table below summarizes four representative error types, grouped by plotting library. Each entry shows the count before and after debugging (e.g., 15 → 2).
Execution error transitions for VisCoder-7B across four representative error types. Values show changes from the initial to the post-debugging state. Structural issues (e.g., AttributeError) are often resolved, while semantic failures (e.g., KeyError) persist.
VisCoder-7B demonstrates strong self-correction ability on shallow, structural errors. AttributeErrors in seaborn are reduced from 15 to 2, and TypeErrors in plotly from 3 to 1. These errors usually stem from invalid method calls or argument mismatches and are easily identified from diagnostic outputs; VisCoder learns to correct them consistently through retry-based feedback. Semantic failures such as KeyError and ValueError are harder to resolve. On plotly, ValueErrors drop only slightly (29 → 23), while KeyErrors remain unchanged. These errors require dynamic reasoning about data structures, but VisCoder’s retry attempts often rely on the same faulty assumptions. Symbolic corrections alone are insufficient for resolving such semantically grounded failures.
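The error-transition analysis above can be reproduced by extracting the exception class from each captured traceback and counting types before and after self-debugging. The helper below assumes standard Python tracebacks whose final line has the form "ExcName: message"; it is a sketch, not the analysis script used in the paper.

```python
from collections import Counter

def error_type(traceback_text: str) -> str:
    """Return the exception class name from the final line of a traceback."""
    last_line = traceback_text.strip().splitlines()[-1]
    return last_line.split(":", 1)[0].strip()

def error_transitions(before: list[str], after: list[str]) -> dict[str, tuple[int, int]]:
    """Map each error type to its (count before, count after), e.g. {'AttributeError': (15, 2)}."""
    pre = Counter(error_type(tb) for tb in before)
    post = Counter(error_type(tb) for tb in after)
    return {name: (pre[name], post[name]) for name in pre}
```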
To illustrate model behavior across different plotting libraries and demonstrate the effectiveness of self-debugging, we present representative examples from VisCoder-7B. For each library (matplotlib, seaborn, and plotly), we show both successful generations and failure cases recovered through multi-round correction. These cases reflect the model's ability to correct common structural errors such as AttributeError and ValueError, while also highlighting persistent challenges with more semantic failures.
Matplotlib – Successful Generation: The model generates code that executes successfully and produces a plot consistent with the ground truth.
Matplotlib – Self-Debug Recovery: An AttributeError raised during initial generation is corrected in the first debug round, resulting in a valid plot.
Seaborn – Successful Generation: Code executes correctly on the first attempt and produces a semantically aligned plot.
Seaborn – Self-Debug Recovery: An AttributeError is fixed after three rounds of debugging, yielding a corrected and faithful plot.
Plotly – Successful Generation: The model correctly generates and executes a visualization that aligns with expected output.
Plotly – Self-Debug Recovery: A ValueError is corrected in the second debug round, producing a valid final result.
In conclusion, VisCode-200K provides a large-scale instruction tuning dataset for Python visualization code generation, combining executable plotting examples with multi-turn correction dialogues grounded in runtime feedback. To validate its effectiveness, we evaluate VisCoder models on PandasPlotBench using the default setting, and we further propose a self-debug protocol to simulate realistic correction workflows and assess model performance in this extended evaluation mode. Experiments show that VisCoder substantially outperforms strong open-source baselines across execution and alignment metrics and narrows the gap to proprietary models such as GPT-4o-mini. Gains are particularly pronounced in settings that involve complex visualization structures, such as plotly, and iterative correction through self-debugging.
Ablation studies further demonstrate that structurally diverse, executable training data and feedback-driven supervision contribute to more robust performance across plotting libraries.
Looking forward, this work reinforces the importance of domain-specific instruction tuning and multi-turn correction supervision for building robust and semantically grounded visualization-capable models. Future extensions may explore broader plotting libraries, richer correction supervision, and evaluation methods that measure models' abilities to recover from execution errors.
@article{ni2025viscoder,
title={VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation},
author={Ni, Yuansheng and Nie, Ping and Zou, Kai and Yue, Xiang and Chen, Wenhu},
journal={arXiv preprint arXiv:2506.03930},
year={2025}
}