SAFIM: a large, syntax-aware Fill-in-the-Middle benchmark (17.7k examples) that reveals pretraining matters more than size

Overview

Decision Snapshot: Ready For Pilot

SAFIM is ready for practical benchmarking: it ships execution-based tests together with standardized prompts and post-processing. Its conclusions about pretraining paradigms rest on cross-model comparisons rather than controlled pretraining experiments, so they are best read as correlational.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SAFIM provides a realistic, execution-checked benchmark and tools that let teams measure real code completion quality and avoid misleading comparisons; it shows training data and objectives often matter more than model size for developer-facing features.

Who Should Care

ML Engineer, Engineering Lead, Product Manager, CTO, Data Scientist

Summary TLDR

The authors release SAFIM, a 17,720-example, multi-language Fill-in-the-Middle (FIM) benchmark focused on syntax-aware completions (algorithmic blocks, control-flow expressions, API calls). They introduce five prompt styles and a syntax-aware truncation step that markedly raises first-attempt pass rates and cuts compile errors. A large-scale evaluation (15+ models) shows that pretraining objective and data quality (FIM, repo-level context, execution feedback) explain much of the performance variation, often more than parameter count does.

Problem Statement

Current code benchmarks focus on whole-function generation or randomly masked spans, and they can be contaminated by overlap with training corpora. This makes it hard to measure an LLM's real ability to fill meaningful, syntax-critical code spans. SAFIM aims to provide a large, syntax-aware, execution-backed FIM benchmark that uses a controlled data cutoff to reduce contamination and applies the same prompts and post-processing to every model so that diverse models can be compared fairly.

Main Contribution

SAFIM dataset: 17,720 syntax-aware FIM examples across three splits (algorithmic block, control-flow, API call) sourced from Codeforces and GitHub with a post-2022 cutoff to reduce contamination.

Evaluation toolkit: five prompt designs (L2R, PSM, SPM, IPF, 1-shot) and a syntax-aware truncation algorithm that improves automatic scoring.
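The prompt designs differ mainly in how the code before and after the gap is ordered for the model. The sketch below illustrates three of them (L2R, PSM, SPM) under stated assumptions: the sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the function name `build_prompt` are hypothetical, real models define their own special tokens, and the SPM ordering shown is only one simple variant. IPF and 1-shot are left out because this summary does not describe their templates.

```python
# Hypothetical sentinel strings; each real model defines its own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_prompt(style: str, prefix: str, suffix: str) -> str:
    """Arrange the code before and after the gap into a single prompt string."""
    if style == "L2R":
        # Left-to-right: the model sees only the prefix; the suffix is dropped.
        return prefix
    if style == "PSM":
        # Prefix-Suffix-Middle: the model generates the middle after MID.
        return f"{PRE}{prefix}{SUF}{suffix}{MID}"
    if style == "SPM":
        # Suffix-Prefix-Middle (one simple ordering, assumed here).
        return f"{SUF}{suffix}{PRE}{prefix}{MID}"
    raise ValueError(f"unsupported prompt style: {style}")


if __name__ == "__main__":
    prefix = "def add(a, b):\n"
    suffix = "\nprint(add(1, 2))\n"
    for style in ("L2R", "PSM", "SPM"):
        print(style, repr(build_prompt(style, prefix, suffix)))
```

Comparing models fairly means trying each style per model, since a model pretrained with one ordering may do poorly when prompted with another.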

Key Findings

SAFIM is large and mostly execution-evaluable.

Numbers: 17,720 examples; 98.25% have unit tests

Practical Use: Use SAFIM for realistic FIM testing because most examples support execution-based correctness.

Evidence Ref: Abstract, Sec.3.3, Table 5

Syntax-aware truncation substantially raises Pass@1 and lowers compile errors for many models.

Numbers: CodeLLaMa-13B algo Pass@1 16.4% → 41.4% (+25.0 points); CErr% 64.6% → 10.9% (-53.7 points)

Practical Use: Add syntax-aware truncation to post-processing to reveal true model competence and reduce false negatives from extra output.

Evidence Ref: Table 3, Table 11
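To make the idea of syntax-aware truncation concrete, here is a minimal sketch for Python code only. It is a simplified stand-in, not the authors' algorithm, which this summary does not spell out: it keeps completion lines until the indentation drops below that of the first generated line, then checks with the standard `ast` module that the stitched program parses. The function name `truncate_block` is hypothetical.

```python
import ast


def truncate_block(prefix: str, completion: str, suffix: str) -> str:
    """Cut a completion at the first dedent below its opening indentation.

    Falls back to the untouched completion if the truncated program
    still does not parse.
    """
    lines = completion.splitlines(keepends=True)
    first = next((line for line in lines if line.strip()), None)
    if first is None:
        return completion  # empty or whitespace-only completion
    base_indent = len(first) - len(first.lstrip())

    kept = []
    for line in lines:
        indent = len(line) - len(line.lstrip())
        # A non-blank line indented less than the block start means the
        # model has run past the span it was asked to fill.
        if line.strip() and indent < base_indent:
            break
        kept.append(line)

    candidate = "".join(kept)
    try:
        ast.parse(prefix + candidate + suffix)
    except SyntaxError:
        return completion
    return candidate


if __name__ == "__main__":
    prefix = "def add(a, b):\n"
    suffix = "print(add(1, 2))\n"
    raw = "    total = a + b\n    return total\ndef extra(:\n    pass\n"
    print(truncate_block(prefix, raw, suffix))
```

In this example the raw completion would be scored as a compile error because of the trailing junk, while the truncated one parses and can be run against unit tests. A multi-language benchmark needs a parser per language, so an indentation heuristic like this one would not carry over to brace-delimited languages.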

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Dataset size	17,720 examples across 3 splits	HumanEval-Infilling (164 programs)	≈108x larger than HumanEval-Infilling	—	Sec.3, Table 5	Table 5
Execution-based coverage	98.25% of examples have unit tests	—	—	—	Sec.3.3 (execution-based evaluation)	Sec.3.3

What To Try In 7 Days

Run SAFIM's algorithmic split on your model to gauge real infilling performance.

Add syntax-aware truncation to your post-process pipeline and report Pass@1 and CErr% before/after.

Evaluate 3 prompt styles (PSM, SPM, 1-shot) per model and pick the best for your deployment.
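For the before/after report suggested above, a small helper along these lines may be enough. It is a sketch under assumptions: the outcome labels (`"pass"`, `"fail"`, `"compile_error"`) and the function name `summarize` are hypothetical, and each example is assumed to have exactly one generated completion, so Pass@1 is simply the share of examples that pass.

```python
from collections import Counter


def summarize(outcomes: list[str]) -> dict[str, float]:
    """Return Pass@1 and CErr% (both in percent) for one outcome per example."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "pass@1": 100 * counts["pass"] / total,
        "cerr%": 100 * counts["compile_error"] / total,
    }


if __name__ == "__main__":
    # Made-up outcomes for illustration only, not results from the paper.
    before = ["compile_error", "compile_error", "pass", "fail"]
    after = ["pass", "pass", "pass", "fail"]
    b, a = summarize(before), summarize(after)
    for key in ("pass@1", "cerr%"):
        print(f"{key}: {b[key]:.1f} -> {a[key]:.1f} ({a[key] - b[key]:+.1f} points)")
```

Reporting both metrics side by side shows how much of a Pass@1 gain comes from removing compile errors rather than from better completions.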

Optimization Features

Training Optimization

FIM pretraining, repo-level context, execution feedback (self-instruct)

Inference Optimization

syntax-aware truncation, logits masking for placeholder tokens
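Logits masking can be illustrated in a few lines. The sketch below assumes a greedy decoder and a plain list of per-token scores; the function names and token ids are hypothetical, and the summary does not say how the authors implement the masking in their toolkit.

```python
import math


def mask_logits(logits: list[float], banned_ids: set[int]) -> list[float]:
    """Return a copy of the logits with banned token ids set to -inf."""
    masked = list(logits)
    for token_id in banned_ids:
        masked[token_id] = -math.inf
    return masked


def greedy_pick(logits: list[float]) -> int:
    """Pick the index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


if __name__ == "__main__":
    # Toy vocabulary of three tokens; id 1 stands in for a placeholder token.
    logits = [0.1, 2.5, 1.0]
    print(greedy_pick(logits))                    # unmasked choice
    print(greedy_pick(mask_logits(logits, {1})))  # choice with id 1 banned
```

Setting a score to negative infinity gives the token zero probability after softmax, so the model cannot echo the placeholder back into its completion.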

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/gonglinyuan/safim

Data URLs

https://github.com/gonglinyuan/safim

https://safimbenchmark.com

Risks & Boundaries

Limitations

No controlled within-family pretraining ablation; cross-family comparisons may conflate architecture, data, and training signals.

API function call split uses syntax matching instead of execution due to side effects and dependencies.

When Not To Use

Do not rely solely on SAFIM for security-critical code correctness (it focuses on completion, not vulnerability detection).

Avoid using SAFIM as the only benchmark for models trained on identical private corpora without fresh holdout data.

Failure Modes

Models produce extra/unbounded code; without syntax-aware truncation this causes false negative evaluations.

Hallucinated or incorrect API arguments in API call completion where execution is infeasible.

Core Entities

Models

GPT-3.5, GPT-4, CodeGen-350M, CodeGen-2B, CodeGen-6B, CodeGen-16B, InCoder-1B, InCoder-6B, CodeLLaMa-7B, CodeLLaMa-13B, CodeLLaMa-34B, StarCoder-15.5B, DeepSeekCoder-1.3B, DeepSeekCoder-6.7B, DeepSeekCoder-33B, Mixtral-8x7B, Phi-1.5, Phi-2, WizardCoder-1B, WizardCoder-3B, WizardCoder-15B, WizardCoder-33B, Magicoder-6.7B

Metrics

Pass@1, CErr% (compilation/syntax error %)

Datasets

SAFIM, HumanEval-Infilling, HumanEval, The Stack

Benchmarks

SAFIM

Context Entities

Models

Codex, CodeGen, InCoder, CodeLLaMa, StarCoder

Metrics

Pass@k, Exact match, Syntax match, Execution-based pass

Datasets

Codeforces, GitHub

Benchmarks

HumanEval, APPS, HumanEval-Infilling
