SAFIM: a large, syntax-aware Fill-in-the-Middle benchmark (17.7k examples) that reveals pretraining matters more than size

March 7, 2024 · 7 min

Overview

Decision Snapshot: Ready For Pilot

SAFIM is ready for practical benchmarking: it provides execution-based tests plus prompt and post-processing tooling. Its conclusions about pretraining paradigms are supported by cross-model comparisons, not by controlled pretraining ablations.

Citations: 0

Evidence Strength: 0.85

Confidence: 0.85

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 3/5

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 50%

Production readiness: 70%

Novelty: 60%

Authors

Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung

Links

Abstract / PDF / Code / Data

Why It Matters For Business

SAFIM provides a realistic, execution-checked benchmark and tools that let teams measure real code completion quality and avoid misleading comparisons; it shows training data and objectives often matter more than model size for developer-facing features.

Who Should Care

Summary TLDR

The authors release SAFIM, a 17,720-example, multi-language Fill-in-the-Middle (FIM) benchmark focused on syntax-aware completions (algorithmic blocks, control-flow expressions, API calls). They introduce five prompt styles and a syntax-aware truncation step that markedly raises first-attempt pass rates and cuts compile errors. Large-scale evaluation (15+ models) shows pretraining objective and data quality (FIM, repo-level context, execution feedback) explain much of performance variation — often more than parameter count.

Problem Statement

Current code benchmarks focus on whole-function generation or randomly masked spans, and they can be contaminated by training corpora. This makes it hard to measure an LLM's real ability to fill meaningful, syntax-critical code spans. SAFIM aims to provide a large, syntax-aware, execution-backed FIM benchmark with a controlled data cutoff to reduce contamination, plus fair prompting and post-processing so that diverse models can be compared.

Main Contribution

SAFIM dataset: 17,720 syntax-aware FIM examples across three splits (algorithmic block, control-flow, API call) sourced from Codeforces and GitHub with a post-2022 cutoff to reduce contamination.

Evaluation toolkit: five prompt designs (L2R, PSM, SPM, IPF, 1-shot) and a syntax-aware truncation algorithm that improves automatic scoring.
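To make the prompt styles concrete, here is a minimal sketch of how three of them (L2R, PSM, SPM) could be assembled from the code before and after the gap. The sentinel strings `<PRE>`, `<SUF>`, `<MID>` and the `build_prompt` helper are hypothetical; real models define their own special tokens, and the exact ordering of tokens in SPM differs between model families. IPF and 1-shot are omitted because the summary does not describe their templates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sentinels:
    """Hypothetical FIM sentinel tokens; each real model defines its own."""
    prefix: str = "<PRE>"
    suffix: str = "<SUF>"
    middle: str = "<MID>"


def build_prompt(style: str, prefix: str, suffix: str,
                 s: Sentinels = Sentinels()) -> str:
    """Assemble a fill-in-the-middle prompt in one of three layouts."""
    if style == "L2R":
        # Left-to-right: the model only sees the code before the gap.
        return prefix
    if style == "PSM":
        # Prefix-Suffix-Middle: prefix, then suffix, then generate the middle.
        return f"{s.prefix}{prefix}{s.suffix}{suffix}{s.middle}"
    if style == "SPM":
        # Suffix-Prefix-Middle: one possible layout (assumption); the suffix
        # comes first so generation continues directly after the prefix.
        return f"{s.suffix}{suffix}{s.prefix}{prefix}{s.middle}"
    raise ValueError(f"unsupported prompt style: {style}")


if __name__ == "__main__":
    before_gap = "def add(a, b):\n"
    after_gap = "\nprint(add(1, 2))\n"
    for style in ("L2R", "PSM", "SPM"):
        print(style, repr(build_prompt(style, before_gap, after_gap)))
```

Keeping the sentinels in one small object makes it straightforward to swap in the tokens a particular model was actually trained with.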

Key Findings

SAFIM is large and mostly execution-evaluable.

Numbers: 17,720 examples; 98.25% have unit tests

Practical Use: Use SAFIM for realistic FIM testing because most examples support execution-based correctness.

Evidence Ref: Abstract, Sec.3.3, Table 5

Syntax-aware truncation substantially raises Pass@1 and lowers compile errors for many models.

Numbers: CodeLLaMa-13B algo Pass@1 16.4% → 41.4% (+25.0); CErr% 64.6% → 10.9% (-53.7)

Practical Use: Add syntax-aware truncation to post-processing to reveal true model competence and reduce false negatives from extra output.

Evidence Ref: Table 3, Table 11
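The idea behind syntax-aware truncation can be illustrated with a simplified stand-in. The sketch below is not SAFIM's algorithm; it assumes Python source and uses the standard `ast` module to drop trailing lines of a completion until the reassembled program parses. The function names are hypothetical.

```python
import ast
from typing import Optional


def parses(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def truncate_to_parsable(prefix: str, completion: str,
                         suffix: str) -> Optional[str]:
    """Keep the longest line-prefix of `completion` that makes
    prefix + completion + suffix parse; None if no prefix works."""
    lines = completion.splitlines(keepends=True)
    for n in range(len(lines), 0, -1):
        candidate = "".join(lines[:n])
        if parses(prefix + candidate + suffix):
            return candidate
    return None


if __name__ == "__main__":
    prefix = "def f(x):\n"
    suffix = "    return y\n"
    # The model kept generating and left an unclosed call on the last line.
    raw = "    y = x + 1\n    print(y\n"
    print(repr(truncate_to_parsable(prefix, raw, suffix)))
```

A parse check only removes output that breaks syntax. It cannot tell whether the surviving code is correct, so execution-based tests are still needed after truncation.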

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Dataset size | 17,720 examples across 3 splits | HumanEval-Infilling (164 programs) | ≈108x larger than HumanEval-Infilling | - | Sec.3, Table 5 | Table 5
Execution-based coverage | 98.25% of examples have unit tests | - | - | - | Sec.3.3 (execution-based evaluation) | Sec.3.3

What To Try In 7 Days

Run SAFIM's algorithmic split on your model to gauge real infilling performance.

Add syntax-aware truncation to your post-process pipeline and report Pass@1 and CErr% before/after.

Evaluate 3 prompt styles (PSM, SPM, 1-shot) per model and pick the best for your deployment.
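The before/after report suggested in the steps above could be tabulated along these lines. This is a minimal sketch assuming one sample per example and a hypothetical outcome label per example (`"pass"`, `"fail"`, or `"compile_error"`); the sample data is made up for illustration and is not taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(outcomes: Iterable[str]) -> Dict[str, float]:
    """Compute Pass@1 and CErr% (both in percent) from per-example outcomes,
    assuming exactly one generated sample per example."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        "pass@1": 100.0 * counts["pass"] / total,
        "cerr%": 100.0 * counts["compile_error"] / total,
    }


if __name__ == "__main__":
    # Illustrative, made-up outcomes for the same four examples.
    before = ["compile_error", "compile_error", "fail", "pass"]
    after = ["pass", "pass", "fail", "compile_error"]

    b, a = summarize(before), summarize(after)
    for metric in ("pass@1", "cerr%"):
        delta = a[metric] - b[metric]
        print(f"{metric}: {b[metric]:.1f} -> {a[metric]:.1f} ({delta:+.1f})")
```

Reporting both metrics side by side shows whether a Pass@1 gain comes from fewer compile errors or from genuinely better completions.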

Optimization Features

Training Optimization
FIM pretraining, repo-level context, execution-feedback (self-instruct)
Inference Optimization
syntax-aware truncation, logits masking for placeholder tokens

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Yes
License: Unknown

Risks & Boundaries

Limitations

No controlled within-family pretraining ablation; cross-family comparisons may conflate architecture, data, and training signals.

API function call split uses syntax matching instead of execution due to side effects and dependencies.

When Not To Use

Do not rely solely on SAFIM for security-critical code correctness (it focuses on completion, not vulnerability detection).

Avoid using SAFIM as the only benchmark for models trained on identical private corpora without fresh holdout data.

Failure Modes

Models produce extra/unbounded code; without syntax-aware truncation this causes false negative evaluations.

Hallucinated or incorrect API arguments in API call completion where execution is infeasible.

Core Entities

Models

GPT-3.5, GPT-4, CodeGen-350M, CodeGen-2B, CodeGen-6B, CodeGen-16B, InCoder-1B, InCoder-6B, CodeLLaMa-7B, CodeLLaMa-13B, CodeLLaMa-34B, StarCoder-15.5B, DeepSeekCoder-1.3B, DeepSeekCoder-6.7B, DeepSeekCoder-33B, Mixtral-8x7B, Phi-1.5, Phi-2, WizardCoder-1B, WizardCoder-3B, WizardCoder-15B, WizardCoder-33B, Magicoder-6.7B

Metrics

Pass@1, CErr% (compilation/syntax error %)

Datasets

SAFIM, HumanEval-Infilling, HumanEval, The Stack

Benchmarks

SAFIM

Context Entities

Models

Codex, CodeGen, InCoder, CodeLLaMa, StarCoder

Metrics

Pass@k, Exact match, Syntax match, Execution-based pass

Datasets

Codeforces, GitHub

Benchmarks

HumanEval, APPS, HumanEval-Infilling