SAFIM: a large, syntax-aware Fill-in-the-Middle benchmark (17.7k examples) that reveals pretraining matters more than size

March 7, 2024 | 7 min

Overview

Production Readiness

0.7

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

0

Authors

Linyuan Gong, Sida Wang, Mostafa Elhoushi, Alvin Cheung

Why It Matters For Business

SAFIM provides a realistic, execution-checked benchmark and tools that let teams measure real code completion quality and avoid misleading comparisons; it shows training data and objectives often matter more than model size for developer-facing features.

Summary TLDR

The authors release SAFIM, a 17,720-example, multi-language Fill-in-the-Middle (FIM) benchmark focused on syntax-aware completions (algorithmic blocks, control-flow expressions, API calls). They introduce five prompt styles and a syntax-aware truncation step that markedly raises first-attempt pass rates and cuts compile errors. A large-scale evaluation (15+ models) shows that pretraining objective and data quality (FIM, repo-level context, execution feedback) explain much of the performance variation, often more than parameter count does.

Problem Statement

Current code benchmarks focus on whole-function or randomly masked spans and can be contaminated by training corpora. This makes it hard to measure an LLM's real ability to fill meaningful, syntax-critical code spans. SAFIM aims to provide a large, syntax-aware, execution-backed FIM benchmark with a controlled data cutoff to reduce contamination, plus fair prompting and post-processing so that diverse models can be compared.

Main Contribution

SAFIM dataset: 17,720 syntax-aware FIM examples across three splits (algorithmic block, control-flow, API call) sourced from Codeforces and GitHub with a post-2022 cutoff to reduce contamination.

Evaluation toolkit: five prompt designs (L2R, PSM, SPM, IPF, 1-shot) and a syntax-aware truncation algorithm that improves automatic scoring.

Large evaluation: pass@1 and compilation-error (CErr%) comparisons across 15+ LLMs revealing that pretraining objective and data quality often beat raw model size.
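To make the prompt designs named above concrete, here is a minimal sketch of how three of them (L2R, PSM, SPM) might assemble a prompt from the code before and after the hole. The sentinel strings and the `build_prompt` helper are hypothetical, not taken from SAFIM; each real model defines its own FIM tokens and exact ordering, and IPF and 1-shot are left out.

```python
# Hypothetical sentinel strings; each real model defines its own FIM tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def build_prompt(style: str, prefix: str, suffix: str) -> str:
    """Return the text a model is asked to continue for one FIM example."""
    if style == "L2R":
        # Left-to-right: the model sees only the code before the hole.
        return prefix
    if style == "PSM":
        # Prefix-Suffix-Middle: prefix, then suffix, then the middle is generated.
        return f"{PRE}{prefix}{SUF}{suffix}{MID}"
    if style == "SPM":
        # Suffix-Prefix-Middle: the suffix is shown first (one plausible layout).
        return f"{SUF}{suffix}{PRE}{prefix}{MID}"
    raise ValueError(f"unsupported style: {style}")


example_prefix = "def add(a, b):\n    return "
example_suffix = "\n"
for style in ("L2R", "PSM", "SPM"):
    print(style, repr(build_prompt(style, example_prefix, example_suffix)))
```

Note that under L2R the model never sees the suffix, which is one reason its output tends to overrun the hole.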

Key Findings

SAFIM is large and mostly execution-evaluable.

Numbers: 17,720 examples; 98.25% have unit tests

Syntax-aware truncation substantially raises Pass@1 and lowers compile errors for many models.

Numbers: CodeLLaMa-13B algo Pass@1 16.4% → 41.4% (+25.0 points); CErr% 64.6% → 10.9% (-53.7 points)
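The idea behind truncation can be illustrated with a much cruder stand-in than the benchmark's own algorithm, which is not reproduced here. The sketch below assumes Python source and uses the standard `ast` module: it keeps the longest line-prefix of the model output for which prefix + completion + suffix still parses. The function name and the example strings are hypothetical.

```python
import ast


def truncate_to_parseable(prefix: str, completion: str, suffix: str) -> str:
    """Keep the longest line-prefix of `completion` that lets the whole file parse."""
    lines = completion.splitlines(keepends=True)
    # Try the full output first, then drop trailing lines one at a time.
    for end in range(len(lines), 0, -1):
        candidate = "".join(lines[:end])
        try:
            ast.parse(prefix + candidate + suffix)
        except SyntaxError:
            continue
        return candidate
    # No cut parses; leave the output unchanged so it is scored as-is.
    return completion


prefix = "def total(xs):\n    s = 0\n    for x in xs:\n"
suffix = "    return s\n"
# The model filled the loop body, then overran into an unfinished statement.
raw = "        s += x\n        if x >\n"
print(repr(truncate_to_parseable(prefix, raw, suffix)))
```

This check only guarantees that the assembled file parses; it does not confirm that the kept text has the syntactic shape expected at the hole, so it should be read as an approximation.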

Pretraining objective and data often matter more than model size.

Numbers: StarCoder (15.5B) avg 55.5% vs GPT-4 avg 53.3%; DeepSeekCoder-33B avg 69.0% vs CodeGen-16B avg 31.0% (∆ 38.0 points)

FIM pretraining helps both FIM and standard left-to-right (L2R) generation.

Numbers: StarCoder L2R 29.3% vs CodeGen-16B L2R 24.6% (similar sizes)

Prompt choice changes measured performance a lot.

Numbers: GPT-3.5 algo Pass@1 varies from 23.2% (L2R) to 31.2% (1S); StarCoder best prompts around 44.1% (PSM/SPM)

Data contamination has small measured impact for evaluated models.

Numbers: New test split (Apr 2023–Jan 2024) shows Pass@1 changes generally within ±6 points (e.g., DeepSeekCoder-33B ∆ +0.91)

Results

Dataset size

Value: 17,720 examples across 3 splits

Baseline: HumanEval-Infilling (164 programs)

Execution-based coverage

Value: 98.25% of examples have unit tests

Best average model (Pass@1)

Value: DeepSeekCoder-33B average 69.0%

Baseline: CodeGen-16B avg 31.0%

Syntax-aware truncation effect (example)

Value: CodeGen-16B algo Pass@1 0.0% → 25.9%

Baseline: No truncation

Prompt sensitivity

Value: Per-model best prompts vary; examples: GPT-3.5 best 1S 31.2%, StarCoder best PSM/SPM 44.1%

Baseline: single-prompt evaluation

Who Should Care

What To Try In 7 Days

Run SAFIM's algorithmic split on your model to gauge real infilling performance.

Add syntax-aware truncation to your post-processing pipeline and report Pass@1 and CErr% before and after.

Evaluate 3 prompt styles (PSM, SPM, 1-shot) per model and pick the best for your deployment.
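Picking a prompt per model, as the last step suggests, amounts to scoring each style separately and keeping the maximum. A minimal sketch, assuming you already have one Pass@1 value per prompt style; the `best_prompt` helper is hypothetical, and the example scores are the GPT-3.5 algorithmic-split figures quoted under Key Findings.

```python
def best_prompt(scores: dict[str, float]) -> tuple[str, float]:
    """Return the (prompt style, Pass@1) pair with the highest Pass@1."""
    return max(scores.items(), key=lambda item: item[1])


# GPT-3.5 algorithmic-split Pass@1 per prompt style, as quoted in this summary.
gpt35_algo = {"L2R": 23.2, "1S": 31.2}
style, score = best_prompt(gpt35_algo)
print(f"best prompt: {style} ({score:.1f}% Pass@1)")
```

Selecting the prompt on the same examples you report on can inflate the result, so a separate selection subset is the safer choice.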

Optimization Features

Training Optimization

  • FIM pretraining
  • repo-level context
  • execution-feedback (self-instruct)

Inference Optimization

  • syntax-aware truncation
  • logits masking for placeholder tokens
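One way to read "logits masking for placeholder tokens" is that the scores of placeholder token ids are forced to negative infinity before decoding, so the model cannot emit them. The sketch below works on a plain list of logits under that assumption; the helper names and the token ids are hypothetical, and a real pipeline would apply the same idea inside its decoding loop.

```python
import math


def mask_placeholder_logits(logits: list[float], placeholder_ids: set[int]) -> list[float]:
    """Return a copy of `logits` with placeholder token ids set to -inf."""
    return [
        -math.inf if token_id in placeholder_ids else value
        for token_id, value in enumerate(logits)
    ]


def greedy_pick(logits: list[float]) -> int:
    """Return the index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Token 1 stands in for a placeholder token that should never be generated.
step_logits = [0.1, 2.5, 1.7]
masked = mask_placeholder_logits(step_logits, {1})
print(greedy_pick(step_logits), greedy_pick(masked))
```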

Reproducibility

Code Available

Data Available

Open Source Status

  • yes

Risks & Boundaries

Limitations

  • No controlled within-family pretraining ablation; cross-family comparisons may conflate architecture, data, and training signals.
  • API function call split uses syntax matching instead of execution due to side effects and dependencies.
  • Some models (CodeLLaMa, DeepSeekCoder) have partial date overlap with SAFIM sources, though ablation shows small effects.

When Not To Use

  • Do not rely solely on SAFIM for security-critical code correctness (it focuses on completion, not vulnerability detection).
  • Avoid using SAFIM as the only benchmark for models trained on identical private corpora without fresh holdout data.

Failure Modes

  • Models produce extra/unbounded code; without syntax-aware truncation this causes false negative evaluations.
  • Hallucinated or incorrect API arguments in API call completion where execution is infeasible.
  • Apparent improvements that reflect prompt matching or truncation tricks rather than true semantic fixes.

Core Entities

Models

  • GPT-3.5
  • GPT-4
  • CodeGen-350M
  • CodeGen-2B
  • CodeGen-6B
  • CodeGen-16B
  • InCoder-1B
  • InCoder-6B
  • CodeLLaMa-7B
  • CodeLLaMa-13B
  • CodeLLaMa-34B
  • StarCoder-15.5B
  • DeepSeekCoder-1.3B
  • DeepSeekCoder-6.7B
  • DeepSeekCoder-33B
  • Mixtral-8x7B
  • Phi-1.5
  • Phi-2
  • WizardCoder-1B
  • WizardCoder-3B
  • WizardCoder-15B
  • WizardCoder-33B
  • Magicoder-6.7B

Metrics

  • Pass@1
  • CErr% (compilation/syntax error %)
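Both metrics reduce to simple ratios over per-example outcomes. A minimal sketch, assuming one completion per example and a hypothetical `Outcome` record that stores whether the assembled program compiled and whether it passed its tests:

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    compiled: bool  # did the assembled program compile or parse?
    passed: bool    # did it pass every unit test?


def pass_at_1(outcomes: list[Outcome]) -> float:
    """Percentage of examples whose single completion passed all tests."""
    return 100.0 * sum(o.passed for o in outcomes) / len(outcomes)


def cerr_pct(outcomes: list[Outcome]) -> float:
    """Percentage of examples whose completion failed to compile."""
    return 100.0 * sum(not o.compiled for o in outcomes) / len(outcomes)


results = [
    Outcome(compiled=True, passed=True),
    Outcome(compiled=True, passed=False),
    Outcome(compiled=False, passed=False),
    Outcome(compiled=True, passed=True),
]
print(f"Pass@1 = {pass_at_1(results):.1f}%, CErr% = {cerr_pct(results):.1f}%")
```

Reporting the two together shows whether a low Pass@1 comes from wrong logic or from output that never compiles.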

Datasets

  • SAFIM
  • HumanEval-Infilling
  • HumanEval
  • The Stack

Benchmarks

  • SAFIM

Context Entities

Models

  • Codex
  • CodeGen
  • InCoder
  • CodeLLaMa
  • StarCoder

Metrics

  • Pass@k
  • Exact match
  • Syntax match
  • Execution-based pass

Datasets

  • Codeforces
  • GitHub

Benchmarks

  • HumanEval
  • APPS
  • HumanEval-Infilling