Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
0
Why It Matters For Business
SAFIM provides a realistic, execution-checked benchmark and tools that let teams measure real code completion quality and avoid misleading comparisons; it shows training data and objectives often matter more than model size for developer-facing features.
Summary TLDR
The authors release SAFIM, a 17,720-example, multi-language Fill-in-the-Middle (FIM) benchmark focused on syntax-aware completions (algorithmic blocks, control-flow expressions, API calls). They introduce five prompt styles and a syntax-aware truncation step that markedly raises first-attempt pass rates and cuts compile errors. Large-scale evaluation (15+ models) shows pretraining objective and data quality (FIM, repo-level context, execution feedback) explain much of performance variation — often more than parameter count.
Problem Statement
Current code benchmarks focus on whole-function generation or randomly masked spans, and they can be contaminated by training corpora. This makes it hard to measure an LLM's real ability to fill meaningful, syntax-critical code spans. SAFIM aims to provide a large, syntax-aware, execution-backed FIM benchmark, with a controlled data cutoff to reduce contamination and with consistent prompts and post-processing so that diverse models can be compared fairly.
Main Contribution
SAFIM dataset: 17,720 syntax-aware FIM examples across three splits (algorithmic block, control-flow, API call) sourced from Codeforces and GitHub with a post-2022 cutoff to reduce contamination.
Evaluation toolkit: five prompt designs (L2R, PSM, SPM, IPF, 1-shot) and a syntax-aware truncation algorithm that improves automatic scoring.
Large evaluation: pass@1 and compilation-error (CErr%) comparisons across 15+ LLMs revealing that pretraining objective and data quality often beat raw model size.
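The prompt designs differ mainly in how the code before and after the masked span is ordered. The sketch below illustrates three of the five styles (L2R, PSM, SPM) in simplified form. The sentinel strings `<PRE>`, `<SUF>`, and `<MID>`, the `FimSentinels` class, and the `build_prompt` helper are hypothetical placeholders, not SAFIM's toolkit API; each model family defines its own special tokens and exact ordering. IPF and 1-shot are omitted because their templates are not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FimSentinels:
    """Hypothetical placeholder tokens; real models define their own."""
    prefix: str = "<PRE>"
    suffix: str = "<SUF>"
    middle: str = "<MID>"


def build_prompt(style: str, prefix: str, suffix: str,
                 tokens: FimSentinels = FimSentinels()) -> str:
    """Arrange the code around the masked span for one prompt style."""
    if style == "L2R":
        # Left-to-right: the model only sees the code before the gap.
        return prefix
    if style == "PSM":
        # Prefix-Suffix-Middle: prefix first, then suffix, then generate.
        return f"{tokens.prefix}{prefix}{tokens.suffix}{suffix}{tokens.middle}"
    if style == "SPM":
        # Suffix-Prefix-Middle (simplified variant): suffix first, so the
        # generated middle directly continues the prefix.
        return f"{tokens.suffix}{suffix}{tokens.prefix}{prefix}{tokens.middle}"
    raise ValueError(f"unsupported prompt style: {style!r}")


if __name__ == "__main__":
    before = "def add(a, b):\n"
    after = "    return total\n"
    for style in ("L2R", "PSM", "SPM"):
        print(style, repr(build_prompt(style, before, after)))
```

Keeping the sentinels in one small value object makes it straightforward to swap in a given model's real tokens without touching the ordering logic.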
Key Findings
SAFIM is large (17,720 examples) and mostly evaluated by execution; only the API call split falls back to syntax matching.
Syntax-aware truncation substantially raises Pass@1 and lowers compile errors for many models.
Pretraining objective and data often matter more than model size.
FIM pretraining helps both FIM and standard left-to-right (L2R) generation.
Prompt choice substantially changes measured performance.
Data contamination had only a small measured effect on the evaluated models.
Results
Dataset size
Execution-based coverage
Best average model (Pass@1)
Syntax-aware truncation effect (example)
Prompt sensitivity
Who Should Care
What To Try In 7 Days
Run SAFIM's algorithmic split on your model to gauge real infilling performance.
Add syntax-aware truncation to your post-process pipeline and report Pass@1 and CErr% before/after.
Evaluate 3 prompt styles (PSM, SPM, 1-shot) per model and pick the best for your deployment.
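For the before/after report suggested above, a minimal sketch of the bookkeeping might look like this. It assumes one completion per problem and that each completion has already been labeled by your own harness with one of three hypothetical outcome strings (`"pass"`, `"fail"`, `"compile_error"`); the `summarize` function and these labels are illustrative and not part of SAFIM's tooling.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(outcomes: Iterable[str]) -> Dict[str, float]:
    """Return Pass@1 and CErr% as percentages, one outcome per problem."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {
        # With a single sample per problem, Pass@1 is the pass fraction.
        "pass@1": 100.0 * counts["pass"] / total,
        # Share of completions that failed to compile or parse.
        "cerr%": 100.0 * counts["compile_error"] / total,
    }


if __name__ == "__main__":
    # Made-up labels for illustration only, not benchmark results.
    raw = ["compile_error", "fail", "pass", "compile_error"]
    truncated = ["pass", "fail", "pass", "compile_error"]
    print("before truncation:", summarize(raw))
    print("after truncation: ", summarize(truncated))
```

Reporting both numbers side by side makes it visible whether a Pass@1 gain comes from fewer compile errors or from more semantically correct completions.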
Optimization Features
Training Optimization
- FIM pretraining
- repo-level context
- execution-feedback (self-instruct)
Inference Optimization
- syntax-aware truncation
- logits masking for placeholder tokens
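To illustrate the idea behind syntax-aware truncation, here is a simplified stand-in that uses Python's standard `ast` module. It is not the authors' algorithm and it only handles Python source. It drops trailing lines from a completion until the reassembled program parses. The `truncate_to_parseable` name and the choice of keeping the longest parseable candidate are assumptions made for this sketch.

```python
import ast
from typing import Optional


def truncate_to_parseable(prefix: str, generated: str,
                          suffix: str) -> Optional[str]:
    """Return the longest leading run of lines from `generated` such that
    prefix + candidate + suffix parses as Python, or None if none does."""
    lines = generated.splitlines(keepends=True)
    # Try the full completion first, then drop one trailing line at a time.
    for end in range(len(lines), 0, -1):
        candidate = "".join(lines[:end])
        try:
            ast.parse(prefix + candidate + suffix)
        except SyntaxError:
            continue  # Still broken; try a shorter candidate.
        return candidate
    return None


if __name__ == "__main__":
    prefix = "def f(x):\n"
    suffix = "    return y\n"
    # The model overran the gap and stopped mid-expression.
    generated = "    y = x + 1\n    z = (y *\n"
    print(repr(truncate_to_parseable(prefix, generated, suffix)))
```

This heuristic only guarantees that the result parses. It does not check that the kept span has the intended syntactic type (block, expression, or call), which a parse-tree-based approach can enforce.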
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- No controlled within-family pretraining ablation; cross-family comparisons may conflate architecture, data, and training signals.
- API function call split uses syntax matching instead of execution due to side effects and dependencies.
- Some models (CodeLLaMa, DeepSeekCoder) have partial date overlap with SAFIM sources, though ablation shows small effects.
When Not To Use
- Do not rely solely on SAFIM for security-critical code correctness (it focuses on completion, not vulnerability detection).
- Avoid using SAFIM as the only benchmark for models trained on identical private corpora without fresh holdout data.
Failure Modes
- Models generate extra or unbounded code past the intended span; without syntax-aware truncation, otherwise correct completions are scored as failures (false negatives).
- Hallucinated or incorrect API arguments in API call completion where execution is infeasible.
- Apparent improvements that reflect prompt matching or truncation tricks rather than true semantic fixes.
Core Entities
Models
- GPT-3.5
- GPT-4
- CodeGen-350M
- CodeGen-2B
- CodeGen-6B
- CodeGen-16B
- InCoder-1B
- InCoder-6B
- CodeLLaMa-7B
- CodeLLaMa-13B
- CodeLLaMa-34B
- StarCoder-15.5B
- DeepSeekCoder-1.3B
- DeepSeekCoder-6.7B
- DeepSeekCoder-33B
- Mixtral-8x7B
- Phi-1.5
- Phi-2
- WizardCoder-1B
- WizardCoder-3B
- WizardCoder-15B
- WizardCoder-33B
- Magicoder-6.7B
Metrics
- Pass@1
- CErr% (compilation/syntax error %)
Datasets
- SAFIM
- HumanEval-Infilling
- HumanEval
- The Stack
Benchmarks
- SAFIM
Context Entities
Models
- Codex
- CodeGen
- InCoder
- CodeLLaMa
- StarCoder
Metrics
- Pass@k
- Exact match
- Syntax match
- Execution-based pass
Datasets
- Codeforces
- GitHub
Benchmarks
- HumanEval
- APPS
- HumanEval-Infilling

