Overview
The paper shows consistent gains on a new, focused benchmark and provides training-data details, but evaluation is limited to the newly introduced MainframeBench and to generated or curated data.
Citations: 1
Evidence Strength: 0.70
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 4/4
Findings with evidence refs: 4/4
Results with explicit delta: 3/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 70%
Production readiness: 60%
Novelty: 60%
Why It Matters For Business
A model specialized on COBOL and mainframe documentation can reduce developer triage and summarization work: on the evaluated tasks, it produces more accurate summaries and answers about legacy code.
Summary TLDR
XMainframe is a specialized code large language model trained and instruction-tuned for mainframe systems and COBOL. The team built a focused training corpus (~236M tokens), an instruction dataset (53k examples), and MainframeBench (MCQ, QA, COBOL summarization). On MainframeBench, XMainframe variants (7B and a depth-scaled 10.5B) outperform general-purpose and other code models on all three tasks, often by large margins in BLEU and accuracy. A code repository is linked; whether the datasets are released is not specified.
Problem Statement
Mainframe systems (COBOL codebases) are business-critical but underrepresented in existing CodeLLMs and benchmarks. Off-the-shelf models lack enough COBOL/mainframe data and proper evaluation suites, making migration, interpretation, and validation of legacy modules hard for teams.
Main Contribution
A domain-specialized code LLM family (XMainframe) focused on COBOL and mainframe knowledge, with base and instruction-tuned variants.
A curated Mainframe-Training corpus (~236M tokens) and Mainframe-Instruct dataset (53,351 instruction examples) for fine-tuning.
Key Findings
Highest multiple-choice accuracy on MainframeBench (XMainframe-Instruct 10.5B).
Stronger question-answer quality by BLEU and retrieval metrics.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Accuracy | 77.89% (XMainframe-Instruct 10.5B) | 73.9% (GPT-4); 74.56% (GPT-3.5); 53.29% (DeepSeek-Coder-Instruct 33B) | +3.99 points vs GPT-4; +3.33 vs GPT-3.5; +24.60 vs DeepSeek-Coder-Instruct 33B | MainframeBench - MCQ | Table 2: XMainframe-Instruct 10.5B = 77.89% accuracy | Table 2 |
| Question Answering (BLEU-4) | 20.93 (XMainframe-Instruct 10.5B) | 11.39 (Mixtral-Instruct 8x7B); 7.36 (GPT-3.5) | +9.54 BLEU vs Mixtral; +13.57 vs GPT-3.5 | MainframeBench - QA | Table 3 BLEU-4 scores | Table 3 |
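
The Delta column can be recomputed directly from the Metric and Baseline columns. The following is a minimal sketch that uses only the scores listed in the table; the variable and function names (`mcq_baselines`, `qa_baselines`, `deltas`) are illustrative and do not come from the paper.

```python
# Scores copied from the Results table above.
# MCQ values are accuracy in percent; QA values are BLEU-4.
xmainframe_mcq = 77.89
mcq_baselines = {
    "GPT-4": 73.90,
    "GPT-3.5": 74.56,
    "DeepSeek-Coder-Instruct 33B": 53.29,
}

xmainframe_qa = 20.93
qa_baselines = {
    "Mixtral-Instruct 8x7B": 11.39,
    "GPT-3.5": 7.36,
}


def deltas(score, baseline_scores):
    """Return score minus each baseline, rounded to two decimals."""
    return {name: round(score - base, 2) for name, base in baseline_scores.items()}


mcq_deltas = deltas(xmainframe_mcq, mcq_baselines)
qa_deltas = deltas(xmainframe_qa, qa_baselines)

for task, result in (("MCQ accuracy", mcq_deltas), ("QA BLEU-4", qa_deltas)):
    for name, delta in result.items():
        print(f"{task} vs {name}: {delta:+.2f}")
```

Reporting one delta per baseline, rather than a single range, makes it easier to see which baseline is closest.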
What To Try In 7 Days
Run XMainframe-Instruct on a handful of COBOL routines to get developer-facing summaries and compare them with existing notes.
Use XMainframe for initial QA on migrated modules to flag likely mismatches before manual review.
Benchmark XMainframe against your current tools on 20 representative COBOL functions to estimate time-savings.
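
A small in-house comparison along these lines could be organized as in the sketch below. It is only an illustration under stated assumptions: each tool is represented by a hypothetical callable that takes COBOL source and returns a summary, and the score is a simple token-overlap F1 against a reference summary, which is a stand-in chosen for brevity and not the BLEU metric used in the paper. The names `token_f1` and `compare_tools` are made up for this example.

```python
from collections import Counter
from statistics import mean


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized, lowercased strings.

    Assumption: this is a rough proxy for summary quality, not BLEU.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compare_tools(samples, tools):
    """Average token F1 per tool.

    samples: list of (cobol_source, reference_summary) pairs.
    tools: dict mapping a tool name to a callable(cobol_source) -> summary.
    """
    return {
        name: mean(token_f1(summarize(src), ref) for src, ref in samples)
        for name, summarize in tools.items()
    }


# Usage with placeholder tools; replace the lambdas with real model calls.
samples = [("MOVE A TO B.", "copies A into B")]
tools = {
    "tool_a": lambda src: "copies A into B",  # hypothetical perfect summary
    "tool_b": lambda src: "",                 # hypothetical empty output
}
scores = compare_tools(samples, tools)
print(scores)
```

Keeping the tools behind plain callables means the same 20 functions and reference summaries can be reused when a tool or model version changes. Time savings would still need to be measured separately, for example by timing manual review with and without the generated summaries.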
Risks & Boundaries
Limitations
MainframeBench is the paper’s test set; gains are reported only on this benchmark and may not generalize to all real-world mainframe codebases.
Instruction data is partly synthetic and ranked by GPT-4, which can introduce evaluator bias.
When Not To Use
When you need support for non-mainframe languages or broad multi-language code tasks.
For safety-critical migration checks without human verification; automatic outputs should be reviewed.
Failure Modes
Hallucinated or incorrect answers in QA when the prompt requires external execution or full program reasoning.
Overfitting to patterns in the curated COBOL corpus and producing summaries that mirror training phrasing.

