Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.7
Citation Count
1
Why It Matters For Business
A model specialized in COBOL and mainframe documentation can reduce developer triage and summarization effort: on the paper's benchmark tasks, it produces more accurate summaries of, and answers about, legacy code than general-purpose models.
Summary TLDR
XMainframe is a code large language model trained and instruction-tuned for mainframe systems and COBOL. The team built a focused training corpus (~236M tokens), an instruction dataset (53,351 examples), and MainframeBench, a benchmark covering multiple-choice questions (MCQ), question answering (QA), and COBOL summarization. On MainframeBench, the XMainframe variants (7B and a depth-scaled 10.5B) outperform general-purpose and other code models on all three tasks, often by large margins in BLEU and accuracy. A code repository is linked; dataset release is not specified.
Problem Statement
Mainframe systems (COBOL codebases) are business-critical but underrepresented in existing CodeLLMs and benchmarks. Off-the-shelf models lack enough COBOL/mainframe data and proper evaluation suites, making migration, interpretation, and validation of legacy modules hard for teams.
Main Contribution
A domain-specialized code LLM family (XMainframe) focused on COBOL and mainframe knowledge, with base and instruction-tuned variants.
A curated Mainframe-Training corpus (~236M tokens) and Mainframe-Instruct dataset (53,351 instruction examples) for fine-tuning.
MainframeBench, a benchmark with three subtasks: Multiple Choice Questions, Question Answering, and COBOL code summarization.
A depth up-scaling recipe to create a 10.5B XMainframe model from a 7B DeepSeek-Coder checkpoint.
Empirical evaluation showing XMainframe outperforms several general and code LLMs on the new benchmark.
Key Findings
Highest multiple-choice accuracy on MainframeBench (XMainframe-Instruct 10.5B).
Stronger question-answer quality by BLEU and retrieval metrics.
Large win on COBOL code summarization.
Focused training data scale and composition.
Results
Accuracy
Question Answering (BLEU-4)
COBOL summarization (BLEU-4)
Training corpus size
Who Should Care
What To Try In 7 Days
Run XMainframe-Instruct on a handful of COBOL routines to get developer-facing summaries and compare them to existing notes.
Use XMainframe for initial QA on migrated modules to flag likely mismatches before manual review.
Benchmark XMainframe against your current tools on 20 representative COBOL functions to estimate time savings.
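For the benchmarking step, a rough way to compare generated summaries against existing notes is a sentence-level BLEU-4 score. The sketch below is a simplified, single-reference version with crude smoothing and whitespace tokenization; it is not the paper's evaluation script, and the function names `bleu4` and `mean_bleu4` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference, eps=1e-9):
    """Simplified sentence-level BLEU-4 with one reference.

    Assumptions: lowercase whitespace tokenization, and zero n-gram
    overlap replaced by a tiny epsilon instead of a standard smoothing method.
    """
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(overlap, eps) / total))
    # Brevity penalty discourages very short candidates.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / 4)


def mean_bleu4(pairs):
    """Average BLEU-4 over (candidate, reference) pairs."""
    pairs = list(pairs)
    return sum(bleu4(c, r) for c, r in pairs) / len(pairs) if pairs else 0.0
```

Running `mean_bleu4` once per tool over the same 20 routines gives a comparable number for each, though BLEU only measures word overlap with your notes, so a manual read of a few summaries is still worthwhile.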
Optimization Features
Model Optimization
- Depth up-scaling from 7B to 10.5B (duplicate and splice layers)
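The duplicate-and-splice idea can be illustrated on a plain list of layers. This is a minimal sketch of one common depth up-scaling scheme (copy the stack, trim the end of the first copy and the start of the second, then concatenate); the helper name `depth_upscale` and the layer counts used in the example are assumptions for illustration, not figures confirmed by this summary.

```python
import copy


def depth_upscale(layers, overlap_drop):
    """Build a deeper stack by splicing two copies of `layers`.

    The first copy drops its last `overlap_drop` layers and the second
    copy drops its first `overlap_drop` layers, so the result has
    2 * (len(layers) - overlap_drop) layers.
    """
    n = len(layers)
    if not 0 <= overlap_drop < n:
        raise ValueError("overlap_drop must be in [0, len(layers))")
    # Deep copies so the two halves can be trained independently afterwards.
    head = [copy.deepcopy(layer) for layer in layers[: n - overlap_drop]]
    tail = [copy.deepcopy(layer) for layer in layers[overlap_drop:]]
    return head + tail


# Illustrative numbers only: a 32-layer stack with 8 layers trimmed per copy.
base_layers = list(range(32))  # stand-ins for transformer blocks
deeper = depth_upscale(base_layers, overlap_drop=8)
print(len(deeper))  # 48
```

After splicing, the enlarged model normally needs continued pretraining, since the seam between the two copies joins layers that were never adjacent during the original training.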
System Optimization
- RoPE (rotary position embeddings) to extend the context window to 16K tokens
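As background, the basic rotary position embedding rotates each pair of vector dimensions by an angle proportional to the token position. The summary does not say how the 16K window is reached (for example, whether the rotary base or the position scaling is changed), so the sketch below only shows the standard rotation, with the common default base of 10000 and an adjacent-pair layout as assumptions.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a rotary position embedding to one query or key vector.

    Dimensions (0, 1), (2, 3), ... are treated as 2D points and rotated by
    position * base ** (-i / d), where i is the even index of the pair.
    """
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q, k = [1.0, 0.5, -0.3, 2.0], [0.2, -1.0, 0.7, 0.4]
# Attention scores depend only on the relative offset (here 2 in both cases).
near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(near - far) < 1e-9)
```

Because the score depends only on relative position, context-extension methods typically adjust `base` or rescale `position` rather than change the rotation itself.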
Training Optimization
- FlashAttention 2 for faster attention
- Instruction tuning for three epochs on Mainframe-Instruct
Reproducibility
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- MainframeBench is the paper’s test set; gains are reported only on this benchmark and may not generalize to all real-world mainframe codebases.
- Instruction data is partly synthetic and ranked by GPT-4, which can introduce evaluator bias.
- Public release status of the training and instruction datasets is not confirmed in the paper.
- No reported live production or longitudinal deployments to validate developer workflow improvements.
When Not To Use
- When you need support for non-mainframe languages or broad multi-language code tasks.
- For safety-critical migration checks without human verification; automatic outputs should be reviewed.
- If you require fully open-source datasets and reproducible training artifacts (dataset release unclear).
Failure Modes
- Hallucinated or incorrect answers in QA when the prompt requires external execution or full program reasoning.
- Overfitting to patterns in the curated COBOL corpus and producing summaries that mirror training phrasing.
- Evaluation bias due to GPT-4 involvement in data ranking and synthetic data generation.
Core Entities
Models
- XMainframe-base
- XMainframe-instruct-7B
- XMainframe-instruct-10.5B
- DeepSeek-Coder (base)
- DeepSeek-Coder-Instruct 6.7B
- DeepSeek-Coder-Instruct 33B
- Mixtral-Instruct 8x7B
- Mistral-Instruct 7B
- GPT-3.5
- GPT-4
- Neural-Chat
Metrics
- Accuracy
- BLEU-4
- MAP
- F1-Score
- BERTScore
- RougeL
- Meteor
Datasets
- Mainframe-Training Dataset
- Mainframe-Instruct Dataset
- MainframeBench
Benchmarks
- MainframeBench

