A COBOL- and mainframe-specialized LLM plus a MainframeBench to evaluate modernization tasks

Overview

Decision Snapshot: Needs Validation

The paper shows consistent gains on a new, focused benchmark and provides training-data details, but evaluation is limited to the MainframeBench it introduces and relies on generated and curated data.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Anh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen, Hieu Trung Tran, Phong X. Nguyen, Nghi D. Q. Bui

Links

Abstract / PDF / Code

Why It Matters For Business

A model specialized on COBOL and mainframe docs can cut developer triage and summarization work by producing more accurate summaries and answers about legacy code on evaluated tasks.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

XMainframe is a specialized code large language model trained and instruction-tuned for mainframe systems and COBOL. The team built a focused training corpus (~236M tokens), an instruction dataset (53k examples), and MainframeBench (MCQ, QA, COBOL summarization). On MainframeBench, XMainframe variants (7B and a depth-scaled 10.5B) outperform general and other code models on all three tasks, often by large margins in BLEU and accuracy. The repo is linked for code; dataset release is not specified.

Problem Statement

Mainframe systems (COBOL codebases) are business-critical but underrepresented in existing CodeLLMs and benchmarks. Off-the-shelf models lack enough COBOL/mainframe data and proper evaluation suites, making migration, interpretation, and validation of legacy modules hard for teams.

Main Contribution

A domain-specialized code LLM family (XMainframe) focused on COBOL and mainframe knowledge, with base and instruction-tuned variants.

A curated Mainframe-Training corpus (~236M tokens) and Mainframe-Instruct dataset (53,351 instruction examples) for fine-tuning.

Key Findings

Highest multiple-choice accuracy on MainframeBench (XMainframe-Instruct 10.5B).

Numbers: 77.89% accuracy (XMainframe-Instruct 10.5B) vs 73.9% (GPT-4) and 53.29% (DeepSeek-Coder-Instruct 33B) on the MCQ split (Table 2).

Practical Use: For automated or assisted decision tasks about mainframe behavior, use XMainframe-Instruct 10.5B to reduce manual checking; expect roughly 4 to 25 percentage points higher accuracy than GPT-4 and DeepSeek-Coder-Instruct 33B on this benchmark.

Evidence Ref: Table 2

Stronger question-answer quality by BLEU and retrieval metrics.

Numbers: QA BLEU-4 of 20.93 for XMainframe-Instruct 10.5B vs 11.39 for Mixtral-Instruct 8x7B and 7.36 for GPT-3.5 (Table 3).

Practical Use: For short factual answers about mainframe code and behavior, XMainframe produces responses closer to references on the evaluated QA set; use it for developer-facing Q&A and initial triage.

Evidence Ref: Table 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	77.89% (XMainframe-Instruct 10.5B)	73.9% (GPT-4); 74.56% (GPT-3.5); 53.29% (DeepSeek-Coder-Instruct 33B)	+3.33 to +24.60 points vs listed baselines	MainframeBench - MCQ	Table 2: XMainframe-Instruct 10.5B = 77.89% accuracy	Table 2
Question Answering (BLEU-4)	20.93 (XMainframe-Instruct 10.5B)	11.39 (Mixtral-Instruct 8x7B); 7.36 (GPT-3.5)	+9.54 BLEU vs Mixtral; +13.57 vs GPT-3.5	MainframeBench - QA	Table 3 BLEU-4 scores	Table 3
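The deltas can be recomputed directly from the scores listed in the table. The short Python sketch below does only that arithmetic; the dictionary and function names are illustrative and do not come from the paper or its repository.

```python
# Scores copied from the Results table (MainframeBench, Tables 2 and 3).
# Names such as MCQ_BASELINES and deltas() are illustrative only.
MCQ_BASELINES = {
    "GPT-4": 73.90,
    "GPT-3.5": 74.56,
    "DeepSeek-Coder-Instruct 33B": 53.29,
}
QA_BLEU4_BASELINES = {
    "Mixtral-Instruct 8x7B": 11.39,
    "GPT-3.5": 7.36,
}

XMAINFRAME_MCQ_ACCURACY = 77.89  # XMainframe-Instruct 10.5B, MCQ split
XMAINFRAME_QA_BLEU4 = 20.93      # XMainframe-Instruct 10.5B, QA split


def deltas(model_score, baselines):
    """Return {baseline name: model_score - baseline score}, rounded to 2 places."""
    return {name: round(model_score - score, 2) for name, score in baselines.items()}


mcq_deltas = deltas(XMAINFRAME_MCQ_ACCURACY, MCQ_BASELINES)
qa_deltas = deltas(XMAINFRAME_QA_BLEU4, QA_BLEU4_BASELINES)

if __name__ == "__main__":
    for name, delta in mcq_deltas.items():
        print(f"MCQ accuracy vs {name}: +{delta:.2f} points")
    for name, delta in qa_deltas.items():
        print(f"QA BLEU-4 vs {name}: +{delta:.2f}")
```

Accuracy differences are percentage points, not relative improvements; BLEU-4 differences are absolute BLEU points.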

What To Try In 7 Days

Run XMainframe-instruct on a handful of COBOL routines to get developer-facing summaries and compare to existing notes.

Use XMainframe for initial QA on migrated modules to flag likely mismatches before manual review.

Benchmark XMainframe against your current tools on 20 representative COBOL functions to estimate time-savings.
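One way to run the comparison in the steps above is to score each tool's summaries against the notes your team already has. The sketch below is a minimal, assumption-laden harness: it uses a simple whitespace-token F1 as the score (a stand-in, not the exact metric implementation used in the paper), and the function names and sample strings are hypothetical. Model outputs are assumed to have been collected beforehand by whatever inference setup you use.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Whitespace-token F1 between a generated summary and a reference note."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compare_tools(references, outputs_by_tool):
    """Return the mean token F1 per tool over the same list of routines."""
    if not references:
        raise ValueError("need at least one reference note")
    results = {}
    for tool, outputs in outputs_by_tool.items():
        if len(outputs) != len(references):
            raise ValueError(f"{tool}: expected {len(references)} outputs")
        scores = [token_f1(out, ref) for out, ref in zip(outputs, references)]
        results[tool] = sum(scores) / len(scores)
    return results


if __name__ == "__main__":
    # Hypothetical data: one existing note and one summary per tool.
    notes = ["reads the customer file and totals account balances"]
    outputs = {
        "xmainframe": ["reads the customer file and totals balances"],
        "current_tool": ["this program performs processing"],
    }
    for tool, score in compare_tools(notes, outputs).items():
        print(f"{tool}: mean token F1 = {score:.2f}")
```

Token overlap rewards shared wording rather than correctness, so treat the scores as a rough ranking signal and keep a human review of a sample of summaries in the loop.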

Optimization Features

Model Optimization

Depth up-scaling to 10.5B from 7B (duplicate+splice layers)
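The "duplicate+splice" idea can be illustrated on a plain list of layer indices. The sketch below is an assumption-based illustration, not the authors' code: the 32-layer starting depth and the 8-layer overlap are example values chosen for the demonstration, and `depth_upscale` is a hypothetical helper name.

```python
def depth_upscale(layers, overlap):
    """Duplicate a layer stack and splice the two copies into a deeper stack.

    The last `overlap` layers are dropped from the first copy and the first
    `overlap` layers are dropped from the second copy, then the two pieces
    are concatenated. Returns a new list; the input is not modified.
    """
    if not 0 <= overlap < len(layers):
        raise ValueError("overlap must satisfy 0 <= overlap < len(layers)")
    head = layers[: len(layers) - overlap]  # first copy, minus its final layers
    tail = layers[overlap:]                 # second copy, minus its initial layers
    return head + tail


if __name__ == "__main__":
    # Assumed example: a 32-layer base model and an 8-layer overlap.
    base_layers = list(range(32))
    scaled = depth_upscale(base_layers, overlap=8)
    print(len(scaled))  # 48
    print(scaled[20:28])  # the seam between the two copies
```

With real model weights, the layers in the second copy would need to be independent copies (for example, deep copies of the modules) so that continued training can update them separately; the index list here only shows which layers end up where.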

System Optimization

RoPE to extend context window to 16K

Training Optimization

FlashAttention 2 for faster attention

Instruction tuning for three epochs on Mainframe-Instruct

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/FSoft-AI4Code/XMainframe

Risks & Boundaries

Limitations

MainframeBench is the paper’s test set; gains are reported only on this benchmark and may not generalize to all real-world mainframe codebases.

Instruction data is partly synthetic and ranked by GPT-4, which can introduce evaluator bias.

When Not To Use

When you need support for non-mainframe languages or broad multi-language code tasks.

For safety-critical migration checks without human verification; automatic outputs should be reviewed.

Failure Modes

Hallucinated or incorrect answers in QA when the prompt requires external execution or full program reasoning.

Overfitting to patterns in the curated COBOL corpus and producing summaries that mirror training phrasing.

Core Entities

Models

XMainframe-base, XMainframe-instruct-7B, XMainframe-instruct-10.5B, DeepSeek-Coder (base), DeepSeek-Coder-Instruct 6.7B, DeepSeek-Coder-Instruct 33B, Mixtral-Instruct 8x7B, Mistral-Instruct 7B, GPT-3.5, GPT-4, Neural-Chat

Metrics

Accuracy, BLEU-4, MAP, F1-Score, BERTScore, RougeL, Meteor

Datasets

Mainframe-Training Dataset, Mainframe-Instruct Dataset, MainframeBench

Benchmarks

MainframeBench

