A COBOL- and mainframe-specialized LLM plus MainframeBench, a benchmark for evaluating modernization tasks

August 5, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The paper shows consistent gains on a new, focused benchmark and provides training-data details, but evaluation is limited to the newly introduced MainframeBench and to generated or curated data.

Citations: 1

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 60%

Novelty: 60%

Authors

Anh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen, Hieu Trung Tran, Phong X. Nguyen, Nghi D. Q. Bui

Links

Abstract / PDF / Code

Why It Matters For Business

A model specialized on COBOL and mainframe docs can cut developer triage and summarization work by producing more accurate summaries and answers about legacy code on evaluated tasks.

Who Should Care

Summary TLDR

XMainframe is a specialized code large language model trained and instruction-tuned for mainframe systems and COBOL. The team built a focused training corpus (~236M tokens), an instruction dataset (53k examples), and MainframeBench (MCQ, QA, COBOL summarization). On MainframeBench, XMainframe variants (7B and a depth-scaled 10.5B) outperform general and other code models on all three tasks, often by large margins in BLEU and accuracy. The repo is linked for code; dataset release is not specified.

Problem Statement

Mainframe systems (COBOL codebases) are business-critical but underrepresented in existing CodeLLMs and benchmarks. Off-the-shelf models lack enough COBOL/mainframe data and proper evaluation suites, making migration, interpretation, and validation of legacy modules hard for teams.

Main Contribution

A domain-specialized code LLM family (XMainframe) focused on COBOL and mainframe knowledge, with base and instruction-tuned variants.

A curated Mainframe-Training corpus (~236M tokens) and Mainframe-Instruct dataset (53,351 instruction examples) for fine-tuning.

Key Findings

Highest multiple-choice accuracy on MainframeBench (XMainframe-Instruct 10.5B).

Numbers: 77.89% accuracy (XMainframe-Instruct 10.5B) vs 73.9% (GPT-4), 74.56% (GPT-3.5), and 53.29% (DeepSeek-Coder-Instruct 33B) on the MCQ split (Table 2).

Practical Use: For automated or assisted decision tasks about mainframe behavior, use XMainframe-Instruct 10.5B to reduce manual checking; on this benchmark it scores about 3 to 25 percentage points higher than the listed baselines (+3.33 over GPT-3.5, +3.99 over GPT-4, +24.60 over DeepSeek-Coder-Instruct 33B).

Evidence Ref: Table 2

Stronger question-answer quality by BLEU and retrieval metrics.

Numbers: QA BLEU-4 of 20.93 for XMainframe-Instruct 10.5B vs 11.39 for Mixtral-Instruct 8x7B and 7.36 for GPT-3.5 (Table 3).

Practical Use: For short factual answers about mainframe code and behavior, XMainframe produces responses closer to references on the evaluated QA set; use it for developer-facing Q&A and initial triage.

Evidence Ref: Table 3
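To make the BLEU-4 figures easier to interpret, here is a minimal sketch of an unsmoothed, single-reference, sentence-level BLEU-4. It is an illustration only: this summary does not state the tokenization, smoothing, or aggregation the paper used, so whitespace tokenization and the function names `ngrams` and `bleu4` are assumptions. The sketch returns a value in [0, 1], while the scores quoted above are on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Unsmoothed sentence-level BLEU-4 against one reference (assumed setup)."""
    cand = candidate.split()  # assumption: plain whitespace tokenization
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / 4)
```

Because the score depends on exact n-gram overlap with one reference, a correct answer phrased differently can still score low, which is one reason the benchmark also reports other metrics.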

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | 77.89% (XMainframe-Instruct 10.5B) | 73.9% (GPT-4); 74.56% (GPT-3.5); 53.29% (DeepSeek-Coder-Instruct 33B) | +3.99 points vs GPT-4; +3.33 vs GPT-3.5; +24.60 vs DeepSeek-Coder-Instruct 33B | MainframeBench - MCQ | Table 2: XMainframe-Instruct 10.5B = 77.89% accuracy | Table 2 |
| Question Answering (BLEU-4) | 20.93 (XMainframe-Instruct 10.5B) | 11.39 (Mixtral-Instruct 8x7B); 7.36 (GPT-3.5) | +9.54 BLEU vs Mixtral; +13.57 vs GPT-3.5 | MainframeBench - QA | Table 3 BLEU-4 scores | Table 3 |
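The deltas can be rechecked directly from the listed scores. The short sketch below does that arithmetic; the dictionary and function names are illustrative, and the numbers are copied from the results above. Note that the smallest MCQ margin is the one against GPT-3.5.

```python
# Scores copied from the results above (MainframeBench, Tables 2 and 3).
MCQ_ACCURACY = {
    "XMainframe-Instruct 10.5B": 77.89,
    "GPT-4": 73.90,
    "GPT-3.5": 74.56,
    "DeepSeek-Coder-Instruct 33B": 53.29,
}
QA_BLEU4 = {
    "XMainframe-Instruct 10.5B": 20.93,
    "Mixtral-Instruct 8x7B": 11.39,
    "GPT-3.5": 7.36,
}


def deltas(scores, target):
    """Return target-minus-baseline differences, rounded to two decimals."""
    return {
        name: round(scores[target] - value, 2)
        for name, value in scores.items()
        if name != target
    }


mcq_deltas = deltas(MCQ_ACCURACY, "XMainframe-Instruct 10.5B")
qa_deltas = deltas(QA_BLEU4, "XMainframe-Instruct 10.5B")
print(mcq_deltas)
print(qa_deltas)
```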

What To Try In 7 Days

Run XMainframe-instruct on a handful of COBOL routines to get developer-facing summaries and compare to existing notes.

Use XMainframe for initial QA on migrated modules to flag likely mismatches before manual review.

Benchmark XMainframe against your current tools on 20 representative COBOL functions to estimate time-savings.
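One possible shape for such a side-by-side check is sketched below. It scores each tool's summary against an existing reference note with a simple token-overlap F1. Everything here is a placeholder: the `token_f1` and `compare_tools` helpers, the two sample routines, and the stand-in "tools" are hypothetical, and no real model API is called. In practice each callable would wrap your own tool or a model endpoint.

```python
from collections import Counter
from statistics import mean


def token_f1(prediction, reference):
    """Token-overlap F1 between two strings (lowercased, whitespace-split)."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compare_tools(samples, tools):
    """Average F1 per tool.

    samples: list of (cobol_source, reference_summary) pairs.
    tools:   mapping of tool name -> callable taking source, returning a summary.
    """
    return {
        name: mean(token_f1(summarize(source), reference)
                   for source, reference in samples)
        for name, summarize in tools.items()
    }


# Hypothetical data and stand-in tools, for illustration only.
samples = [
    ("ADD A TO B GIVING C.", "adds a to b and stores the result in c"),
    ("DISPLAY 'DONE'.", "prints done"),
]
tools = {
    "echo_baseline": lambda source: source,
    "canned_stub": lambda source: "adds a to b and stores the result in c",
}
scores = compare_tools(samples, tools)
print(scores)
```

Token overlap is a crude proxy for summary quality, so it is best paired with a quick human read of a few outputs before drawing conclusions about time savings.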

Optimization Features

Model Optimization
Depth up-scaling to 10.5B from 7B (duplicate+splice layers)
System Optimization
RoPE to extend context window to 16K
Training Optimization
FlashAttention 2 for faster attention
Instruction tuning for three epochs on Mainframe-Instruct
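The "duplicate+splice" idea behind depth up-scaling can be illustrated on a plain list of layer names. This is a conceptual sketch under stated assumptions, not the authors' procedure: the function name `depth_upscale`, the 32-layer base, and the 8-layer trim are placeholders, since this summary does not give the actual layer counts.

```python
def depth_upscale(layers, overlap):
    """Duplicate a layer stack and splice the two copies together.

    The first copy drops its last `overlap` layers, the second copy drops its
    first `overlap` layers, and the remainders are concatenated.
    """
    if not 0 <= overlap <= len(layers):
        raise ValueError("overlap must be between 0 and the number of layers")
    first = layers[:len(layers) - overlap]
    second = layers[overlap:]
    return first + second


# Hypothetical sizes: a 32-layer base with 8 layers trimmed at the seam.
base = [f"layer_{i}" for i in range(32)]
scaled = depth_upscale(base, overlap=8)
print(len(scaled), scaled[22:26])
```

In a real model the duplicated layers would need to be deep-copied so the two halves hold independent weights, and the spliced model is normally trained further before use.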

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

MainframeBench is the paper’s test set; gains are reported only on this benchmark and may not generalize to all real-world mainframe codebases.

Instruction data is partly synthetic and ranked by GPT-4, which can introduce evaluator bias.

When Not To Use

When you need support for non-mainframe languages or broad multi-language code tasks.

For safety-critical migration checks without human verification; automatic outputs should be reviewed.

Failure Modes

Hallucinated or incorrect answers in QA when the prompt requires external execution or full program reasoning.

Overfitting to patterns in the curated COBOL corpus and producing summaries that mirror training phrasing.

Core Entities

Models

XMainframe-base, XMainframe-instruct-7B, XMainframe-instruct-10.5B, DeepSeek-Coder (base), DeepSeek-Coder-Instruct 6.7B, DeepSeek-Coder-Instruct 33B, Mixtral-Instruct 8x7B, Mistral-Instruct 7B, GPT-3.5, GPT-4, Neural-Chat

Metrics

Accuracy, BLEU-4, MAP, F1-Score, BERTScore, RougeL, Meteor

Datasets

Mainframe-Training Dataset, Mainframe-Instruct Dataset, MainframeBench

Benchmarks

MainframeBench