A COBOL- and mainframe-specialized LLM plus a MainframeBench to evaluate modernization tasks

August 5, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.7

Citation Count

1

Authors

Anh T. V. Dau, Hieu Trung Dao, Anh Tuan Nguyen, Hieu Trung Tran, Phong X. Nguyen, Nghi D. Q. Bui

Links

Abstract / PDF

Why It Matters For Business

A model specialized on COBOL and mainframe documentation can reduce developer triage and summarization work: on the evaluated tasks, it produces more accurate summaries and answers about legacy code than general-purpose models.

Summary TLDR

XMainframe is a specialized code large language model trained and instruction-tuned for mainframe systems and COBOL. The team built a focused training corpus (~236M tokens), an instruction dataset (53k examples), and MainframeBench (MCQ, QA, COBOL summarization). On MainframeBench, XMainframe variants (7B and a depth-scaled 10.5B) outperform general and other code models on all three tasks, often by large margins in BLEU and accuracy. The repo is linked for code; dataset release is not specified.

Problem Statement

Mainframe systems (COBOL codebases) are business-critical but underrepresented in existing CodeLLMs and benchmarks. Off-the-shelf models lack enough COBOL/mainframe data and proper evaluation suites, making migration, interpretation, and validation of legacy modules hard for teams.

Main Contribution

A domain-specialized code LLM family (XMainframe) focused on COBOL and mainframe knowledge, with base and instruction-tuned variants.

A curated Mainframe-Training corpus (~236M tokens) and Mainframe-Instruct dataset (53,351 instruction examples) for fine-tuning.

MainframeBench: a benchmark with three subtasks—Multiple Choice Questions, Question Answering, and COBOL code summarization.

A depth up-scaling recipe to create a 10.5B XMainframe model from a 7B DeepSeek-Coder checkpoint.

Empirical evaluation showing XMainframe outperforms several general and code LLMs on the new benchmark.

Key Findings

Highest multiple-choice accuracy on MainframeBench (XMainframe-Instruct 10.5B).

Numbers: 77.89% accuracy (XMainframe-Instruct 10.5B) vs 73.9% (GPT-4) and 53.29% (DeepSeek-Coder-Instruct 33B) on the MCQ split (Table 2).

Stronger question-answer quality by BLEU and retrieval metrics.

Numbers: QA BLEU-4 of 20.93 for XMainframe-Instruct 10.5B vs 11.39 for Mixtral-Instruct 8x7B and 7.36 for GPT-3.5 (Table 3).

Large win on COBOL code summarization.

Numbers: Summarization BLEU-4 = 62.58 (XMainframe-Instruct 10.5B) vs 11.37 (GPT-3.5) and 7.42 (GPT-4) on MainframeBench (Table 4).

Focused training data scale and composition.

Numbers: 33,561 COBOL files → 228M tokens; combined training corpus ~236M tokens (Section 3.1).

Results

Multiple-choice accuracy

Value: 77.89% (XMainframe-Instruct 10.5B)

Baseline: 73.9% (GPT-4); 74.56% (GPT-3.5); 53.29% (DeepSeek-Coder-Instruct 33B)

Question Answering (BLEU-4)

Value: 20.93 (XMainframe-Instruct 10.5B)

Baseline: 11.39 (Mixtral-Instruct 8x7B); 7.36 (GPT-3.5)

COBOL summarization (BLEU-4)

Value: 62.58 (XMainframe-Instruct 10.5B)

Baseline: 11.37 (GPT-3.5); 7.42 (GPT-4)

Training corpus size

Value: 236 million tokens (total); 228M tokens from 33,561 COBOL files
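To make the BLEU-4 figures above easier to interpret, here is a minimal sketch of how a sentence-level BLEU-4 score can be computed: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It assumes whitespace tokenization, a single reference, and no smoothing; the paper's exact tokenizer and scoring settings are not stated in this summary, so this is an illustration and not the authors' evaluation script. It returns a value between 0 and 1, whereas the scores above appear to be on a 0 to 100 scale.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate: str, reference: str) -> float:
    """Unsmoothed single-reference BLEU-4 on whitespace tokens (0.0 to 1.0)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, 5):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / 4)


# Made-up example strings, not taken from MainframeBench.
reference = "the program reads a file and writes a report"
candidate = "the program reads a file and"
print(round(bleu4(candidate, reference), 2))  # 0.61
```

In the example every candidate n-gram appears in the reference, so the score is driven entirely by the brevity penalty for the shorter candidate.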

Who Should Care

What To Try In 7 Days

Run XMainframe-instruct on a handful of COBOL routines to get developer-facing summaries and compare to existing notes.

Use XMainframe for initial QA on migrated modules to flag likely mismatches before manual review.

Benchmark XMainframe against your current tools on 20 representative COBOL functions to estimate time-savings.
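A side-by-side comparison like the ones suggested above could be organized as a small scoring harness. The sketch below scores multiple-choice style checks by exact match; the item contents, the tool names, and the `answer_fn` callables are hypothetical stand-ins for whatever model or tool you actually call, not an XMainframe API. For summaries, the exact-match check would be swapped for a text-overlap metric.

```python
from typing import Callable, Dict, List, Tuple

# Each item is (question text, correct option letter). Placeholder content.
Item = Tuple[str, str]


def accuracy(items: List[Item], answer_fn: Callable[[str], str]) -> float:
    """Fraction of items where the tool's answer matches the gold letter."""
    if not items:
        raise ValueError("need at least one item")
    correct = sum(
        answer_fn(question).strip().upper() == gold.strip().upper()
        for question, gold in items
    )
    return correct / len(items)


def compare(items: List[Item],
            tools: Dict[str, Callable[[str], str]]) -> Dict[str, float]:
    """Score every tool on the same items so results are comparable."""
    return {name: accuracy(items, fn) for name, fn in tools.items()}


items = [
    ("Which COBOL division holds executable statements? A) DATA B) PROCEDURE", "B"),
    ("Which clause defines a field's format? A) PIC B) MOVE", "A"),
]

# Stub tools for illustration; replace with real model or tool calls.
tools = {
    "always_a": lambda question: "A",
    "always_b": lambda question: "b ",
}

print(compare(items, tools))  # {'always_a': 0.5, 'always_b': 0.5}
```

Running every tool over the same fixed item set keeps the comparison fair; with only around 20 items, differences of one or two answers should be read as rough signals.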

Optimization Features

Model Optimization

  • Depth up-scaling to 10.5B from 7B (duplicate+splice layers)
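The duplicate-and-splice idea can be sketched on a plain list of layers: copy the stack, drop some layers from the end of the first copy and the same number from the start of the second, then concatenate. The layer count and the number of dropped layers below are illustrative assumptions, not values taken from this summary, and `depth_upscale` is a hypothetical helper, not a function from the XMainframe repository.

```python
from typing import List, TypeVar

T = TypeVar("T")


def depth_upscale(layers: List[T], drop: int) -> List[T]:
    """Duplicate a layer stack and splice the two copies together.

    The first copy loses its last `drop` layers, the second copy loses its
    first `drop` layers, and the two remainders are concatenated.
    """
    if not 0 <= drop < len(layers):
        raise ValueError("drop must satisfy 0 <= drop < len(layers)")
    head = layers[: len(layers) - drop]  # copy 1 without its tail
    tail = layers[drop:]                 # copy 2 without its head
    # With real model layers, deep-copy the duplicated ones so the two
    # halves do not share weights during continued training.
    return head + tail


# Assumed sizes for illustration only.
base_layers = [f"layer_{i}" for i in range(32)]
scaled_layers = depth_upscale(base_layers, drop=8)
print(len(scaled_layers))  # 48
```

The overlapping middle layers appear twice in the result, which is what increases the depth; the spliced model typically needs further training to recover quality at the seam.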

System Optimization

  • RoPE to extend context window to 16K

Training Optimization

  • FlashAttention 2 for faster attention
  • Instruction tuning for three epochs on Mainframe-Instruct

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • MainframeBench is the paper’s test set; gains are reported only on this benchmark and may not generalize to all real-world mainframe codebases.
  • Instruction data is partly synthetic and ranked by GPT-4, which can introduce evaluator bias.
  • Public release status of the training and instruction datasets is not confirmed in the paper.
  • No reported live production or longitudinal deployments to validate developer workflow improvements.

When Not To Use

  • When you need support for non-mainframe languages or broad multi-language code tasks.
  • For safety-critical migration checks without human verification; automatic outputs should be reviewed.
  • If you require fully open-source datasets and reproducible training artifacts (dataset release unclear).

Failure Modes

  • Hallucinated or incorrect answers in QA when the prompt requires external execution or full program reasoning.
  • Overfitting to patterns in the curated COBOL corpus and producing summaries that mirror training phrasing.
  • Evaluation bias due to GPT-4 involvement in data ranking and synthetic data generation.

Core Entities

Models

  • XMainframe-base
  • XMainframe-instruct-7B
  • XMainframe-instruct-10.5B
  • DeepSeek-Coder (base)
  • DeepSeek-Coder-Instruct 6.7B
  • DeepSeek-Coder-Instruct 33B
  • Mixtral-Instruct 8x7B
  • Mistral-Instruct 7B
  • GPT-3.5
  • GPT-4
  • Neural-Chat

Metrics

  • Accuracy
  • BLEU-4
  • MAP
  • F1-Score
  • BERTScore
  • RougeL
  • Meteor

Datasets

  • Mainframe-Training Dataset
  • Mainframe-Instruct Dataset
  • MainframeBench

Benchmarks

  • MainframeBench