ArcGPT — a 7B LLM and AMBLE benchmark built for real archival tasks

Overview

Decision Snapshot: Needs Validation

ArcGPT shows clear domain gains for classification on AMBLE but is a research prototype with limited public artifacts and subpar OCR correction; production use needs further validation and specialist components.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 50%

Authors

Shitou Zhang, Jingrui Hou, Siyuan Peng, Zuchao Li, Qibiao Hu, Ping Wang

Links

Abstract / PDF

Why It Matters For Business

ArcGPT and AMBLE let archives and data teams automate labeling and access decisions with a model trained on archival language. Expect faster triage, but cross-check critical classifications against a predictive model.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

This paper introduces ArcGPT, a 7-billion-parameter language model pretrained on archival documents, and AMBLE, a four-task benchmark for archival work (retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction). On AMBLE's three classification tasks, ArcGPT scores higher F1 than other generative LLMs (roughly 84 to 94 F1) but trails specialized predictive models, and on post-OCR correction it underperforms specialist systems. The dataset was built from records obtained from a Chinese administrative archive and annotated by archival students.

Problem Statement

Archives contain huge volumes of domain-specific documents that are costly to process manually. Generic LLMs struggle with archival jargon, historical phrasing, and archival workflows. There was no public archival-domain LLM or benchmark to measure progress on archive-specific tasks.

Main Contribution

ArcGPT: a 7B model pretrained on large archival-domain corpora to handle archival language and contexts.

AMBLE: a multi-task archival benchmark covering retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction.

Key Findings

ArcGPT achieves strong classification performance on archival label tasks.

Numbers: F1 = 84.40 (retention), 84.00 (open-access), 94.40 (confidentiality)

Practical Use: Use ArcGPT as a practical starting model for archive classification workflows; expect near-state-of-the-art performance among generative models, but compare with predictive classifiers for top accuracy.

Evidence Ref: Table 2; Section 5.2.1

Specialized predictive models still outperform ArcGPT on AMBLE classification.

Numbers: RoBERTa-wwm-ext F1 = 88.80, 88.00, 97.20 (three tasks reported)

Practical Use: For production systems where maximum classification accuracy matters, prefer strong predictive models (RoBERTa variants) or ensemble them with ArcGPT rather than replacing them outright.

Evidence Ref: Section 5.2.1 (comparison statements)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Retention period prediction F1 (ArcGPT)	F1 = 84.40	RoBERTa-wwm-ext reported best F1 = 88.80	-4.40 F1 vs reported best predictive model	AMBLE test	ArcGPT row, Table 2; Section 5.2.1	Table 2
Open-access identification F1 (ArcGPT)	F1 = 84.00	RoBERTa-wwm-ext reported best F1 = 88.00	-4.00 F1 vs reported best predictive model	AMBLE test	ArcGPT row, Table 2; Section 5.2.1	Table 2
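
The deltas in this table are plain differences between the reported F1 scores. A minimal Python sketch of that arithmetic follows, using only the numbers quoted in this summary. The dictionary layout and the `f1_delta` helper are illustrative, and pairing the third RoBERTa-wwm-ext value (97.20) with confidentiality is an assumption based on the order in which the tasks are listed.

```python
# F1 scores as quoted in this summary (attributed to Table 2 of the paper).
# Assumption: the third RoBERTa-wwm-ext value (97.20) is the confidentiality task.
reported_f1 = {
    "retention": {"ArcGPT": 84.40, "RoBERTa-wwm-ext": 88.80},
    "open_access": {"ArcGPT": 84.00, "RoBERTa-wwm-ext": 88.00},
    "confidentiality": {"ArcGPT": 94.40, "RoBERTa-wwm-ext": 97.20},
}


def f1_delta(task):
    """ArcGPT F1 minus the predictive baseline F1, rounded to two decimals."""
    scores = reported_f1[task]
    return round(scores["ArcGPT"] - scores["RoBERTa-wwm-ext"], 2)


deltas = {task: f1_delta(task) for task in reported_f1}
for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f} F1")
```

A negative delta means ArcGPT trails the predictive baseline on that task.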

What To Try In 7 Days

Run ArcGPT on a sample of your archival records to auto-label retention, access, and confidentiality flags.

Compare ArcGPT labels to your current rule-based or classifier outputs to find mismatches for focused auditing.

Use specialist Chinese spelling correction (CSC) models (Mengzi-T5-Base-csc or BART-Large-csc) for OCR cleanup; do not replace them with ArcGPT yet.
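
The comparison step in this list could be sketched as a simple disagreement audit. This is a minimal Python sketch, not a published tool: the helper names `find_mismatches` and `summarize`, the record IDs, and the label values are all hypothetical, and obtaining labels from ArcGPT is left out because no inference code is listed as available.

```python
from collections import Counter


def find_mismatches(model_labels, baseline_labels):
    """Return (record_id, model_label, baseline_label) for every disagreement.

    Records missing from either source are reported with None on that side,
    so coverage gaps show up in the audit too.
    """
    mismatches = []
    for record_id in sorted(set(model_labels) | set(baseline_labels)):
        model = model_labels.get(record_id)
        baseline = baseline_labels.get(record_id)
        if model != baseline:
            mismatches.append((record_id, model, baseline))
    return mismatches


def summarize(mismatches):
    """Count disagreements per (model_label, baseline_label) pair."""
    return Counter((model, baseline) for _, model, baseline in mismatches)


# Hypothetical open-access labels; real label sets depend on your archive.
arcgpt_labels = {
    "rec-001": "open",
    "rec-002": "restricted",
    "rec-003": "open",
    "rec-004": "open",
}
baseline_labels = {"rec-001": "open", "rec-002": "open", "rec-003": "open"}

audit_queue = find_mismatches(arcgpt_labels, baseline_labels)
for record_id, model, baseline in audit_queue:
    print(record_id, "ArcGPT:", model, "baseline:", baseline)
print(summarize(audit_queue).most_common())
```

Sorting the audit queue by the most common disagreement pair is one way to focus reviewer time on systematic differences rather than one-off errors.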

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

AMBLE data comes from one administrative archive in China; generalization to other archives or languages is untested.

ArcGPT lags behind strong predictive classifiers for classification and performs poorly on post-OCR without CSC fine-tuning.

When Not To Use

Do not use ArcGPT alone for post-OCR cleanup or high-accuracy OCR pipelines.

Avoid deploying ArcGPT as sole classifier for legally critical confidentiality decisions without a validated predictive model or human review.
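
One way to act on the second point is to auto-accept a confidentiality label only when ArcGPT and a validated predictive classifier agree, and to send everything else to a person. The Python sketch below is illustrative only: the `Decision` type, the `route_confidentiality` function, the label values, and the 0.95 threshold are assumptions, not anything specified by the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    record_id: str
    label: Optional[str]  # None when a human must decide
    route: str            # "auto" or "human_review"


def route_confidentiality(record_id, llm_label, classifier_label,
                          classifier_confidence, threshold=0.95):
    """Auto-accept only when both models agree and the classifier is confident.

    The threshold is a placeholder; tune it on your own validated data.
    """
    agree = llm_label == classifier_label
    if agree and classifier_confidence >= threshold:
        return Decision(record_id, classifier_label, "auto")
    return Decision(record_id, None, "human_review")


# Hypothetical inputs: labels from ArcGPT and from a predictive classifier.
print(route_confidentiality("rec-101", "confidential", "confidential", 0.99))
print(route_confidentiality("rec-102", "confidential", "public", 0.99))
print(route_confidentiality("rec-103", "public", "public", 0.70))
```

Under this rule, disagreement or low classifier confidence always results in human review, so ArcGPT never decides a legally critical case on its own.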

Failure Modes

Hallucinated or incorrect corrections in noisy OCR text leading to corrupted records.

Classification mistakes where specialized predictive models or rule sets outperform ArcGPT.

Core Entities

Models

ArcGPT, BatGPT, ChatGLM-6B, Chinese-LLaMA-Alpaca, BERT-wwm-ext, RoBERTa-wwm-ext, BART-Large-csc, Mengzi-T5-Base-csc

Metrics

Precision, Recall, F1, Levenshtein Distance
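
For readers less familiar with these metrics, here is a minimal pure-Python sketch of how they are commonly computed. It uses the textbook definitions, not the paper's evaluation code; how the paper averages F1 across classes is not stated in this summary, so the function below scores a single positive class and returns values on a 0 to 1 scale (the summary reports F1 on a 0 to 100 scale).

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0], positive=1))
print(levenshtein("kitten", "sitting"))
```

For post-OCR correction, a lower Levenshtein distance between the corrected text and the reference text means the output needs fewer character edits to match the reference.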

Datasets

AMBLE, SIGHAN, Wang271K

Benchmarks

AMBLE

