ArcGPT — a 7B LLM and AMBLE benchmark built for real archival tasks

July 27, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

ArcGPT shows clear domain gains for classification on AMBLE but is a research prototype with limited public artifacts and subpar OCR correction; production use needs further validation and specialist components.

Citations: 2

Evidence Strength: 0.60

Confidence: 0.78

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 40%

Production readiness: 40%

Novelty: 50%

Authors

Shitou Zhang, Jingrui Hou, Siyuan Peng, Zuchao Li, Qibiao Hu, Ping Wang

Why It Matters For Business

ArcGPT and AMBLE let archives and data teams automate labeling and access decisions using a model trained on archive language; expect faster triage, but cross-check critical classifications against a predictive model.

Summary TLDR

This paper introduces ArcGPT, a 7-billion-parameter language model pretrained on archival documents, and AMBLE, a four-task benchmark for archival work (retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction). On AMBLE's three classification tasks, ArcGPT scores roughly 84 to 94 F1, ahead of other generative LLMs but behind specialized predictive models; on post-OCR correction it underperforms specialist systems. The dataset was built from records obtained from a Chinese administrative archive and annotated by archival students.

Problem Statement

Archives contain huge volumes of domain-specific documents that are costly to process manually. Generic LLMs struggle with archival jargon, historical phrasing, and archival workflows. There was no public archival-domain LLM or benchmark to measure progress on archive-specific tasks.

Main Contribution

ArcGPT: a 7B model pretrained on large archival-domain corpora to handle archival language and contexts.

AMBLE: a multi-task archival benchmark covering retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction.

Key Findings

ArcGPT achieves strong classification performance on archival label tasks.

Numbers: F1 = 84.40 (retention), 84.00 (open-access), 94.40 (confidentiality)

Practical Use: Use ArcGPT as a practical starting model for archive classification workflows; expect near-state-of-the-art generative performance, but compare with predictive classifiers for top accuracy.

Evidence Ref: Table 2; Section 5.2.1

Specialized predictive models still outperform ArcGPT on AMBLE classification.

Numbers: RoBERTa-wwm-ext F1 = 88.80, 88.00, 97.20 (three tasks reported)

Practical Use: For production systems where max classification accuracy matters, prefer or ensemble strong predictive models (RoBERTa variants) with ArcGPT rather than replacing them outright.

Evidence Ref: Section 5.2.1 (comparison statements)

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Retention period prediction F1 (ArcGPT) | F1 = 84.40 | RoBERTa-wwm-ext reported best F1 = 88.80 | ≈ -4.4 F1 vs reported best predictive model | AMBLE test | ArcGPT row, Table 2; Section 5.2.1 | Table 2 |
| Open-access identification F1 (ArcGPT) | F1 = 84.00 | RoBERTa-wwm-ext reported best F1 = 88.00 | ≈ -4.0 F1 vs reported best predictive model | AMBLE test | ArcGPT row, Table 2; Section 5.2.1 | Table 2 |
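
The deltas in the table are simple differences between the reported F1 scores. A minimal sketch of that arithmetic follows, using only the numbers quoted in Key Findings. The task keys are illustrative names, and pairing the third RoBERTa-wwm-ext value (97.20) with confidentiality is an assumption based on the order in which the scores are listed.

```python
# Reported F1 scores copied from Key Findings (AMBLE test split).
# Task keys are illustrative; mapping 97.20 to confidentiality is assumed
# from the order in which the three RoBERTa-wwm-ext scores are listed.
ARCGPT_F1 = {"retention": 84.40, "open_access": 84.00, "confidentiality": 94.40}
ROBERTA_F1 = {"retention": 88.80, "open_access": 88.00, "confidentiality": 97.20}


def f1_deltas(model, baseline):
    """Return model minus baseline per task, rounded to two decimals."""
    return {task: round(model[task] - baseline[task], 2) for task in model}


deltas = f1_deltas(ARCGPT_F1, ROBERTA_F1)
for task, delta in deltas.items():
    print(f"{task}: {delta:+.2f} F1")
```

A negative value means ArcGPT trails the predictive baseline on that task.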

What To Try In 7 Days

Run ArcGPT on a sample of your archival records to auto-label retention, access, and confidentiality flags.

Compare ArcGPT labels to your current rule-based or classifier outputs to find mismatches for focused auditing.
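
The comparison step above could be sketched as follows. This is an illustration under stated assumptions, not part of the paper: the function names, record IDs, and label values are all hypothetical, and both systems are assumed to emit one label per record ID.

```python
from collections import Counter


def find_mismatches(model_labels, current_labels):
    """Return (record_id, model_label, current_label) for every disagreement.

    Both arguments map a record ID to a label. Records missing from either
    side are skipped, since there is nothing to compare.
    """
    shared_ids = sorted(model_labels.keys() & current_labels.keys())
    return [
        (rid, model_labels[rid], current_labels[rid])
        for rid in shared_ids
        if model_labels[rid] != current_labels[rid]
    ]


def summarize(mismatches):
    """Count disagreements per (model_label, current_label) pair."""
    return Counter((m, c) for _, m, c in mismatches)


# Hypothetical sample: record IDs and labels are made up for illustration.
arcgpt = {"r1": "permanent", "r2": "30y", "r3": "10y", "r4": "10y"}
current = {"r1": "permanent", "r2": "10y", "r3": "10y", "r5": "30y"}

mismatches = find_mismatches(arcgpt, current)
print(mismatches)
print(summarize(mismatches))
```

Grouping disagreements by label pair shows which confusions are most common, which helps decide where to focus a manual audit first.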

Use specialist Chinese spelling correction (CSC) models (Mengzi-T5-Base-csc or BART-Large-csc) for OCR cleanup; do not replace them with ArcGPT yet.

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

AMBLE data comes from one administrative archive in China; generalization to other archives or languages is untested.

ArcGPT lags behind strong predictive classifiers on the classification tasks and performs poorly on post-OCR correction without CSC fine-tuning.

When Not To Use

Do not use ArcGPT alone for post-OCR cleanup or high-accuracy OCR pipelines.

Avoid deploying ArcGPT as the sole classifier for legally critical confidentiality decisions without a validated predictive model or human review.

Failure Modes

Hallucinated or incorrect corrections in noisy OCR text leading to corrupted records.

Classification mistakes where specialized predictive models or rule sets outperform ArcGPT.

Core Entities

Models

ArcGPT, BatGPT, ChatGLM-6B, Chinese-LLaMA-Alpaca, BERT-wwm-ext, RoBERTa-wwm-ext, BART-Large-csc, Mengzi-T5-Base-csc

Metrics

Precision, Recall, F1, Levenshtein Distance
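
For readers unfamiliar with these metrics, here is a minimal sketch of their textbook definitions. It is not the paper's evaluation code: this summary does not say how scores are aggregated across classes (for example macro or micro averaging), so the function below covers a single class only, and the function names and sample counts are illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Illustrative counts only; not taken from the paper.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)
print(levenshtein("kitten", "sitting"))  # 3
```

For post-OCR correction, a lower Levenshtein distance between the corrected text and the reference text means fewer remaining character errors.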

Datasets

AMBLE, SIGHAN, Wang271K

Benchmarks

AMBLE