Overview
ArcGPT shows clear domain gains for classification on AMBLE but is a research prototype with limited public artifacts and subpar OCR correction; production use needs further validation and specialist components.
Citations: 2
Evidence Strength: 0.60
Confidence: 0.78
Risk Signals: 9
Trust Signals
Findings with numeric evidence: 3/3
Findings with evidence refs: 3/3
Results with explicit delta: 4/4
Reproducibility
Status: Partial assets available
Open source: Partial
At A Glance
Cost impact: 40%
Production readiness: 40%
Novelty: 50%
Why It Matters For Business
ArcGPT and AMBLE let archives and data teams automate labeling and access decisions using a model trained on archive language. Expect faster triage, but cross-check critical classifications against a specialized predictive model.
Summary TLDR
This paper introduces ArcGPT, a 7-billion-parameter language model pretrained on archival documents, and AMBLE, a four-task benchmark for archival work (retention period, open-access, confidentiality, post-OCR). ArcGPT improves classification F1 scores on AMBLE’s three label tasks (≈84–94 F1) compared to other generative LLMs, but lags behind specialized predictive models and underperforms specialist post-OCR systems. The dataset was built from records obtained from a Chinese administrative archive and annotated by archival students.
Problem Statement
Archives contain huge volumes of domain-specific documents that are costly to process manually. Generic LLMs struggle with archival jargon, historical phrasing, and archival workflows. There was no public archival-domain LLM or benchmark to measure progress on archive-specific tasks.
Main Contribution
ArcGPT: a 7B model pretrained on large archival-domain corpora to handle archival language and contexts.
AMBLE: a multi-task archival benchmark covering retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction.
Key Findings
ArcGPT achieves strong classification performance on archival label tasks.
Specialized predictive models still outperform ArcGPT on AMBLE classification.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Retention period prediction F1 (ArcGPT) | F1 = 84.40 | RoBERTa-wwm-ext reported best F1 = 88.80 | ≈ -4.4 F1 vs reported best predictive model | AMBLE test | ArcGPT row, Table 2; Section 5.2.1 | Table 2 |
| Open-access identification F1 (ArcGPT) | F1 = 84.00 | RoBERTa-wwm-ext reported best F1 = 88.00 | ≈ -4.0 F1 vs reported best predictive model | AMBLE test | ArcGPT row, Table 2; Section 5.2.1 | Table 2 |
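The Delta column is simply the ArcGPT F1 minus the best reported predictive-model F1. A minimal sketch of that arithmetic, using only the values in the table above (the dictionary keys and the function name `f1_delta` are illustrative, not from the paper):

```python
# F1 values copied from the Results table (AMBLE test split).
# Keys and helper names are hypothetical, chosen for this sketch only.
results = {
    "retention_period": {"arcgpt": 84.40, "best_predictive": 88.80},
    "open_access": {"arcgpt": 84.00, "best_predictive": 88.00},
}


def f1_delta(arcgpt_f1, baseline_f1):
    """Return ArcGPT F1 minus baseline F1, rounded to two decimals."""
    return round(arcgpt_f1 - baseline_f1, 2)


deltas = {
    task: f1_delta(scores["arcgpt"], scores["best_predictive"])
    for task, scores in results.items()
}

for task, delta in deltas.items():
    # A negative delta means ArcGPT trails the predictive baseline.
    print(f"{task}: {delta:+.2f} F1")
```

Both deltas are negative, which matches the finding that specialized predictive models still outperform ArcGPT on these tasks.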
What To Try In 7 Days
Run ArcGPT on a sample of your archival records to auto-label retention, access, and confidentiality flags.
Compare ArcGPT labels to your current rule-based or classifier outputs to find mismatches for focused auditing.
Use specialist Chinese spelling correction (CSC) models, Mengzi-T5 or BART-csc, for OCR cleanup; do not replace them with ArcGPT yet.
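The label comparison in the second step could be sketched as follows. This is only an illustration: it assumes both systems' labels are already available as dictionaries keyed by record ID, and the record IDs, label strings, and function names are hypothetical rather than part of ArcGPT or AMBLE.

```python
from collections import Counter


def find_mismatches(record_ids, arcgpt_labels, existing_labels):
    """List (record_id, arcgpt_label, existing_label) for every disagreement.

    A record missing from either mapping is reported with None for that side.
    """
    mismatches = []
    for record_id in record_ids:
        arcgpt = arcgpt_labels.get(record_id)
        existing = existing_labels.get(record_id)
        if arcgpt != existing:
            mismatches.append((record_id, arcgpt, existing))
    return mismatches


def summarize(mismatches):
    """Count how often each (arcgpt_label, existing_label) pair occurs."""
    return Counter((arcgpt, existing) for _, arcgpt, existing in mismatches)


# Toy data: the retention labels below are made up for illustration.
record_ids = ["r1", "r2", "r3", "r4"]
arcgpt_labels = {"r1": "permanent", "r2": "30y", "r3": "10y", "r4": "permanent"}
existing_labels = {"r1": "permanent", "r2": "10y", "r3": "10y", "r4": "30y"}

mismatches = find_mismatches(record_ids, arcgpt_labels, existing_labels)
summary = summarize(mismatches)
```

Sorting the summary by count shows which label pairs disagree most often, which is a reasonable place to start a focused audit.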
Risks & Boundaries
Limitations
AMBLE data comes from one administrative archive in China; generalization to other archives or languages is untested.
ArcGPT trails strong predictive classifiers on the classification tasks and performs poorly on post-OCR correction without CSC fine-tuning.
When Not To Use
Do not use ArcGPT alone for post-OCR cleanup or high-accuracy OCR pipelines.
Avoid deploying ArcGPT as sole classifier for legally critical confidentiality decisions without a validated predictive model or human review.
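One way to act on the guidance above is an agreement gate: accept a confidentiality label automatically only when ArcGPT and a validated predictive model agree, and send everything else to a person. The sketch below is a hypothetical policy, not something described in the paper; the function name, label strings, and routing outcomes are assumptions.

```python
def route_confidentiality(arcgpt_label, predictive_label):
    """Return (route, label) for one record under a simple agreement policy.

    Hypothetical policy: auto-accept only when both models produced the same
    non-missing label; otherwise escalate to human review with no label set.
    """
    if arcgpt_label is not None and arcgpt_label == predictive_label:
        return ("accept", arcgpt_label)
    return ("human_review", None)


# Toy examples with made-up label strings.
agreed = route_confidentiality("confidential", "confidential")
disagreed = route_confidentiality("public", "confidential")
missing = route_confidentiality(None, None)
```

Stricter variants are possible, for example always escalating records that either model marks as confidential; the right threshold depends on the archive's legal requirements.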
Failure Modes
Hallucinated or incorrect corrections in noisy OCR text leading to corrupted records.
Classification mistakes where specialized predictive models or rule sets outperform ArcGPT.

