Overview
Production Readiness
0.4
Novelty Score
0.5
Cost Impact Score
0.4
Citation Count
2
Why It Matters For Business
ArcGPT and AMBLE let archives and data teams automate labeling and access decisions with a model trained on archival language. Expect faster triage, but check critical classifications against a specialized predictive model.
Summary TLDR
This paper introduces ArcGPT, a 7-billion-parameter language model pretrained on archival documents, and AMBLE, a four-task benchmark for archival work (retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction). ArcGPT scores roughly 84–94 F1 on AMBLE's three classification tasks, ahead of other generative LLMs, but it lags behind specialized predictive classifiers and underperforms specialist post-OCR correction systems. The dataset was built from records of a Chinese administrative archive and annotated by archival students.
Problem Statement
Archives contain huge volumes of domain-specific documents that are costly to process manually. Generic LLMs struggle with archival jargon, historical phrasing, and archival workflows. There was no public archival-domain LLM or benchmark to measure progress on archive-specific tasks.
Main Contribution
ArcGPT: a 7B model pretrained on large archival-domain corpora to handle archival language and contexts.
AMBLE: a multi-task archival benchmark covering retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction.
Evaluation comparing ArcGPT to several Chinese and bilingual baselines across AMBLE, with detailed metrics and error analysis.
Key Findings
ArcGPT achieves strong classification performance on archival label tasks.
Specialized predictive models still outperform ArcGPT on AMBLE classification.
ArcGPT performs poorly on post-OCR correction compared with specialist Chinese spelling correction (CSC) models.
Results
Retention period prediction F1 (ArcGPT)
Open-access identification F1 (ArcGPT)
Confidentiality prediction F1 (ArcGPT)
Post-OCR correction (Levenshtein distance)
Who Should Care
What To Try In 7 Days
Run ArcGPT on a sample of your archival records to auto-label retention, access, and confidentiality flags.
Compare ArcGPT labels to your current rule-based or classifier outputs to find mismatches for focused auditing.
Use specialist CSC models (Mengzi-T5 or BART-csc) for OCR cleanup; do not replace them with ArcGPT yet.
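The label comparison suggested above can be sketched in a few lines of Python. This is a minimal sketch, not ArcGPT tooling: the function name `find_label_mismatches`, the record IDs, and the label strings are all hypothetical, and it assumes both label sources have already been exported as dictionaries keyed by record ID.

```python
from collections import Counter


def find_label_mismatches(arcgpt_labels, baseline_labels):
    """Return records where the two label sources disagree.

    Both arguments map record ID -> label (hypothetical export format).
    Records missing from the baseline are skipped rather than counted.
    """
    mismatches = {}
    for record_id, arc_label in arcgpt_labels.items():
        base_label = baseline_labels.get(record_id)
        if base_label is not None and base_label != arc_label:
            mismatches[record_id] = (arc_label, base_label)
    # Tally (ArcGPT label, baseline label) pairs to show where disagreement clusters.
    pair_counts = Counter(mismatches.values())
    return mismatches, pair_counts


# Hypothetical retention-period labels for three records.
arcgpt = {"rec-001": "permanent", "rec-002": "30-year", "rec-003": "10-year"}
baseline = {"rec-001": "permanent", "rec-002": "10-year", "rec-003": "10-year"}

mismatches, pair_counts = find_label_mismatches(arcgpt, baseline)
for record_id, (arc_label, base_label) in mismatches.items():
    print(f"{record_id}: ArcGPT={arc_label} baseline={base_label}")
```

Sorting the pair tally by frequency is one way to decide which disagreement types to audit first.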
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- AMBLE data comes from one administrative archive in China; generalization to other archives or languages is untested.
- ArcGPT lags behind strong predictive classifiers on the label tasks and performs poorly on post-OCR correction without CSC fine-tuning.
- The paper gives no public code link and provides few details on pretraining data size or the exact training recipe.
When Not To Use
- Do not use ArcGPT alone for post-OCR cleanup or high-accuracy OCR pipelines.
- Avoid deploying ArcGPT as the sole classifier for legally critical confidentiality decisions without a validated predictive model or human review.
- Not suitable when archives are in languages or styles not represented in the training data.
Failure Modes
- Hallucinated or incorrect corrections in noisy OCR text leading to corrupted records.
- Classification mistakes where specialized predictive models or rule sets outperform ArcGPT.
- Overfitting to administrative styles from the source archive, causing poor transfer.
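One way to limit the first failure mode is to reject any model-proposed correction that drifts too far from the OCR input. The sketch below is an illustration under stated assumptions, not something described in the paper: the function name `accept_correction` and the 0.8 threshold are hypothetical, and similarity is measured with the standard library's `difflib.SequenceMatcher` rather than a metric taken from the paper.

```python
from difflib import SequenceMatcher


def accept_correction(ocr_text, corrected_text, min_similarity=0.8):
    """Accept a proposed correction only if it stays close to the OCR input.

    min_similarity is a hypothetical threshold; tune it on held-out records.
    """
    similarity = SequenceMatcher(None, ocr_text, corrected_text).ratio()
    return similarity >= min_similarity


# A one-character fix stays close to the input and is accepted.
small_fix = accept_correction(
    "the retention perlod is 30 years",
    "the retention period is 30 years",
)

# A rewrite that shares nothing with the input is rejected.
large_rewrite = accept_correction("abc", "xyz")

print(small_fix, large_rewrite)
```

Rejected corrections can be routed to human review instead of being written back to the record.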
Core Entities
Models
- ArcGPT
- BatGPT
- ChatGLM-6B
- Chinese-LLaMA-Alpaca
- BERT-wwm-ext
- RoBERTa-wwm-ext
- BART-Large-csc
- Mengzi-T5-Base-csc
Metrics
- Precision
- Recall
- F1
- Levenshtein Distance
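For reference, the listed metrics can be computed as follows. This is a minimal sketch with hypothetical function names and toy labels. It computes precision, recall, and F1 for a single positive class, and plain character-level Levenshtein distance. How the paper averages F1 across classes, or aggregates or normalizes edit distance across documents, is not stated in this summary, so those choices are left out.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


# Toy open-access labels (hypothetical).
gold = ["open", "closed", "open", "open"]
predicted = ["open", "open", "closed", "open"]
precision, recall, f1 = precision_recall_f1(gold, predicted, positive="open")
distance = levenshtein("kitten", "sitting")
```

For post-OCR correction, a lower distance between the system output and the reference text is better.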
Datasets
- AMBLE
- SIGHAN
- Wang271K
Benchmarks
- AMBLE

