ArcGPT — a 7B LLM and AMBLE benchmark built for real archival tasks

July 27, 2023 · 6 min

Overview

Production Readiness

0.4

Novelty Score

0.5

Cost Impact Score

0.4

Citation Count

2

Authors

Shitou Zhang, Jingrui Hou, Siyuan Peng, Zuchao Li, Qibiao Hu, Ping Wang

Why It Matters For Business

ArcGPT and AMBLE let archives and data teams automate labeling and access decisions with a model trained on archival language. Expect faster triage, but verify critical classifications against a specialized predictive model.

Summary TLDR

This paper introduces ArcGPT, a 7-billion-parameter language model pretrained on archival documents, and AMBLE, a four-task benchmark for archival work (retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction). ArcGPT scores roughly 84 to 94 F1 on AMBLE's three classification tasks, ahead of other generative LLMs but behind specialized predictive models, and it trails specialist correction systems on post-OCR. The dataset was built from records obtained from a Chinese administrative archive and annotated by archival students.

Problem Statement

Archives contain huge volumes of domain-specific documents that are costly to process manually. Generic LLMs struggle with archival jargon, historical phrasing, and archival workflows. There was no public archival-domain LLM or benchmark to measure progress on archive-specific tasks.

Main Contribution

ArcGPT: a 7B model pretrained on large archival-domain corpora to handle archival language and contexts.

AMBLE: a multi-task archival benchmark covering retention period prediction, open-access identification, confidentiality prediction, and post-OCR correction.

Evaluation comparing ArcGPT to several Chinese and bilingual baselines across AMBLE, with detailed metrics and error analysis.

Key Findings

ArcGPT achieves strong classification performance on archival label tasks.

Numbers: F1 = 84.40 (retention), 84.00 (open-access), 94.40 (confidentiality)

Specialized predictive models still outperform ArcGPT on AMBLE classification.

Numbers: RoBERTa-wwm-ext F1 = 88.80 (retention), 88.00 (open-access), 97.20 (confidentiality)

ArcGPT performs poorly on post-OCR correction versus specialist Chinese spelling correction (CSC) models.

Numbers: Levenshtein distance (lower is better): ArcGPT = 38.86 vs. Mengzi-T5-Base-csc = 10.90

Results

Retention period prediction F1 (ArcGPT)

Value: F1 = 84.40

Baseline: RoBERTa-wwm-ext (best reported), F1 = 88.80

Open-access identification F1 (ArcGPT)

Value: F1 = 84.00

Baseline: RoBERTa-wwm-ext (best reported), F1 = 88.00

Confidentiality prediction F1 (ArcGPT)

Value: F1 = 94.40

Baseline: RoBERTa-wwm-ext (best reported), F1 = 97.20

Post-OCR correction (Levenshtein distance, lower is better)

Value: ArcGPT = 38.86

Baseline: Mengzi-T5-Base-csc (best reported) = 10.90
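
The Levenshtein distance used for the post-OCR result counts the minimum number of single-character insertions, deletions, and substitutions needed to turn a model's output into the reference text. The following is a minimal Python sketch, assuming a character-level distance averaged over documents; this summary does not state how the paper tokenizes or aggregates the metric, and the function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between a and b (dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # delete ca
                current[j - 1] + 1,       # insert cb
                previous[j - 1] + cost,   # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def mean_distance(predictions, references):
    """Average edit distance over paired outputs (assumed aggregation)."""
    pairs = list(zip(predictions, references))
    return sum(levenshtein(p, r) for p, r in pairs) / len(pairs)
```

Because the distance is computed per character, it works unchanged on Chinese text, where each character is one unit.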

Who Should Care

What To Try In 7 Days

Run ArcGPT on a sample of your archival records to auto-label retention, access, and confidentiality flags.

Compare ArcGPT labels to your current rule-based or classifier outputs to find mismatches for focused auditing.

Use specialist CSC models (Mengzi-T5-Base-csc or BART-Large-csc) for OCR cleanup; do not replace them with ArcGPT yet.
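
The label comparison suggested above can be sketched in a few lines of Python. This is an illustration under assumptions, not a pipeline from the paper: it assumes you already have ArcGPT's labels and your existing system's labels as parallel lists, and the record IDs and label strings in the usage note are hypothetical.

```python
from collections import Counter


def find_mismatches(record_ids, arcgpt_labels, baseline_labels):
    """Return (record_id, arcgpt_label, baseline_label) for each disagreement."""
    if not (len(record_ids) == len(arcgpt_labels) == len(baseline_labels)):
        raise ValueError("all three inputs must have the same length")
    return [
        (rid, a, b)
        for rid, a, b in zip(record_ids, arcgpt_labels, baseline_labels)
        if a != b
    ]


def summarize(mismatches, total):
    """Return the disagreement rate and the most frequent label-pair conflicts."""
    pair_counts = Counter((a, b) for _, a, b in mismatches)
    rate = len(mismatches) / total if total else 0.0
    return rate, pair_counts.most_common()
```

For example, calling `find_mismatches(["r1", "r2"], ["permanent", "10-year"], ["permanent", "30-year"])` returns only the `r2` row, and `summarize` then shows which label pairs conflict most often so reviewers can audit those first.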

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • AMBLE data comes from one administrative archive in China; generalization to other archives or languages is untested.
  • ArcGPT trails strong predictive classifiers on the classification tasks and, without CSC fine-tuning, performs poorly on post-OCR correction.
  • The paper gives no public code link and few details on pretraining data size or the exact training recipe.

When Not To Use

  • Do not use ArcGPT alone for post-OCR cleanup or high-accuracy OCR pipelines.
  • Avoid deploying ArcGPT as sole classifier for legally critical confidentiality decisions without a validated predictive model or human review.
  • Not suitable when archives are in languages or styles not represented in the training data.

Failure Modes

  • Hallucinated or incorrect corrections in noisy OCR text leading to corrupted records.
  • Classification mistakes where specialized predictive models or rule sets outperform ArcGPT.
  • Overfitting to administrative styles from the source archive, causing poor transfer.

Core Entities

Models

  • ArcGPT
  • BatGPT
  • ChatGLM-6B
  • Chinese-LLaMA-Alpaca
  • BERT-wwm-ext
  • RoBERTa-wwm-ext
  • BART-Large-csc
  • Mengzi-T5-Base-csc

Metrics

  • Precision
  • Recall
  • F1
  • Levenshtein Distance
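
For reference, precision, recall, and F1 for a single class can be computed as in the sketch below. It assumes one class is treated as positive; this summary does not say whether the paper's F1 scores are macro- or micro-averaged across classes, and the label strings in the usage note are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Calling `precision_recall_f1(gold, predicted, "open")` returns values in the range 0 to 1; the scores reported in this summary are on a 0 to 100 scale.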

Datasets

  • AMBLE
  • SIGHAN
  • Wang271K

Benchmarks

  • AMBLE