Overview
Production Readiness
0.3
Novelty Score
0.35
Cost Impact Score
0.5
Citation Count
4
Why It Matters For Business
Low information quality in LLM outputs can cause bad decisions, legal risk, and user distrust; measuring and filtering quality reduces downstream risk and saves money on remediation.
Summary TLDR
This paper reviews how data quality, tokenization, and training scale drive trust problems in large language models (LLMs). It proposes a simple, domain‑agnostic information quality score that combines accuracy, consistency and relevance as weighted factors. The authors survey tokenizers, datasets, scaling laws (Chinchilla, Broken Neural Scaling Laws), and practical mitigations such as filtering, de-duplication, human feedback, and retrieval-based checks. No new experiments or code are released.
Problem Statement
LLMs produce useful text but also unreliable, biased or fabricated outputs. The paper argues information quality failures trace to training data issues (noise, bias, tokenization, duplication), scaling choices, and gaps in verification. It proposes a compact, tunable metric to quantify information quality of generated text.
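One way to write the proposed metric, consistent with the description above, is sketched below. The symbols $A$, $C$, $R$ (accuracy, consistency, relevance) and the weights $w_A$, $w_C$, $w_R$ are notation assumed here for illustration; the constraint that the weights are non-negative and sum to 1 is likewise an assumption, not something this summary states.

```latex
\mathrm{IQ} = w_A A + w_C C + w_R R,
\qquad w_A, w_C, w_R \ge 0,
\quad w_A + w_C + w_R = 1
```

Under that assumed normalization, if each component lies in $[0, 1]$ then so does the score, which makes a single threshold easy to interpret.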
Main Contribution
Survey of data-quality drivers of LLM trust: tokenization, bias, duplication, dataset mix, and scaling effects.
Proposal of a simple, linear information-quality metric combining accuracy, consistency and relevance with tunable weights.
Discussion of practical data-preprocessing steps (filtering, deduplication, removal of private information) and existing defenses (RLHF, retrieval, hallucination detectors).
Summary of scaling laws (Chinchilla, Broken Neural Scaling Laws) and their implications for dataset vs model sizing.
Key Findings
LLM information quality can be expressed as a weighted sum of three dimensions: accuracy, consistency, relevance.
Large-scale training and dataset choice matter: GPT-3 is cited as having 175B parameters and being trained on ~570 GB of text.
Commonsense QA performance of LLMs is substantially below humans on cited datasets.
Noisy or duplicated training data degrades model quality; removing duplicates improves performance (cited work).
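The deduplication finding can be illustrated with a minimal sketch. It assumes exact-match deduplication after light normalization (lowercasing and whitespace collapsing); the function names are hypothetical, and this is not necessarily the method used in the cited work. Near-duplicate detection is not covered here.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash the same."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


corpus = ["The cat sat.", "the  cat sat.", "A different sentence."]
print(deduplicate(corpus))
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which matters for large ingestion corpora.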
Results
Accuracy
Validation perplexity / BLEU on noisy SBNATION data
Model scale example
Who Should Care
What To Try In 7 Days
Implement the paper's simple IQ score (weights for accuracy, consistency, relevance) to triage outputs.
Run a quick audit of tokenization settings and de-duplicate your training/ingestion corpora.
Add a retrieval or browser-check step (search or WebGPT-style) for high-impact queries.
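The first step above could be prototyped roughly as follows. This is a sketch under stated assumptions: the component scores are taken as already computed and scaled to [0, 1], the default weights and the 0.7 threshold are placeholders chosen for illustration, and the names `IQWeights`, `iq_score`, and `triage` are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IQWeights:
    # Placeholder weights for illustration only; tune them per domain.
    accuracy: float = 0.5
    consistency: float = 0.25
    relevance: float = 0.25


def iq_score(accuracy: float, consistency: float, relevance: float,
             weights: IQWeights = IQWeights()) -> float:
    """Weighted sum of three component scores, each expected in [0, 1]."""
    components = (("accuracy", accuracy),
                  ("consistency", consistency),
                  ("relevance", relevance))
    for name, value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = weights.accuracy + weights.consistency + weights.relevance
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = (weights.accuracy * accuracy
                + weights.consistency * consistency
                + weights.relevance * relevance)
    # Dividing by the weight total keeps the score in [0, 1]
    # even if the weights do not sum to 1.
    return weighted / total


def triage(score: float, threshold: float = 0.7) -> str:
    """Route an output: 'pass' at or above the threshold, else 'review'."""
    return "pass" if score >= threshold else "review"


score = iq_score(accuracy=0.9, consistency=0.8, relevance=0.6)
print(round(score, 3), triage(score))
```

How the three component scores are obtained (human labels, automated checks, retrieval-based verification) is left open here; the sketch covers only the combination and thresholding step.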
Optimization Features
Token Efficiency
- Highlights that tokenization choice (BPE, WordPiece, unigram, character-level) affects sequence length and how meaning is represented
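To make the effect on length concrete, the sketch below compares the two extremes, character-level and whitespace word-level splitting, on one sentence. The helper names are hypothetical. Subword schemes such as BPE, WordPiece, and unigram fall between these extremes and need a trained vocabulary, so they are not shown.

```python
def char_tokens(text: str) -> list[str]:
    """Character-level tokenization: one token per character."""
    return list(text)


def word_tokens(text: str) -> list[str]:
    """Whitespace tokenization: one token per space-separated word."""
    return text.split()


text = "Tokenization changes sequence length."
print("characters:", len(char_tokens(text)))
print("words:", len(word_tokens(text)))
```

The same sentence yields far more tokens at the character level, which consumes more of a fixed context window and more compute per sentence.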
Infra Optimization
- Notes high compute and energy costs; suggests sparsely activated experts to reduce cost
Model Optimization
- Mixture of experts (MoE)
Training Optimization
- Emphasizes compute-data scaling trade-offs (Chinchilla compute-optimal regime)
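A back-of-the-envelope sketch of this trade-off follows. It relies on two widely used approximations that are not stated in this summary and should be treated as assumptions: roughly 20 training tokens per parameter in the Chinchilla compute-optimal regime, and training compute of about 6 × parameters × tokens FLOPs. The function names are hypothetical.

```python
def chinchilla_optimal_tokens(n_params: float,
                              tokens_per_param: float = 20.0) -> float:
    """Rule-of-thumb token budget for a model with n_params parameters."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Common approximation of training compute: 6 * N * D FLOPs."""
    return 6.0 * n_params * n_tokens


n = 70e9  # parameter count of a 70B model
d = chinchilla_optimal_tokens(n)
print(f"{d:.2e} tokens, {training_flops(n, d):.2e} FLOPs")
```

Estimates like this are only a starting point for sizing decisions; the exact ratio depends on the fitted scaling law and the data mix.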
Reproducibility
Open Source Status
- unknown
Risks & Boundaries
Limitations
- No experiments or released code; proposals are conceptual and need validation.
- Proposed IQ metric is linear and simple; may miss other quality dimensions (safety, privacy, timeliness).
- Paper depends largely on cited work; limited original empirical evidence.
When Not To Use
- Do not use the paper's IQ metric as a proven filter in safety-critical systems without validation.
- Do not treat the survey as a benchmark of model performance; it summarizes the literature rather than measuring models.
Failure Modes
- IQ weights may be chosen poorly, letting biased or fluent-but-false text pass.
- Tokenization mismatches can introduce unseen subword artifacts and reduce factual accuracy.
- Relying on RLHF alone leaves gaps: costly to maintain and sensitive to label quality.
Core Entities
Models
- GPT-3
- GPT-4
- ChatGPT
- BERT
- LLaMA / Llama2
- Gopher
- PaLM
- BLOOM
- BART
- T5
- GLaM
Metrics
- perplexity
- BLEU
- Accuracy
- negative log-likelihood (loss)
Datasets
- CommonCrawl
- WebText / WebText2
- Books1/Books2
- Wikipedia
- The Pile
- SQuAD
- GLUE
- Reddit corpus
- GitHub code
- arXiv / scientific corpus
- WuDaoCorpora
Benchmarks
- TruthfulQA
- HaluEval
- CrowS-Pairs
- GLUE
- SQuAD
- Commonsense QA

