A short review plus a simple scoring formula to judge LLM output quality

Overview

Decision Snapshot: Needs Validation

This is a survey plus a simple scoring idea without new experiments; actionable but requires validation in real systems.

Citations: 4

Evidence Strength: 0.40

Confidence: 0.70

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 35%

Authors

Rick Rejeleene, Xiaowei Xu, John Talburt

Links

Abstract / PDF

Why It Matters For Business

Low information quality in LLM outputs can cause bad decisions, legal risk, and user distrust; measuring and filtering quality reduces downstream risk and saves money on remediation.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, CEO

Summary TLDR

This paper reviews how data quality, tokenization, and training scale drive trust problems in large language models (LLMs). It proposes a simple, domain‑agnostic information quality score that combines accuracy, consistency and relevance as weighted factors. The authors survey tokenizers, datasets, scaling laws (Chinchilla, Broken Neural Scaling Laws), and practical mitigations such as filtering, de-duplication, human feedback, and retrieval-based checks. No new experiments or code are released.

Problem Statement

LLMs produce useful text but also unreliable, biased or fabricated outputs. The paper argues information quality failures trace to training data issues (noise, bias, tokenization, duplication), scaling choices, and gaps in verification. It proposes a compact, tunable metric to quantify information quality of generated text.

Main Contribution

Survey of data-quality drivers of LLM trust: tokenization, bias, duplication, dataset mix, and scaling effects.

Proposal of a simple, linear information-quality metric combining accuracy, consistency and relevance with tunable weights.

Key Findings

LLM information quality can be expressed as a weighted sum of three dimensions: accuracy, consistency, relevance.

Practical Use: Implement a simple weighted score (custom weights per use case) to rank or filter model outputs before downstream use.

Evidence Ref: Section 3 and Mathematical formulation of Information Quality Evaluation
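
The weighted-sum idea can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation (no code was released): the names `IQWeights`, `iq_score`, and `filter_outputs`, the [0, 1] score range, the normalization by the weight total, and the use of a threshold are all assumptions made for this sketch. How the three per-dimension scores are produced is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IQWeights:
    """Hypothetical container for the three tunable weights."""
    accuracy: float = 1.0
    consistency: float = 1.0
    relevance: float = 1.0


def iq_score(accuracy: float, consistency: float, relevance: float,
             weights: IQWeights = IQWeights()) -> float:
    """Weighted sum of the three dimensions, divided by the weight total.

    Assumes each input score is already in [0, 1]; dividing by the weight
    total keeps the result in [0, 1] too (an assumption of this sketch).
    """
    for name, value in (("accuracy", accuracy),
                        ("consistency", consistency),
                        ("relevance", relevance)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    parts = (weights.accuracy, weights.consistency, weights.relevance)
    if min(parts) < 0 or sum(parts) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    weighted = (weights.accuracy * accuracy
                + weights.consistency * consistency
                + weights.relevance * relevance)
    return weighted / sum(parts)


def filter_outputs(scored, threshold, weights=IQWeights()):
    """Keep (text, score) pairs at or above the threshold, best first.

    `scored` is an iterable of (text, accuracy, consistency, relevance).
    """
    pairs = [(text, iq_score(a, c, r, weights)) for text, a, c, r in scored]
    kept = [(text, s) for text, s in pairs if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

Raising the `accuracy` weight for a fact-heavy use case, or the `relevance` weight for search-style ranking, is the kind of per-use-case tuning the finding describes; suitable weights and thresholds would still need validation on real outputs.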

Large-scale training and dataset choice matter: GPT-3 is cited as having 175B parameters and being trained on ~570 GB of text.

Numbers: 175B params; 570 GB training data

Practical Use: When comparing models, include both parameter count and data scale; bigger models alone do not guarantee better, verifiable answers.

Evidence Ref: Section 3 and Table 2

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	55.9% (LLMs) vs 89% (humans)	—	—	Commonsense QA (cited)	Section 7 cites LLM vs human performance on Commonsense QA	Section 7
Validation perplexity / BLEU on noisy SBNATION data	Perplexity 33.34; BLEU 1.78	—	—	SBNATION (cited)	Section 3 used as example of noisy training data hurting performance	Section 3

What To Try In 7 Days

Implement the paper's simple IQ score (weights for accuracy, consistency, relevance) to triage outputs.

Run a quick audit of tokenization settings and de-duplicate your training/ingestion corpora.

Add a retrieval or browser-check step (search or WebGPT-style) for high-impact queries.
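
For the de-duplication step in the second item above, a minimal starting point is exact-duplicate removal by hashing normalized text. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the normalization (lowercasing and whitespace collapsing) is one arbitrary choice, and near-duplicate detection is out of scope.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs):
    """Return docs with exact duplicates (after normalization) removed.

    Keeps the first occurrence of each document and preserves input order.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Storing digests rather than full texts keeps memory use modest on large corpora; paraphrased or lightly edited copies will still slip through this exact-match check.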

Optimization Features

Token Efficiency

Highlights that the tokenization choice (BPE, WordPiece, unigram, character) affects sequence length and meaning

Infra Optimization

Notes high compute and energy costs; suggests sparsely activated experts to reduce cost

Model Optimization

Mixture of Experts (MoE)

Training Optimization

Emphasizes compute-data scaling trade-offs (Chinchilla compute-optimal regime)
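
For a rough sense of the compute-data trade-off, a commonly quoted reading of the Chinchilla result is about 20 training tokens per parameter, and training compute is often approximated as C ≈ 6·N·D FLOPs for N parameters and D tokens. Both are rules of thumb brought in here for illustration, not figures from this summary, and the function names are hypothetical.

```python
# Rule-of-thumb constants; assumptions for illustration, not from the source.
TOKENS_PER_PARAM = 20.0        # approximate compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6.0    # common C ~ 6 * N * D approximation


def compute_optimal_tokens(n_params: float) -> float:
    """Approximate token budget for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute in FLOPs for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens
```

Under these assumptions, `compute_optimal_tokens(175e9)` gives roughly 3.5 trillion tokens for a 175B-parameter model; treat this as an order-of-magnitude estimate only.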

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

No experiments or released code; proposals are conceptual and need validation.

Proposed IQ metric is linear and simple; may miss other quality dimensions (safety, privacy, timeliness).

When Not To Use

Do not use the paper's IQ metric as a proven filter in safety-critical systems without validation.

Do not treat the survey as a benchmark of model performance; it summarizes the literature rather than measuring models.

Failure Modes

IQ weights may be chosen poorly, letting biased or fluent-but-false text pass.

Tokenization mismatches can introduce unseen subword artifacts and reduce factual accuracy.
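
A small worked example shows how the first failure mode above can arise. The symbols $A$, $C$, $R$ (accuracy, consistency, and relevance scores in $[0, 1]$), the weights, and the 0.7 acceptance threshold are illustrative assumptions, not values from the paper.

```latex
\begin{align*}
\mathrm{IQ} &= w_A A + w_C C + w_R R, \qquad w_A + w_C + w_R = 1 \\
            &= 0.2 \cdot 0.10 + 0.4 \cdot 0.95 + 0.4 \cdot 0.95 \\
            &= 0.02 + 0.38 + 0.38 = 0.78 > 0.7
\end{align*}
```

Here a fluent, on-topic but largely false answer ($A = 0.10$) still clears the threshold because accuracy carries little weight. With $w_A = 0.6$ and $w_C = w_R = 0.2$, the same answer scores $0.06 + 0.19 + 0.19 = 0.44$ and would be rejected.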

Core Entities

Models

GPT-3, GPT-4, ChatGPT, BERT, LLaMA / Llama2, Gopher, PaLM, BLOOM, BART, T5, GLaM

Metrics

perplexity, BLEU, Accuracy, negative log-likelihood (loss)

Datasets

CommonCrawl, WebText / WebText2, Books1/Books2, Wikipedia, The Pile, SQuAD, GLUE, Reddit corpus, GitHub code, arXiv / scientific corpus, WuDaoCorpora

Benchmarks

TruthfulQA, HaluEval, CrowS-Pairs, GLUE, SQuAD, Commonsense QA

You May Also Want to Read

Professional multilingual TruthfulQA shows truth gaps across languages but smaller than expected

Train a model to judge and correct its own facts with token-level rewards to cut hallucinations

TruthHypo benchmark and KnowHD detector to measure and filter hallucinated scientific hypotheses

Use weak or small models as judges: peer prediction rewards honesty and detects deception even when judges are far weaker

Induce a model to hallucinate, then penalize those hallucinations at decoding to reduce LLM fabrications
