A short review plus a simple scoring formula to judge LLM output quality

January 23, 2024 · 6 min

Overview

Decision Snapshot: Needs Validation

This is a survey plus a simple scoring idea without new experiments; actionable but requires validation in real systems.

Citations: 4

Evidence Strength: 0.40

Confidence: 0.70

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 50%

Production readiness: 30%

Novelty: 35%

Authors

Rick Rejeleene, Xiaowei Xu, John Talburt

Links

Abstract / PDF

Why It Matters For Business

Low information quality in LLM outputs can cause bad decisions, legal risk, and user distrust; measuring and filtering quality reduces downstream risk and saves money on remediation.

Summary TLDR

This paper reviews how data quality, tokenization, and training scale drive trust problems in large language models (LLMs). It proposes a simple, domain‑agnostic information quality score that combines accuracy, consistency and relevance as weighted factors. The authors survey tokenizers, datasets, scaling laws (Chinchilla, Broken Neural Scaling Laws), and practical mitigations such as filtering, de-duplication, human feedback, and retrieval-based checks. No new experiments or code are released.

Problem Statement

LLMs produce useful text but also unreliable, biased or fabricated outputs. The paper argues information quality failures trace to training data issues (noise, bias, tokenization, duplication), scaling choices, and gaps in verification. It proposes a compact, tunable metric to quantify information quality of generated text.

Main Contribution

Survey of data-quality drivers of LLM trust: tokenization, bias, duplication, dataset mix, and scaling effects.

Proposal of a simple, linear information-quality metric combining accuracy, consistency and relevance with tunable weights.

Key Findings

LLM information quality can be expressed as a weighted sum of three dimensions: accuracy, consistency, relevance.

Practical Use: Implement a simple weighted score (custom weights per use case) to rank or filter model outputs before downstream use.

Evidence Ref: Section 3 and Mathematical formulation of Information Quality Evaluation
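
The weighted score described in this finding can be sketched in Python. This is a minimal sketch, not the authors' implementation: the summary does not reproduce the paper's exact formula, so the names `ScoredOutput`, `iq_score`, and `triage`, the [0, 1] scale for each dimension, the weight normalization, the default weights, and the 0.7 threshold are all assumptions made for illustration. How the three per-dimension scores are produced (human rating, automated checks) is left open.

```python
from dataclasses import dataclass


@dataclass
class ScoredOutput:
    text: str
    accuracy: float     # assumed to lie in [0, 1]
    consistency: float  # assumed to lie in [0, 1]
    relevance: float    # assumed to lie in [0, 1]


def iq_score(item, weights=(0.5, 0.25, 0.25)):
    """Weighted sum of accuracy, consistency, and relevance.

    The default weights are hypothetical. They are normalized to sum to 1
    so the score stays in [0, 1] whenever the inputs do.
    """
    w_acc, w_con, w_rel = weights
    total = w_acc + w_con + w_rel
    if min(weights) < 0 or total <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    weighted = (w_acc * item.accuracy
                + w_con * item.consistency
                + w_rel * item.relevance)
    return weighted / total


def triage(items, threshold=0.7, weights=(0.5, 0.25, 0.25)):
    """Rank outputs best-first and keep those scoring at or above threshold."""
    scored = [(iq_score(item, weights), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(score, item) for score, item in scored if score >= threshold]


# Hypothetical per-dimension scores for two candidate answers.
candidates = [
    ScoredOutput("answer A", accuracy=0.9, consistency=0.8, relevance=0.9),
    ScoredOutput("answer B", accuracy=0.3, consistency=0.9, relevance=0.9),
]
kept = triage(candidates)
```

With these illustrative numbers, answer B is consistent and relevant but scores below the threshold because accuracy carries half the weight. A different use case might justify different weights, which is why they are left as parameters.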

Large-scale training and dataset choice matter: GPT-3 is cited as a 175B-parameter model trained on ~570 GB of text.

Numbers: 175B params; 570 GB training data

Practical Use: When comparing models, include both parameter count and data scale; bigger models alone do not guarantee better, verifiable answers.

Evidence Ref: Section 3 and Table 2

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Accuracy | 55.9% (LLMs) vs 89% (humans) | - | - | Commonsense QA (cited) | Section 7 cites LLM vs human performance on Commonsense QA | Section 7
Validation perplexity / BLEU on noisy SBNATION data | Perplexity 33.34; BLEU 1.78 | - | - | SBNATION (cited) | Section 3 used as example of noisy training data hurting performance | Section 3

What To Try In 7 Days

Implement the paper's simple IQ score (weights for accuracy, consistency, relevance) to triage outputs.

Run a quick audit of tokenization settings and de-duplicate your training/ingestion corpora.

Add a retrieval or browser-check step (search or WebGPT-style) for high-impact queries.
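
For the de-duplication item in the list above, a minimal sketch of exact-duplicate removal follows. It is one reasonable approach assumed for illustration, not a method taken from the paper: documents are normalized (Unicode NFKC, lowercase, collapsed whitespace) and compared by SHA-256 digest. The function names `normalize` and `deduplicate` are hypothetical, and near-duplicate detection is out of scope.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Normalize text so trivially different copies hash the same.

    The normalization steps are an illustrative choice, not a standard.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(documents):
    """Drop exact duplicates (after normalization).

    Keeps the first occurrence and preserves the original order.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical three-document corpus; the second is a near-verbatim copy.
corpus = [
    "The quick brown fox.",
    "the  quick brown fox. ",
    "A different sentence.",
]
unique_docs = deduplicate(corpus)
```

Hashing the normalized text keeps memory use proportional to the number of unique documents rather than their total size, which matters for large ingestion corpora.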

Optimization Features

Token Efficiency

Highlights that tokenizer choice (BPE, WordPiece, unigram, character) affects tokenized sequence length and how meaning is represented
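
The effect of tokenizer granularity on length can be illustrated with a toy comparison. This sketch does not implement BPE, WordPiece, or unigram tokenization; it only contrasts the two extremes (character-level and whitespace splitting), with subword schemes falling somewhere in between. The function names and sample sentence are hypothetical.

```python
def char_tokens(text):
    """Character-level tokenization: one token per character."""
    return list(text)


def whitespace_tokens(text):
    """Word-level stand-in: split on runs of whitespace."""
    return text.split()


sample = "Tokenization changes sequence length."
n_char = len(char_tokens(sample))
n_word = len(whitespace_tokens(sample))
print(f"character tokens: {n_char}, whitespace tokens: {n_word}")
```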

Infra Optimization

Notes high compute and energy costs; suggests sparsely activated experts to reduce cost

Model Optimization

Mixture of Experts (MoE)

Training Optimization

Emphasizes compute-data scaling trade-offs (Chinchilla compute-optimal regime)
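
As a rough illustration of the compute-data trade-off, the sketch below uses two widely quoted approximations that are not stated in this summary and should be treated as assumptions: about 20 training tokens per parameter for a Chinchilla-style compute-optimal model, and training compute of roughly 6 × parameters × tokens FLOPs. The function names are hypothetical.

```python
# Both constants are rule-of-thumb assumptions, not exact values.
TOKENS_PER_PARAM = 20       # approximate compute-optimal tokens per parameter
FLOPS_PER_PARAM_TOKEN = 6   # common approximation: C ~ 6 * N * D


def compute_optimal_tokens(n_params):
    """Estimate the training-token budget for a given parameter count."""
    return TOKENS_PER_PARAM * n_params


def training_flops(n_params, n_tokens):
    """Estimate total training compute in FLOPs."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


n_params = 175e9  # GPT-3 parameter count as cited in this summary
tokens = compute_optimal_tokens(n_params)
flops = training_flops(n_params, tokens)
print(f"tokens: {tokens:.2e}, FLOPs: {flops:.2e}")
```

Under these assumptions a 175B-parameter model would call for a token budget in the trillions, which is the sense in which parameter count and data scale need to be considered together.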

Reproducibility

Code Available: No
Data Available: No
Open Source Status: Unknown
License: Unknown

Risks & Boundaries

Limitations

No experiments or released code; proposals are conceptual and need validation.

Proposed IQ metric is linear and simple; may miss other quality dimensions (safety, privacy, timeliness).

When Not To Use

Do not use the paper's IQ metric as a proven filter in safety-critical systems without validation.

Do not treat the survey as a benchmark of model performance; it summarizes the literature rather than measuring models.

Failure Modes

IQ weights may be chosen poorly, letting biased or fluent-but-false text pass.

Tokenization mismatches can introduce unseen subword artifacts and reduce factual accuracy.
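
The first failure mode above can be made concrete with a small worked example. The numbers are hypothetical, and the notation is assumed for illustration rather than taken from the paper: A, C, and R denote accuracy, consistency, and relevance, each in [0, 1], with weights that sum to 1. Suppose a fluent but mostly false output has A = 0.1, C = 0.9, R = 0.9, and accuracy is weighted at only 0.1:

```latex
\begin{aligned}
\mathrm{IQ} &= w_A A + w_C C + w_R R \\
            &= 0.1 \times 0.1 + 0.45 \times 0.9 + 0.45 \times 0.9 \\
            &= 0.01 + 0.405 + 0.405 = 0.82
\end{aligned}
```

Against a pass threshold of, say, 0.7, this output would be accepted despite being largely inaccurate. Raising the accuracy weight to 0.6 (with 0.2 each for consistency and relevance) gives 0.06 + 0.18 + 0.18 = 0.42, which would be rejected.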

Core Entities

Models

GPT-3, GPT-4, ChatGPT, BERT, LLaMA / Llama2, Gopher, PaLM, BLOOM, BART, T5, GLaM

Metrics

perplexity, BLEU, Accuracy, negative log-likelihood (loss)

Datasets

CommonCrawl, WebText / WebText2, Books1/Books2, Wikipedia, The Pile, SQuAD, GLUE, Reddit corpus, GitHub code, arXiv / scientific corpus, WuDaoCorpora

Benchmarks

TruthfulQA, HaluEval, CrowS-Pairs, GLUE, SQuAD, Commonsense QA