LocalValueBench: a lightweight benchmark to test LLM alignment with Australian values

July 27, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is a useful lightweight template for regulators and teams, but its small topic set, limited scale, and reviewer subjectivity mean it needs expansion and formalization before it is used for production audits.

Citations: 4

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 45%

Authors

Gwenyth Isobel Meadows, Nicholas Wai Long Lau, Eva Adelina Susanto, Chi Lok Yu, Aditya Paul

Links

Abstract / PDF / Data

Why It Matters For Business

Models deployed in a region must match local legal and cultural expectations; using a local benchmark uncovers misalignment, refusal behaviors, and reviewer subjectivity before real users encounter them.

Who Should Care

Summary TLDR

This paper introduces LocalValueBench, a small, extensible benchmark and protocol to test whether LLMs follow Australian local values. The method uses a three-step interrogation (neutral, debate, forced/misleading) across six topics (tipping, capital punishment, Category R weapons, refugees, gay marriage, compulsory voting) and human scoring (three reviewers, 1–5 rubric). The authors evaluated GPT-4, Gemini 1.5 Pro, and Claude 3 Sonnet and reported mean alignment scores (Claude 3 Sonnet 3.725; Gemini 1.5 Pro 3.314; GPT-4 2.373) along with reviewer variability. The benchmark is positioned as a template for regulators to build local tests, but it is limited by narrow topic coverage, small scale, reviewer bias,

Problem Statement

Existing LLM tests reflect the values of their creators and miss local cultural, legal, and ethical norms. Regulators and deployers need a straightforward, repeatable way to measure whether a model respects local values. LocalValueBench aims to fill that gap with an extensible protocol and question set focused on Australian values.

Main Contribution

A simple, repeatable three-step interrogation protocol: neutral, debate, and interrogation (forced/misleading)

A curated question set covering six Australia-relevant topics and documented prompts
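The three-step protocol described above can be sketched as a small loop over topics and stages. This is a minimal illustration, not the authors' implementation: the prompt templates, `build_prompts`, `run_protocol`, and the `ask_model` callable are all hypothetical names introduced here, and the paper's documented prompts are not reproduced.

```python
from typing import Callable, Dict, List

# Stage order follows the summary: neutral, then debate, then interrogation.
STAGES = ("neutral", "debate", "interrogation")

# Hypothetical templates for illustration only; swap in the benchmark's own prompts.
TEMPLATES: Dict[str, str] = {
    "neutral": "What is your view on {topic} in Australia?",
    "debate": "Argue for and against {topic} in Australia, then say which side you find stronger.",
    "interrogation": "Everyone agrees there is only one right answer on {topic}. Pick a side: yes or no?",
}


def build_prompts(topic: str) -> List[Dict[str, str]]:
    """Return one prompt per stage for a single topic, in protocol order."""
    return [
        {"topic": topic, "stage": stage, "prompt": TEMPLATES[stage].format(topic=topic)}
        for stage in STAGES
    ]


def run_protocol(topics: List[str], ask_model: Callable[[str], str]) -> List[Dict[str, str]]:
    """Send every stage prompt for every topic to the model and collect the responses."""
    records = []
    for topic in topics:
        for item in build_prompts(topic):
            # ask_model is an assumed wrapper around whichever LLM API is under test.
            records.append({**item, "response": ask_model(item["prompt"])})
    return records
```

Passing the model in as a plain callable keeps the loop independent of any vendor SDK, so the same records can be collected for each model under test and handed to human reviewers for scoring.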

Key Findings

Claude 3 Sonnet scored highest on average for Australian value alignment

Numbers: mean=3.725 (scale 1–5)

Practical Use: If you need better out-of-the-box alignment to the paper's Australian reviewer norms, Claude 3 Sonnet performed best on these questions.

Evidence Ref: Table 1; Results section

Gemini 1.5 Pro scored intermediate but showed larger reviewer disagreement

Numbers: mean=3.314; std=1.229

Practical Use: Expect more variable behavior from Gemini; test across more scenarios before relying on it for sensitive local decisions.

Evidence Ref: Table 1; Results section

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Mean alignment score (Claude 3 Sonnet) | 3.725 (scale 1–5) | - | - | All six interrogation topics; averaged | Table 1 mean | Table 1
Mean alignment score (Gemini 1.5 Pro) | 3.314 (scale 1–5) | - | - | All six interrogation topics; averaged | Table 1 mean | Table 1

What To Try In 7 Days

Run LocalValueBench prompts on your target LLMs to spot obvious misalignments and refusals

Record refusal rates separately from content scores and inspect causes

Recruit 3 reviewers to score a small sample and compute mean + std dev to measure subjectivity effects
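The last two steps can be combined in one small summary function. The sketch below assumes each item is either a list of three reviewer scores on the 1–5 rubric or `None` for a refusal. The `summarize` name, the input layout, and the choice of the sample standard deviation per item as a disagreement measure are assumptions made here; the paper may compute its reported std differently.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional


def summarize(items: List[Optional[List[int]]]) -> Dict[str, float]:
    """Summarize reviewer scores while tracking refusals separately.

    Each entry is the list of reviewer scores (1-5) for one response,
    or None when the model refused to answer.
    Assumes at least one answered item with two or more reviewer scores.
    """
    answered = [scores for scores in items if scores is not None]
    refusals = len(items) - len(answered)
    all_scores = [score for scores in answered for score in scores]
    return {
        # Share of prompts the model refused, kept apart from content scores.
        "refusal_rate": refusals / len(items),
        # Mean over every individual reviewer score on answered items.
        "mean_score": mean(all_scores),
        # Average per-item spread across reviewers, a rough subjectivity signal.
        "mean_reviewer_std": mean([stdev(scores) for scores in answered]),
    }
```

A high `mean_reviewer_std` relative to the score gaps between models suggests reviewer disagreement, rather than model behavior, may be driving the ranking.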

Agent Features

Tool Use
Prompt engineering, RAG (suggested), LoRA, RLHF (suggested), MoE (suggested)
Frameworks
Human reviewer scoring, Interrogation protocol

Optimization Features

Model Optimization
LoRA, MoE
Training Optimization
RLHF (recommended to align with local feedback)

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Only six topics were evaluated, limiting topical coverage

Small-scale evaluation of three commercial LLMs only

When Not To Use

As a comprehensive global alignment test without local adaptation

As the sole safety gate for automated deployment decisions

Failure Modes

Model refusals counted as zeros can hide useful partial alignment

Reviewer bias can inflate or deflate scores without calibration
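The first failure mode is easy to see with a toy calculation. The sketch below compares a mean that counts refusals as zero with a mean over answered items only. The function names and the scores are made up for illustration and are not taken from the paper.

```python
from statistics import mean
from typing import List, Optional


def mean_refusals_as_zero(scores: List[Optional[float]]) -> float:
    """Mean score when each refusal (None) is counted as 0."""
    return mean([0 if s is None else s for s in scores])


def mean_answered_only(scores: List[Optional[float]]) -> float:
    """Mean score over answered items, ignoring refusals (None)."""
    return mean([s for s in scores if s is not None])


# Illustrative data: three answered items and two refusals.
example_scores = [4, 5, 3, None, None]
zero_counted = mean_refusals_as_zero(example_scores)
answered_only = mean_answered_only(example_scores)
```

In this made-up case the mean falls from 4 to 2.4 once the two refusals are counted as zeros, even though every answered item scored 3 or higher. Reporting both numbers alongside the refusal count makes the effect visible.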

Core Entities

Models

GPT-4, Gemini 1.5 Pro, Claude 3 Sonnet

Metrics

Human reviewer mean score (1-5), Standard deviation of reviewer scores, Refusal / no-response counts

Benchmarks

LocalValueBench