LocalValueBench: a lightweight benchmark to test LLM alignment with Australian values

Overview

Decision SnapshotNeeds Validation

The benchmark is a useful lightweight template for regulators and teams, but its small topic set, limited scale, and human-subjectivity mean it needs expansion and formalization before production audits.

Citations4

Evidence Strength0.60

Confidence0.70

Risk Signals12

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/5

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 30%

Production readiness: 40%

Novelty: 45%

Authors

Gwenyth Isobel Meadows, Nicholas Wai Long Lau, Eva Adelina Susanto, Chi Lok Yu, Aditya Paul

Links

Abstract / PDF / Data

Why It Matters For Business

Models deployed in a region must match local legal and cultural expectations; using a local benchmark uncovers misalignment, refusal behaviors, and reviewer subjectivity before real users encounter them.

Who Should Care

Product Manager CTO ML Engineer Founder Data Scientist

Summary TLDR

This paper introduces LocalValueBench, a small, extensible benchmark and protocol to test whether LLMs follow Australian local values. The method uses a three-step interrogation (neutral, debate, forced/misleading) across six topics (tipping, capital punishment, Category R weapons, refugees, gay marriage, compulsory voting) and human scoring (three reviewers, 1–5 rubric). The authors evaluated GPT-4, Gemini 1.5 Pro, and Claude 3 Sonet and reported mean alignment scores (Claude 3 Sonet 3.725; Gemini 3.314; GPT-4 2.373) and reviewer variability. The benchmark is positioned as a template for regulators to build local tests, but it is limited by narrow topic coverage, small scale, reviewer bias,

Problem Statement

Existing LLM tests reflect the values of their creators and miss local cultural, legal, and ethical norms. Regulators and deployers need a straightforward, repeatable way to measure whether a model respects local values. LocalValueBench aims to fill that gap with an extensible protocol and question set focused on Australian values.

Main Contribution

A simple, repeatable three-step interrogation protocol: neutral, debate, interrogation (misleading)

A curated question set covering six Australia-relevant topics and documented prompts

Key Findings

Claude 3 Sonet scored highest on average for Australian value alignment

Numbersmean=3.725 (scale 1–5)

Practical UseIf you need better out-of-the-box alignment to the paper's Australian reviewer norms, Claude 3 Sonet performed best on these questions.

Evidence RefTable 1; Results section

Gemini 1.5 Pro scored intermediate but showed larger reviewer disagreement

Numbersmean=3.314; std=1.229

Practical UseExpect more variable behavior from Gemini; test across more scenarios before relying on it for sensitive local decisions.

Evidence RefTable 1; Results section

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Mean alignment score (Claude 3 Sonet)	3.725 (1–5)	—	—	All six interrogation topics; averaged	Table 1 mean	Table 1
Mean alignment score (Gemini 1.5 Pro)	3.314 (1–5)	—	—	All six interrogation topics; averaged	Table 1 mean	Table 1

What To Try In 7 Days

Run LocalValueBench prompts on your target LLMs to spot obvious misalignments and refusals

Record refusal rates separately from content scores and inspect causes

Recruit 3 reviewers to score a small sample and compute mean + std dev to measure subjectivity effects

Agent Features

Tool Use

Prompt engineeringRAG (suggested)LoRARLHF (suggested)MoE (suggested)

Frameworks

Human reviewer scoringInterrogation protocol

Optimization Features

Model Optimization

LoRAMoE

Training Optimization

RLHF (recommended to align with local feedback)

Reproducibility

Code AvailableNo

Data AvailableYes

Open Source StatusPartial

LicenseUnknown

Data URLs

https://arxiv.org/abs/2408.01460v1

Risks & Boundaries

Limitations

Only six topics were evaluated, limiting topical coverage

Small-scale evaluation of three commercial LLMs only

When Not To Use

As a comprehensive global alignment test without local adaptation

As the sole safety gate for automated deployment decisions

Failure Modes

Model refusals counted as zeros can hide useful partial alignment

Reviewer bias can inflate or deflate scores without calibration

Core Entities

Models

GPT-4Gemini 1.5 ProClaude 3 Sonet

Metrics

Human reviewer mean score (1-5)Standard deviation of reviewer scoresRefusal / no-response counts

Benchmarks

LocalValueBench

Overview

Trust Signals

Reproducibility

At A Glance

Authors

Links

Why It Matters For Business

Who Should Care

Summary TLDR

Problem Statement

Main Contribution

Key Findings

Claude 3 Sonet scored highest on average for Australian value alignment

Gemini 1.5 Pro scored intermediate but showed larger reviewer disagreement

Results

What To Try In 7 Days

Agent Features

Optimization Features

Reproducibility

Data URLs

Risks & Boundaries

Limitations

When Not To Use

Failure Modes

Core Entities

Models

Metrics

Benchmarks

You May Also Want to Read

ThaiSafetyBench: 1,954 Thai malicious prompts reveal cultural blind spots in LLM safety

Key finding

Model judges reward ethics-based refusals; human users penalize them

Key finding

A 300k-case, 22-language benchmark that tests how jailbreak prompts make LLMs write fake news

Key finding

MEDIC: a practical framework to test clinical LLM safety, hallucinations, and operational utility

Key finding

A balanced 44-class benchmark (440 prompts + 8.8K mutations) for testing whether LLMs refuse unsafe requests, plus a fast judge design.

Key finding