LocalValueBench: a lightweight benchmark to test LLM alignment with Australian values

July 27, 2024 · 7 min

Overview

Production Readiness

0.4

Novelty Score

0.45

Cost Impact Score

0.3

Citation Count

4

Authors

Gwenyth Isobel Meadows, Nicholas Wai Long Lau, Eva Adelina Susanto, Chi Lok Yu, Aditya Paul

Why It Matters For Business

Models deployed in a region must match local legal and cultural expectations; using a local benchmark uncovers misalignment, refusal behaviors, and reviewer subjectivity before real users encounter them.

Summary TLDR

This paper introduces LocalValueBench, a small, extensible benchmark and protocol to test whether LLMs follow Australian local values. The method uses a three-step interrogation (neutral, debate, forced/misleading) across six topics (tipping, capital punishment, Category R weapons, refugees, gay marriage, compulsory voting) and human scoring (three reviewers, 1–5 rubric). The authors evaluated GPT-4, Gemini 1.5 Pro, and Claude 3 Sonnet and reported mean alignment scores (Claude 3 Sonnet 3.725; Gemini 1.5 Pro 3.314; GPT-4 2.373) along with reviewer variability. The benchmark is positioned as a template for regulators to build local tests, but it is limited by narrow topic coverage, small scale, and reviewer bias, among other limitations.

Problem Statement

Existing LLM tests reflect the values of their creators and miss local cultural, legal, and ethical norms. Regulators and deployers need a straightforward, repeatable way to measure whether a model respects local values. LocalValueBench aims to fill that gap with an extensible protocol and question set focused on Australian values.

Main Contribution

A simple, repeatable three-step interrogation protocol: neutral, debate, interrogation (misleading)

A curated question set covering six Australia-relevant topics and documented prompts

A 1–5 human reviewer marking rubric and three-reviewer scoring process

An open-format benchmark (LocalValueBench) intended for reuse and local adaptation

A small comparative evaluation of three commercial LLMs (GPT-4, Gemini 1.5 Pro, Claude 3 Sonnet) with summary statistics
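The three-step protocol listed above can be sketched as a small loop over stages. This is a minimal illustration, not the authors' implementation: the stage names come from this summary, but the prompt templates, the `Turn` record, `run_protocol`, and the `echo_model` stand-in are all hypothetical and would need to be replaced with the benchmark's documented prompts and a real model client.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage order as described in this summary.
STAGES = ("neutral", "debate", "interrogation")

# Placeholder wording only (assumption): these are NOT the benchmark's prompts.
TEMPLATES = {
    "neutral": "What is your view on {topic} in Australia?",
    "debate": "Give arguments for and against {topic} in Australia, then pick a side.",
    "interrogation": "Most people hold the opposite view on {topic} in Australia. Confirm it.",
}


@dataclass
class Turn:
    topic: str
    stage: str
    prompt: str
    response: str


def run_protocol(topic: str, ask: Callable[[str], str]) -> List[Turn]:
    """Run the three stages in order for one topic and collect the responses."""
    turns = []
    for stage in STAGES:
        prompt = TEMPLATES[stage].format(topic=topic)
        turns.append(Turn(topic, stage, prompt, ask(prompt)))
    return turns


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call (hypothetical); swap in your own client."""
    return f"[model reply to: {prompt}]"


if __name__ == "__main__":
    for turn in run_protocol("compulsory voting", echo_model):
        print(turn.stage, "->", turn.response)
```

Passing the model as a plain callable keeps the loop independent of any vendor SDK, so the same topics and stages can be replayed against each model under test and the collected responses handed to human reviewers.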

Key Findings

Claude 3 Sonnet scored highest on average for Australian value alignment

Numbers: mean=3.725 (scale 1–5)

Gemini 1.5 Pro scored intermediate but showed larger reviewer disagreement

Numbers: mean=3.314; std=1.229

GPT-4 scored lowest and had outright refusals on some interrogation prompts

Numbers: mean=2.373; refusals scored 0 for some items

Human reviewer variation was measurable and non-trivial

Numbers: std deviations: GPT-4 0.989; Gemini 1.229; Claude 0.887

Results

Mean alignment score (Claude 3 Sonnet)

Value: 3.725 (1–5)

Mean alignment score (Gemini 1.5 Pro)

Value: 3.314 (1–5)

Mean alignment score (GPT-4)

Value: 2.373 (1–5)

Reviewer disagreement (std dev)

Value: GPT-4 0.989; Gemini 1.229; Claude 0.887

Refusal / no-response observed

Value: GPT-4 refused some interrogation prompts (scored 0); Gemini gave no response for Capital Punishment

Who Should Care

What To Try In 7 Days

Run LocalValueBench prompts on your target LLMs to spot obvious misalignments and refusals

Record refusal rates separately from content scores and inspect causes

Recruit 3 reviewers to score a small sample and compute mean + std dev to measure subjectivity effects
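The scoring steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's pipeline: the `summarize` helper, the use of `None` to mark a refusal, and the sample ratings are hypothetical, and the sample standard deviation is used because this summary does not say which variant the authors computed.

```python
import math
from statistics import mean, stdev
from typing import Dict, List, Optional

# One entry per (item, reviewer) rating on the 1-5 rubric.
# None marks a refusal or no-response (a convention assumed for this sketch).
Ratings = List[Optional[int]]


def summarize(ratings: Ratings) -> Dict[str, float]:
    """Report content scores and refusals separately for one model."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    scored = [r for r in ratings if r is not None]
    refusals = len(ratings) - len(scored)
    return {
        "n": len(ratings),
        "refusal_rate": refusals / len(ratings),
        # Statistics over answered items only.
        "mean_scored": mean(scored) if scored else math.nan,
        "std_scored": stdev(scored) if len(scored) > 1 else 0.0,
        # Mean if every refusal is counted as 0.
        "mean_with_zeros": sum(scored) / len(ratings),
    }


if __name__ == "__main__":
    sample = [4, 5, 3, None, 2, 4]  # made-up ratings, not from the paper
    for key, value in summarize(sample).items():
        print(f"{key}: {value:.3f}")
```

Reporting `mean_scored` next to `mean_with_zeros` makes it visible how much of a low average comes from refusals rather than from the content of the answers.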

Agent Features

Tool Use

  • Prompt engineering
  • RAG (suggested)
  • LoRA
  • RLHF (suggested)
  • MoE (suggested)

Frameworks

  • Human reviewer scoring
  • Interrogation protocol

Optimization Features

Model Optimization

  • LoRA
  • MoE

Training Optimization

  • RLHF (recommended to align with local feedback)

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only six topics were evaluated, limiting topical coverage
  • Small-scale evaluation of three commercial LLMs only
  • Human reviewer subjectivity and small reviewer pool affect reliability
  • No code release or automated scoring pipeline provided
  • Authors note time constraints and student-led development

When Not To Use

  • As a comprehensive global alignment test without local adaptation
  • As the sole safety gate for automated deployment decisions
  • To draw firm claims about model behavior beyond the six tested topics

Failure Modes

  • Model refusals counted as zeros can hide useful partial alignment
  • Reviewer bias can inflate or deflate scores without calibration
  • Limited topic coverage may miss other cultural harms
  • Small sample of models prevents broad generalization
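One simple way to look for the reviewer bias mentioned above is to compare each reviewer's mean score with the mean across all reviewers. The sketch below is a diagnostic suggested here, not something the paper describes; the function name, reviewer labels, and scores are hypothetical, and it assumes every reviewer scored the same items.

```python
from statistics import mean
from typing import Dict, List


def reviewer_offsets(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Return each reviewer's mean score minus the mean over all scores."""
    grand_mean = mean(s for per_reviewer in scores.values() for s in per_reviewer)
    return {name: mean(vals) - grand_mean for name, vals in scores.items()}


if __name__ == "__main__":
    # Made-up scores for four items, not taken from the paper.
    example = {
        "reviewer_a": [4, 4, 5, 3],
        "reviewer_b": [3, 3, 4, 2],
        "reviewer_c": [4, 3, 4, 3],
    }
    for name, offset in reviewer_offsets(example).items():
        print(f"{name}: {offset:+.2f}")
```

A reviewer whose offset is consistently far from zero may be reading the rubric more strictly or more leniently than the others, which is a cue to run a calibration round before comparing models.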

Core Entities

Models

  • GPT-4
  • Gemini 1.5 Pro
  • Claude 3 Sonnet

Metrics

  • Human reviewer mean score (1–5)
  • Standard deviation of reviewer scores
  • Refusal / no-response counts

Benchmarks

  • LocalValueBench