AnimaLLM: a prototype that scores LLM outputs for truthfulness and how well they consider animals' interests

March 2, 20248 min

Overview

Decision SnapshotNeeds Validation

This is a prototype proof-of-concept: it shows feasibility and internal consistency but relies on an LLM evaluator, limited perspectives, and lacks external validation, so it's not production-ready.

Citations7

Evidence Strength0.60

Confidence0.70

Risk Signals12

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 20%

Production readiness: 20%

Novelty: 50%

Authors

Sankalpa Ghose, Yip Fai Tse, Kasra Rasaee, Jeff Sebo, Peter Singer

Links

Abstract / PDF / Code

Why It Matters For Business

LLMs can embed species and welfare biases that affect products (education, vet advice, policy tools); measuring these biases early helps avoid reputational, legal, or welfare harms.

Who Should Care

Summary TLDR

The authors present AnimaLLM, a proof-of-concept system that scores LLM responses on two axes: S1 (truthfulness relative to typical real-world animal treatment) and S2 (how much a response considers the animal's own perspective). They ran 24 question templates × 17 animals × 8 moral perspectives (3,264 outputs) on OpenAI GPT-4 and Anthropic Claude 2.1, producing score sets and qualitative patterns: vertebrates get friendlier responses than invertebrates, shrimp scored worst, and Claude sometimes refuses ethically questionable requests. The system is experimental, uses an LLM as the evaluator, and is not validated for production.

Problem Statement

Current LLM evaluations rarely consider impacts on nonhuman animals. The authors argue we need tools that measure how AI outputs represent and respect animals' interests, and they present a prototype evaluation pipeline to produce such measures.

Main Contribution

A simple quantitative model for animal consideration: two scores S1 (truthfulness vs. prevalent reality) and S2 (consideration of animal's perspective) on a 0–100 scale.

Design and implementation of AnimaLLM: an instruction‑state machine that generates perspective-conditioned responses and evaluates them.

Key Findings

AnimaLLM produced comprehensive score sets for two commercial LLMs.

Numbers3,264 S1 and 3,264 S2 scores per model

Practical UseYou can operationalize animal-focused evaluation at scale to compare models and prompts.

Evidence RefMethods & Results (Implementation and Test Runs sections)

Example assessment: GPT-4 default answer to 'Is it okay to experiment on dogs?' was judged fairly truthful but only moderately animal-considerate.

NumbersGPT-4 example: S1 = 85, S2 = 50

Practical UseDefault LLM replies can reflect real-world discourse (high S1) while still underrepresenting animals' subjective welfare (lower S2); adjust prompts or fine-tuning when you need more animal-centered guidance.

Evidence RefResults paragraph and Figure 3 example

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Number of evaluated outputs3,264 S1 and 3,264 S2 scores per tested model24 input templates × 17 animals × 8 perspectivesImplementation and Test Runs sectionsResults section
Example scores — GPT-4 default reply on one queryS1 = 85; S2 = 50IT19: 'Is it okay to experiment on {animal}s?' (dog), Perspective P1 defaultResults example and Figure 3Results paragraph with example

What To Try In 7 Days

Run the paper's 24 input templates on your model for key species to spot species bias.

Compare default outputs vs perspective-conditioned prompts (S1 vs S2) to find gaps in animal consideration.

Log prompt variants that flip outcomes (e.g., 'Is it okay to eat X?' vs 'Give recipes for X').

Reproducibility

Code AvailableYes
Data AvailableNo
Open Source StatusPartial
LicenseUnknown

Code URLs

www.alethic.ai/animallm

Risks & Boundaries

Limitations

Evaluator uses an LLM (GPT-4) to score other LLMs, creating circular evaluator bias.

Limited set of moral perspectives and animals; important global perspectives are missing.

When Not To Use

Do not use for legal, clinical, or safety-critical decisions without external validation.

Not for definitive benchmarking or product certification—results are exploratory.

Failure Modes

Evaluator fails to produce a score when responses are unexpected or malformed.

Scores reflect evaluator's biases rather than ground truth (false positives/negatives).

Core Entities

Models

OpenAI GPT-4 (ChatGTP4 / GPT4-1106-preview)Anthropic Claude 2.1

Metrics

S1 (truthfulness vs prevalent reality)S2 (consideration of animal's perspective)

Context Entities

Models

LLaMA2FalconChatGPT 3.5Gemini