Overview
Production Readiness
0.2
Novelty Score
0.5
Cost Impact Score
0.2
Citation Count
7
Why It Matters For Business
LLMs can embed species and welfare biases that affect products (education, vet advice, policy tools); measuring these biases early helps avoid reputational, legal, or welfare harms.
Summary TLDR
The authors present AnimaLLM, a proof-of-concept system that scores LLM responses on two axes: S1 (truthfulness relative to typical real-world animal treatment) and S2 (how much a response considers the animal's own perspective). They ran 24 question templates × 17 animals × 8 moral perspectives (3,264 outputs) on OpenAI GPT-4 and Anthropic Claude 2.1, producing score sets and qualitative patterns: vertebrates get friendlier responses than invertebrates, shrimp scored worst, and Claude sometimes refuses ethically questionable requests. The system is experimental, uses an LLM as the evaluator, and is not validated for production.
Problem Statement
Current LLM evaluations rarely consider impacts on nonhuman animals. The authors argue we need tools that measure how AI outputs represent and respect animals' interests, and they present a prototype evaluation pipeline to produce such measures.
Main Contribution
A simple quantitative model for animal consideration: two scores S1 (truthfulness vs. prevalent reality) and S2 (consideration of animal's perspective) on a 0–100 scale.
Design and implementation of AnimaLLM: an instruction-state machine that generates perspective-conditioned responses and evaluates them.
A proof-of-concept run: 24 input templates × 17 animals × 8 perspectives → 3,264 S1 and 3,264 S2 scores for each tested LLM (GPT-4 and Claude 2.1).
Initial qualitative findings about species bias in LLM outputs and how prompting changes model behavior.
A discussion of limits and research directions, including evaluator bias, missing perspectives, and validation needs.
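The size of the proof-of-concept run follows directly from the three dimensions listed above. A minimal sketch of enumerating that grid, assuming a plain Cartesian product; the source gives only the counts (24, 17, 8), so every identifier below is a hypothetical placeholder and not one of the paper's actual templates, animals, or perspectives:

```python
from itertools import product

# Placeholder identifiers only: hypothetical stand-ins for the paper's lists,
# sized to match the counts stated in the summary.
TEMPLATES = [f"template_{i:02d}" for i in range(24)]
ANIMALS = [f"animal_{i:02d}" for i in range(17)]
PERSPECTIVES = [f"perspective_{i}" for i in range(8)]


def build_grid(templates, animals, perspectives):
    """Return every (template, animal, perspective) combination as a list."""
    return list(product(templates, animals, perspectives))


grid = build_grid(TEMPLATES, ANIMALS, PERSPECTIVES)
print(len(grid))  # 24 * 17 * 8 = 3264
```

Each tuple corresponds to one generated response, which in turn receives one S1 and one S2 score per tested model.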
Key Findings
AnimaLLM produced comprehensive score sets for two commercial LLMs.
Example assessment: GPT-4's default answer to 'Is it okay to experiment on dogs?' was judged fairly truthful but only moderately animal-considerate.
Species and cultural patterns emerged: vertebrates typically scored more animal-friendly than invertebrates; shrimp ranked worst across models.
Anthropic Claude 2.1 showed more refusals and stronger animal-friendly stances on some prompts than GPT-4, but behavior can be inconsistent across prompt phrasing.
The evaluator (which itself used GPT-4) produced stable clusters in repeated scoring for many outputs.
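One way to check the kind of stability described in the last finding is to score the same output several times and look at the spread. A minimal sketch, assuming scores on the 0 to 100 scale; the sample values and the `max_range` threshold are invented for illustration and are not taken from the paper:

```python
from statistics import mean, pstdev


def score_spread(scores):
    """Summarize repeated evaluator scores (0-100) for a single output."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),  # population standard deviation
        "range": max(scores) - min(scores),
    }


def is_stable(scores, max_range=10):
    """Treat a set of repeated scores as stable if they span at most max_range points.

    The default threshold is an assumption, not a value from the source.
    """
    return max(scores) - min(scores) <= max_range


# Illustrative repeated scores for one response (made-up numbers).
repeated = [70, 72, 68, 71, 69]
summary = score_spread(repeated)
print(summary["mean"], summary["range"], is_stable(repeated))
```

A tighter or looser threshold changes which outputs count as stable, so any reported stability rate should state the threshold used.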
Results
Number of evaluated outputs
Example scores — GPT-4 default reply on one query
Evaluator consistency
Who Should Care
What To Try In 7 Days
Run the paper's 24 input templates on your model for key species to spot species bias.
Compare default outputs against perspective-conditioned outputs on both S1 and S2 to find gaps in animal consideration.
Log prompt variants that flip outcomes (e.g., 'Is it okay to eat X?' vs 'Give recipes for X').
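The last step above can be sketched as a small log analysis. This is a minimal illustration, assuming each logged record carries an animal, a prompt variant, and a coarse outcome label; the field names, variant names, outcome labels, and sample records are all hypothetical placeholders and not data from the paper:

```python
from collections import defaultdict


def find_flips(records):
    """Return animals whose outcome differs across prompt variants.

    Each record is a dict with the (assumed) keys 'animal', 'variant', 'outcome'.
    """
    outcomes = defaultdict(dict)
    for rec in records:
        outcomes[rec["animal"]][rec["variant"]] = rec["outcome"]
    return {
        animal: by_variant
        for animal, by_variant in outcomes.items()
        if len(set(by_variant.values())) > 1
    }


# Made-up log entries, for illustration only.
records = [
    {"animal": "animal_a", "variant": "ask_permission", "outcome": "refused"},
    {"animal": "animal_a", "variant": "ask_recipe", "outcome": "answered"},
    {"animal": "animal_b", "variant": "ask_permission", "outcome": "answered"},
    {"animal": "animal_b", "variant": "ask_recipe", "outcome": "answered"},
]

flips = find_flips(records)
print(sorted(flips))  # animals whose outcome changed with the phrasing
```

Keeping the raw prompt text alongside each record makes it easier to see which wording change triggered the flip.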
Reproducibility
Code Urls
- www.alethic.ai/animallm
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Evaluator uses an LLM (GPT-4) to score other LLMs, creating circular evaluator bias.
- Limited set of moral perspectives and animals; important global perspectives are missing.
- Scoring definitions (S1/S2) are operational choices and may not capture all welfare-relevant factors.
- Not validated against independent human or animal-welfare expert judgments.
- Philosophical assumption that animal perspectives can be represented by LLMs remains unproven.
When Not To Use
- Do not use for legal, clinical, or safety-critical decisions without external validation.
- Not for definitive benchmarking or product certification—results are exploratory.
- Avoid using as sole source for animal-welfare policy or enforcement.
Failure Modes
- Evaluator fails to produce a score when responses are unexpected or malformed.
- Scores reflect evaluator's biases rather than ground truth (false positives/negatives).
- High sensitivity to prompt wording produces inconsistent behavior across templates.
- Tension between S1 and S2: an answer can score as 'truthful' about prevailing practice while showing little consideration for the animal.
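Two of the failure modes above, missing scores and S1/S2 tension, can be guarded against in post-processing. A minimal sketch, assuming the evaluator returns free text containing entries such as `S1: 85` and `S2: 40`; that output format, the regular expression, and the `gap` threshold are assumptions made for illustration, not the paper's actual evaluator protocol:

```python
import re

# Assumed format: "S1: <int>" and "S2: <int>" somewhere in the evaluator's text.
SCORE_RE = re.compile(r"\b(S[12])\s*[:=]\s*(\d{1,3})\b")


def parse_scores(text):
    """Extract S1 and S2 from evaluator text; return None if missing or out of range."""
    found = {name: int(value) for name, value in SCORE_RE.findall(text)}
    if set(found) != {"S1", "S2"}:
        return None  # malformed or unexpected evaluator output
    if not all(0 <= v <= 100 for v in found.values()):
        return None  # outside the 0-100 scale
    return found


def has_tension(scores, gap=30):
    """Flag responses rated much higher on S1 than on S2 (gap is an assumed threshold)."""
    return scores["S1"] - scores["S2"] >= gap


parsed = parse_scores("S1: 85, S2: 40")
print(parsed, has_tension(parsed))
```

Returning `None` instead of raising lets a batch run continue and count unscored outputs separately, which keeps silent parsing failures from skewing averages.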
Core Entities
Models
- OpenAI GPT-4 (ChatGPT4 / GPT4-1106-preview)
- Anthropic Claude 2.1
Metrics
- S1 (truthfulness vs prevalent reality)
- S2 (consideration of animal's perspective)
Context Entities
Models
- LLaMA2
- Falcon
- ChatGPT 3.5
- Gemini

