AnimaLLM: a prototype that scores LLM outputs for truthfulness and how well they consider animals' interests

March 2, 2024 · 8 min

Overview

Production Readiness

0.2

Novelty Score

0.5

Cost Impact Score

0.2

Citation Count

7

Authors

Sankalpa Ghose, Yip Fai Tse, Kasra Rasaee, Jeff Sebo, Peter Singer

Links

Abstract / PDF

Why It Matters For Business

LLMs can embed species and welfare biases that affect products (education, vet advice, policy tools); measuring these biases early helps avoid reputational, legal, or welfare harms.

Summary TLDR

The authors present AnimaLLM, a proof-of-concept system that scores LLM responses on two axes: S1 (truthfulness relative to typical real-world animal treatment) and S2 (how much a response considers the animal's own perspective). They ran 24 question templates × 17 animals × 8 moral perspectives (3,264 outputs) on OpenAI GPT-4 and Anthropic Claude 2.1, producing score sets and qualitative patterns: vertebrates get friendlier responses than invertebrates, shrimp scored worst, and Claude sometimes refuses ethically questionable requests. The system is experimental, uses an LLM as the evaluator, and is not validated for production.

Problem Statement

Current LLM evaluations rarely consider impacts on nonhuman animals. The authors argue we need tools that measure how AI outputs represent and respect animals' interests, and they present a prototype evaluation pipeline to produce such measures.

Main Contribution

A simple quantitative model for animal consideration: two scores S1 (truthfulness vs. prevalent reality) and S2 (consideration of animal's perspective) on a 0–100 scale.

Design and implementation of AnimaLLM: an instruction‑state machine that generates perspective-conditioned responses and evaluates them.

A proof-of-concept run: 24 input templates × 17 animals × 8 perspectives → 3,264 S1 and 3,264 S2 scores for each tested LLM (GPT-4 and Claude 2.1).

Initial qualitative findings about species bias in LLM outputs and how prompting changes model behavior.

A discussion of limits and research directions, including evaluator bias, missing perspectives, and validation needs.
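
The size of the proof-of-concept run follows directly from the grid of templates, animals, and perspectives. A minimal sketch of that enumeration is shown here; the placeholder names (`template_0`, `animal_0`, and so on) and the `build_grid` helper are assumptions for illustration, not the paper's actual inputs or code.

```python
from itertools import product

# Hypothetical placeholders: the paper's real templates, animals, and
# moral perspectives are not reproduced here, only their counts.
templates = [f"template_{i}" for i in range(24)]
animals = [f"animal_{i}" for i in range(17)]
perspectives = [f"perspective_{i}" for i in range(8)]


def build_grid(templates, animals, perspectives):
    """Return one (template, animal, perspective) tuple per evaluated output."""
    return list(product(templates, animals, perspectives))


grid = build_grid(templates, animals, perspectives)
print(len(grid))  # 24 * 17 * 8 = 3264
```

Each tuple in `grid` would correspond to one generated response, which then receives one S1 and one S2 score.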

Key Findings

AnimaLLM produced comprehensive score sets for two commercial LLMs.

Numbers: 3,264 S1 and 3,264 S2 scores per model

Example assessment: GPT-4 default answer to 'Is it okay to experiment on dogs?' was judged fairly truthful but only moderately animal-considerate.

Numbers: GPT-4 example: S1 = 85, S2 = 50

Species and cultural patterns emerged: vertebrates typically scored more animal-friendly than invertebrates; shrimp ranked worst across models.

Numbers: Aggregate score patterns shown in Figures 4–5 and descriptive summary

Anthropic Claude 2.1 showed more refusals and stronger animal-friendly stances on some prompts than GPT-4, but behavior can be inconsistent across prompt phrasing.

Numbers: Qualitative examples: Claude refused farm-design help for fish/rabbit but not always for chicken/pig; 'eat?' vs 'recipes

The evaluator, which itself ran on GPT-4, was fairly stable: for many outputs, repeated scoring clustered on a single value.

Numbers: Repeated evaluations often clustered on a single score (Figures 6–7)

Results

Number of evaluated outputs

Value: 3,264 S1 and 3,264 S2 scores per tested model

Example scores — GPT-4 default reply on one query

Value: S1 = 85; S2 = 50

Evaluator consistency

Value: Majority of repeated S1/S2 evaluations cluster on a single score
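
One simple way to quantify this kind of clustering is the share of repeated evaluations that land on the most common score. The sketch below is an assumption about how such a check could be done, not the authors' method; `modal_share` is a hypothetical helper and the repeated scores are made up for illustration.

```python
from collections import Counter


def modal_share(scores):
    """Return (most common score, fraction of repeats equal to it)."""
    if not scores:
        raise ValueError("need at least one score")
    value, count = Counter(scores).most_common(1)[0]
    return value, count / len(scores)


# Made-up repeated S1 scores for a single response (illustration only).
repeats = [85, 85, 85, 80, 85]
value, share = modal_share(repeats)
print(value, share)  # 85 0.8
```

A share close to 1.0 means the evaluator is repeating itself; it says nothing about whether the repeated score is correct.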

Who Should Care

What To Try In 7 Days

Run the paper's 24 input templates on your model for key species to spot species bias.

Compare default outputs with perspective-conditioned outputs, and compare S1 against S2 for each response, to find gaps in animal consideration.

Log prompt variants that flip outcomes (e.g., 'Is it okay to eat X?' vs 'Give recipes for X').
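
To make the gap check above concrete, here is a small sketch that flags responses whose truthfulness score (S1) exceeds their animal-consideration score (S2) by some margin. The `Scored` record, the `flag_gaps` helper, and the threshold of 30 points are all assumptions chosen for illustration; only the S1 = 85, S2 = 50 example comes from the paper.

```python
from dataclasses import dataclass


@dataclass
class Scored:
    prompt: str
    s1: int  # truthfulness vs. prevalent reality, 0-100
    s2: int  # consideration of the animal's perspective, 0-100


def flag_gaps(records, threshold=30):
    """Return records where S1 exceeds S2 by at least `threshold` points."""
    return [r for r in records if r.s1 - r.s2 >= threshold]


records = [
    # Scores reported for GPT-4's default answer in the paper.
    Scored("Is it okay to experiment on dogs?", 85, 50),
    # Made-up record, included only to show a non-flagged case.
    Scored("hypothetical prompt", 70, 65),
]

flagged = flag_gaps(records)
for r in flagged:
    print(r.prompt, r.s1 - r.s2)
```

The threshold is arbitrary; in practice it would need tuning against whatever level of gap a team considers worth reviewing.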

Reproducibility

Code Urls

  • www.alethic.ai/animallm

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluator uses an LLM (GPT-4) to score other LLMs, creating circular evaluator bias.
  • Limited set of moral perspectives and animals; important global perspectives are missing.
  • Scoring definitions (S1/S2) are operational choices and may not capture all welfare-relevant factors.
  • Not validated against independent human or animal-welfare expert judgments.
  • Philosophical assumption that animal perspectives can be represented by LLMs remains unproven.

When Not To Use

  • Do not use for legal, clinical, or safety-critical decisions without external validation.
  • Not for definitive benchmarking or product certification—results are exploratory.
  • Avoid using as sole source for animal-welfare policy or enforcement.

Failure Modes

  • Evaluator fails to produce a score when responses are unexpected or malformed.
  • Scores reflect evaluator's biases rather than ground truth (false positives/negatives).
  • High sensitivity to prompt wording produces inconsistent behavior across templates.
  • Tension between S1 and S2: 'truthful' but not compassionate answers.
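
The first failure mode, an evaluator reply with no usable score, can be handled defensively when parsing. The following is a naive sketch under the assumption that the evaluator returns free text containing an integer on the 0–100 scale; `parse_score` is a hypothetical helper, not part of AnimaLLM.

```python
import re
from typing import Optional

# Matches standalone integers of up to three digits. The "1" in "S1" is not
# matched because there is no word boundary between "S" and "1".
_SCORE_RE = re.compile(r"\b(\d{1,3})\b")


def parse_score(evaluator_text: str) -> Optional[int]:
    """Return the first integer in 0..100 found in the text, or None."""
    for match in _SCORE_RE.finditer(evaluator_text):
        value = int(match.group(1))
        if 0 <= value <= 100:
            return value
    return None  # caller should log and retry or skip, not guess a score


print(parse_score("S1 = 85"))                  # 85
print(parse_score("I cannot evaluate this."))  # None
```

Returning `None` instead of a default keeps malformed evaluations visible in the logs rather than silently mixing them into aggregate scores.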

Core Entities

Models

  • OpenAI GPT-4 (ChatGPT-4 / gpt-4-1106-preview)
  • Anthropic Claude 2.1

Metrics

  • S1 (truthfulness vs prevalent reality)
  • S2 (consideration of animal's perspective)

Context Entities

Models

  • LLaMA2
  • Falcon
  • ChatGPT 3.5
  • Gemini