TreeEval: benchmark-free LLM evaluation via LLM examiner and tree planning

Overview

Decision Snapshot: Needs Validation

Method is practical and reduces query cost, but depends on access to high-quality examiner/judge LLMs and careful evaluator selection to avoid judge bias and leakage.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Xiang Li, Yunshi Lan, Chao Yang

Why It Matters For Business

TreeEval reduces evaluation cost and leak risk by generating session-unique questions and using adaptive deepening, so teams can compare models faster and with fewer paid API calls while lowering the chance of benchmark contamination.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

TreeEval is a method that uses a high-quality LLM as an examiner to generate adaptive, tree-structured question sessions and another LLM as a judge to compare two target LLMs. The system builds question trees that deepen only where models tie, which cuts the number of questions dramatically. In experiments TreeEval matched AlpacaEval2.0 rankings closely (Spearman ρ=0.83, Kendall τ=0.73) while using about 45 questions on average. Main caveats: it needs a reliable examiner/judge (GPT-4 was used) and judge bias or pretraining overlap can still affect results.

Problem Statement

Standard benchmark-based and LLM-as-judge evaluations rely on fixed test items, which can leak into model training data, and they need many such items. This makes results easy to overfit and expensive to run. We need an evaluation that (1) avoids reusable benchmarks, (2) adapts question difficulty to tell similar models apart, and (3) uses few questions while remaining reliable.

Main Contribution

A benchmark-free evaluation paradigm that uses an LLM examiner to generate session-unique, tree-structured questions so test items are hard to leak.

A tree-planning controller that deepens questions only when models tie, improving discrimination between closely matched LLMs while keeping question count low.
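The tie-driven deepening described above can be sketched as a small traversal loop. This is a minimal illustration, not the authors' implementation: `run_session`, `Node`, the `judge` and `subtopics` callables, and the `max_depth` budget are all hypothetical names, and in a real run `judge` would wrap the examiner and judge LLM calls.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    topic: str
    depth: int
    verdict: str = ""  # "A", "B", or "tie" once judged
    children: List["Node"] = field(default_factory=list)


def run_session(
    root_topics: List[str],
    judge: Callable[[str], str],            # hypothetical: topic -> "A" | "B" | "tie"
    subtopics: Callable[[str], List[str]],  # hypothetical: topic -> follow-up topics
    max_depth: int = 3,                     # assumed depth budget
    breadth_first: bool = True,
) -> List[Node]:
    """Ask one question per node and expand a node only when the verdict is a tie."""
    frontier = deque(Node(t, 0) for t in root_topics)
    visited: List[Node] = []
    while frontier:
        # A queue gives breadth-first order, a stack gives depth-first order.
        node = frontier.popleft() if breadth_first else frontier.pop()
        node.verdict = judge(node.topic)
        visited.append(node)
        # Deepen only where the two models tie and depth budget remains.
        if node.verdict == "tie" and node.depth < max_depth:
            node.children = [Node(s, node.depth + 1) for s in subtopics(node.topic)]
            frontier.extend(node.children)
    return visited
```

Because decided nodes are never expanded, the number of questions grows only along branches where the two models are hard to separate, which is where the question savings would come from under these assumptions.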

Key Findings

TreeEval rankings strongly match AlpacaEval2.0 rankings.

Numbers: Spearman ρ=0.83; Kendall τ=0.73 (Table 1)

Practical Use: You can reproduce leaderboard-style model ordering while avoiding fixed benchmarks; use TreeEval for ranking when you want fewer questions but similar ordering.

Evidence Ref: Table 1

TreeEval reaches results with very few questions.

Numbers: average #Q = 45.1 per evaluation session (Table 1)

Practical Use: Expect an order-of-magnitude reduction in evaluation queries compared with large benchmarks; this lowers API costs and speeds up comparisons.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Spearman correlation with AlpacaEval2.0	ρ = 0.83	AlpacaEval2.0 rankings	—	comparison across six open-source LLMs (Table 1)	High rank correlation shows TreeEval produces a similar model order to AlpacaEval2.0 using far fewer questions.	Table 1
Kendall correlation with AlpacaEval2.0	τ = 0.73	AlpacaEval2.0 rankings	—	comparison across six open-source LLMs (Table 1)	Kendall tau agrees with Spearman result, indicating ordinal agreement.	Table 1
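
To give a feel for what these correlations mean with only six models, the sketch below computes Spearman ρ and Kendall τ for two tie-free rankings in plain Python. The two rank lists are hypothetical, not the per-model ranks from Table 1; they were chosen only because a single model displaced by two positions happens to give values close to those reported.

```python
from itertools import combinations
from typing import Sequence


def spearman_rho(rank_x: Sequence[int], rank_y: Sequence[int]) -> float:
    """Spearman correlation for two rankings without ties."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(rank_x: Sequence[int], rank_y: Sequence[int]) -> float:
    """Kendall tau for two rankings without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_x)), 2):
        s = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical ranks for six models (NOT the values from Table 1).
reference = [1, 2, 3, 4, 5, 6]
candidate = [2, 3, 1, 4, 5, 6]  # one model moved up by two places

print(round(spearman_rho(reference, candidate), 2))  # 0.83
print(round(kendall_tau(reference, candidate), 2))   # 0.73
```

With six models there are only 15 model pairs, so each inverted pair lowers τ by about 0.13. A handful of swaps can therefore move these statistics noticeably, which is worth keeping in mind when reading the correlations.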

What To Try In 7 Days

Clone the TreeEval repo and run the provided demo with GPT-4-0613 as examiner to reproduce reported runs.

Run pairwise comparisons of two internal models to get a quick leaderboard with ~50 questions each.

Test BFS vs DFS and topic-generation toggles to see how question counts and ranking stability change for your models.

Agent Features

Memory

session memory of past questions and responses

Planning

tree planning (breadth-first evaluation)

topic sampling and question ranking

Tool Use

NER for topic extraction

Cosine similarity for ranking
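
As an illustration of similarity-based ranking, the sketch below orders candidate questions by cosine similarity. The summary does not say what is compared or in which direction, so two assumptions are made here: candidates are scored against questions already asked in the session, and the least similar candidate is preferred to avoid repetition. Bag-of-words vectors stand in for whatever embedding a real implementation would use, and `bow`, `cosine`, and `rank_candidates` are hypothetical helper names.

```python
import math
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    """Very rough bag-of-words vector (stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(candidates: List[str], asked: List[str]) -> List[str]:
    """Sort candidates from least to most similar to any already-asked question."""
    asked_vecs = [bow(q) for q in asked]

    def max_similarity(question: str) -> float:
        vec = bow(question)
        return max((cosine(vec, a) for a in asked_vecs), default=0.0)

    return sorted(candidates, key=max_similarity)
```

If the actual system instead prefers candidates closest to a sampled topic, the same `cosine` helper applies with the sort order reversed.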

Optimization Features

Token Efficiency

fewer evaluation queries reduce token/API cost

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/Ashura5/TreeEval

Risks & Boundaries

Limitations

Using GPT-4 or similar as examiner/judge can introduce bias or overlap with judge training data, which may still leak information.

Requires access to an expensive, high-quality examiner LLM to generate discriminative questions reliably.

When Not To Use

When you lack access to a neutral, high-quality examiner or judge LLM.

When you need fixed, fully reproducible benchmark scores tied to a standard dataset.

Failure Modes

Judge bias (verbosity, style, or positional biases) skews pairwise outcomes.

Examiner generates repetitive or low-quality questions, reducing discrimination.
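
A common mitigation for positional bias in pairwise judging is to query the judge twice with the answer order swapped and accept a win only when both orderings agree. The sketch below shows that idea; it is not confirmed by this summary as the authors' procedure, and `debiased_verdict` and the `judge` callable (returning "first", "second", or "tie") are hypothetical.

```python
from typing import Callable


def debiased_verdict(
    judge: Callable[[str, str, str], str],  # hypothetical: (question, first, second) -> verdict
    question: str,
    answer_a: str,
    answer_b: str,
) -> str:
    """Return "A", "B", or "tie" after judging both answer orders."""
    forward = judge(question, answer_a, answer_b)
    backward = judge(question, answer_b, answer_a)
    # Map position-based verdicts back to model labels.
    fwd = {"first": "A", "second": "B", "tie": "tie"}[forward]
    bwd = {"first": "B", "second": "A", "tie": "tie"}[backward]
    # Disagreement between the two orderings is treated as a tie.
    return fwd if fwd == bwd else "tie"
```

A judge that always favors the first position yields "tie" under this scheme, so purely positional preferences cannot decide a comparison. Verbosity and style biases are not addressed by swapping and would need separate checks.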

Core Entities

Models

TreeEval (method), GPT-4-0613 (examiner), AlpacaEval2.0 (comparison judge/leaderboard), Mistral-7B-Instruct-v0.2, Yi-34B-Chat, Xwin-LM-13B-V0.1, WizardLM-13B-V1.2, Zephyr-7B-beta, Vicuna-33B-v1.3

Metrics

Spearman rho, Kendall tau, win-rate, Accuracy, average #Q, score variance

Datasets

MMLU (re-implemented), BBH (re-implemented)

Benchmarks

AlpacaEval, AlpacaEval2.0, MT-Bench, MMLU, BBH

Context Entities

Models

Qwen1.5-110B-Chat, Meta-Llama-3-70B-Instruct, Qwen1.5-72B-Chat, Mixtral-8x7B-Instruct, chatglm2-6b, alpaca-13b, Starling-LM-7B-alpha, Vicuna series (7b/13b variants)
