TreeEval: benchmark-free LLM evaluation via LLM examiner and tree planning

February 20, 2024 · 7 min

Overview

Decision Snapshot: Needs Validation

Method is practical and reduces query cost, but depends on access to high-quality examiner/judge LLMs and careful evaluator selection to avoid judge bias and leakage.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.80

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 70%

Authors

Xiang Li, Yunshi Lan, Chao Yang

Links

Abstract / PDF / Code

Why It Matters For Business

TreeEval reduces evaluation cost and leak risk by generating session-unique questions and using adaptive deepening, so teams can compare models faster and with fewer paid API calls while lowering the chance of benchmark contamination.

Summary TLDR

TreeEval is a method that uses a high-quality LLM as an examiner to generate adaptive, tree-structured question sessions and another LLM as a judge to compare two target LLMs. The system builds question trees that deepen only where the models tie, which keeps the number of questions low. In experiments, TreeEval's rankings closely matched AlpacaEval2.0 (Spearman ρ=0.83, Kendall τ=0.73) while using about 45 questions per session on average. Main caveats: it needs a reliable examiner/judge (GPT-4 was used), and judge bias or pretraining overlap can still affect results.

Problem Statement

Standard evaluation, whether benchmark-based or LLM-as-judge, relies on fixed test sets that can leak into model training data and that require many test items. This makes results easy to overfit and expensive to run. We need an evaluation that (1) avoids reusable benchmarks, (2) adapts question difficulty to tell similar models apart, and (3) uses few questions while remaining reliable.

Main Contribution

A benchmark-free evaluation paradigm that uses an LLM examiner to generate session-unique, tree-structured questions so test items are hard to leak.

A tree-planning controller that deepens questions only when models tie, improving discrimination between closely matched LLMs while keeping question count low.
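
The tie-driven deepening described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Node`, `run_session`, `ask_and_judge`, `subtopics`, and `max_depth` are hypothetical names, and the examiner and judge LLM calls are replaced by toy stub functions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    topic: str
    depth: int
    verdict: str = ""  # "A", "B", or "tie"
    children: List["Node"] = field(default_factory=list)


def run_session(
    root_topic: str,
    ask_and_judge: Callable[[str], str],   # hypothetical: asks one question, returns verdict
    subtopics: Callable[[str], List[str]], # hypothetical: proposes follow-up topics
    max_depth: int = 3,
) -> List[Node]:
    """Breadth-first session that expands a node only when the verdict is a tie."""
    root = Node(root_topic, 0)
    queue = deque([root])
    asked: List[Node] = []
    while queue:
        node = queue.popleft()
        node.verdict = ask_and_judge(node.topic)
        asked.append(node)
        # Deepen only where the two models could not be separated.
        if node.verdict == "tie" and node.depth < max_depth:
            for sub in subtopics(node.topic):
                child = Node(sub, node.depth + 1)
                node.children.append(child)
                queue.append(child)
    return asked


# Toy stand-ins for the examiner and judge (assumptions for illustration only).
def toy_subtopics(topic: str) -> List[str]:
    return [f"{topic}/a", f"{topic}/b"]


def toy_judge(topic: str) -> str:
    return "tie" if topic in {"math", "math/a"} else "A"


nodes = run_session("math", toy_judge, toy_subtopics, max_depth=2)
topics = [n.topic for n in nodes]
print(topics)
```

In this toy run only the tied branches are expanded, so the session asks five questions instead of the seven a full depth-2 binary tree would need.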

Key Findings

TreeEval rankings strongly match AlpacaEval2.0 rankings.

Numbers: Spearman ρ=0.83; Kendall τ=0.73 (Table 1)

Practical Use: You can reproduce leaderboard-style model ordering while avoiding fixed benchmarks; use TreeEval for ranking when you want fewer questions but similar ordering.

Evidence Ref: Table 1
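
To make the two correlation figures concrete, here is a small sketch of how Spearman ρ and Kendall τ are computed for two tie-free rankings. The rankings below are a hypothetical toy example, not the rankings from the paper; they only show that, with six models, a ranking that differs from the reference by two pairwise inversions yields values of this size.

```python
from itertools import combinations
from typing import Sequence


def spearman_rho(rank_x: Sequence[int], rank_y: Sequence[int]) -> float:
    """Spearman rho for two rankings without ties (rank 1 = best)."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def kendall_tau(rank_x: Sequence[int], rank_y: Sequence[int]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs, no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_x)), 2):
        sign = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(rank_x) * (len(rank_x) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical rankings of six models (illustrative data, not from the paper).
reference_rank = [1, 2, 3, 4, 5, 6]
candidate_rank = [2, 3, 1, 4, 5, 6]

rho = spearman_rho(reference_rank, candidate_rank)
tau = kendall_tau(reference_rank, candidate_rank)
print(f"rho={rho:.2f}, tau={tau:.2f}")
```

With only six items, each pairwise inversion moves τ by 2/15, so small samples like this leave the correlations sensitive to a single swap.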

TreeEval reaches results with very few questions.

Numbers: average #Q = 45.1 per evaluation session (Table 1)

Practical Use: Expect an order-of-magnitude reduction in evaluation queries compared with large benchmarks; this lowers API costs and speeds up comparisons.

Evidence Ref: Table 1
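
The size of the saving depends on which fixed benchmark you compare against. The sketch below does that arithmetic; the benchmark size of 805 items is an assumption (the commonly cited size of the AlpacaEval instruction set) and does not come from this summary, so substitute the size of your own benchmark. It counts questions only and ignores the extra examiner and judge calls a session makes.

```python
def reduction_factor(benchmark_questions: int, session_questions: float) -> float:
    """How many times fewer questions a session uses than a fixed benchmark."""
    if session_questions <= 0:
        raise ValueError("session_questions must be positive")
    return benchmark_questions / session_questions


TREEEVAL_AVG_QUESTIONS = 45.1  # average #Q reported in the source (Table 1)
ASSUMED_BENCHMARK_SIZE = 805   # assumption: fixed benchmark size, not from the source

factor = reduction_factor(ASSUMED_BENCHMARK_SIZE, TREEEVAL_AVG_QUESTIONS)
print(f"about {factor:.1f}x fewer questions per comparison")
```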

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Spearman correlation with AlpacaEval2.0 | ρ = 0.83 | AlpacaEval2.0 rankings | - | comparison across six open-source LLMs (Table 1) | High rank correlation shows TreeEval produces a similar model order to AlpacaEval2.0 using far fewer questions. | Table 1
Kendall correlation with AlpacaEval2.0 | τ = 0.73 | AlpacaEval2.0 rankings | - | comparison across six open-source LLMs (Table 1) | Kendall tau agrees with Spearman result, indicating ordinal agreement. | Table 1

What To Try In 7 Days

Clone the TreeEval repo and run the provided demo with GPT-4-0613 as examiner to reproduce reported runs.

Run pairwise comparisons of two internal models to get a quick leaderboard with ~50 questions each.

Test BFS vs DFS and topic-generation toggles to see how question counts and ranking stability change for your models.
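
For the BFS vs DFS comparison, the difference comes down to which end of the frontier the next topic is taken from. The sketch below shows the two visiting orders on a hypothetical topic tree; `visit_order` and the tree contents are illustrative names, not options from the TreeEval repo.

```python
from collections import deque
from typing import Dict, List


def visit_order(tree: Dict[str, List[str]], root: str, strategy: str = "bfs") -> List[str]:
    """Return the order in which topics are visited under BFS or DFS."""
    frontier = deque([root])
    order: List[str] = []
    while frontier:
        # BFS takes the oldest entry, DFS takes the newest one.
        node = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(node)
        children = tree.get(node, [])
        # For DFS, push children reversed so the leftmost child is visited first.
        frontier.extend(children if strategy == "bfs" else reversed(children))
    return order


# Hypothetical topic tree (illustrative only).
topic_tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}

bfs_order = visit_order(topic_tree, "root", "bfs")
dfs_order = visit_order(topic_tree, "root", "dfs")
print(bfs_order)
print(dfs_order)
```

Under a fixed question budget, BFS covers every top-level topic before going deeper, while DFS may exhaust the budget inside one branch, which is one reason ranking stability can differ between the two.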

Agent Features

Memory
session memory of past questions and responses
Planning
tree planning (breadth-first evaluation)
topic sampling and question ranking
Tool Use
NER for topic extraction
Cosine similarity for ranking
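
As a rough illustration of cosine-similarity ranking, the sketch below scores candidate questions against a reference text using bag-of-words vectors. The summary does not say which text representation TreeEval uses or in which direction candidates are ranked, so the bag-of-words vectors, the most-similar-first order, and the names `cosine` and `rank_by_similarity` are all assumptions.

```python
import math
from collections import Counter
from typing import List


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rank_by_similarity(reference: str, candidates: List[str]) -> List[str]:
    """Sort candidates from most to least similar to the reference text."""
    return sorted(candidates, key=lambda c: cosine(reference, c), reverse=True)


reference = "solve a linear equation"
candidates = [
    "solve a quadratic equation",
    "name the capital of France",
    "solve a linear equation for x",
]
ranked = rank_by_similarity(reference, candidates)
print(ranked)
```

Reversing the sort order would instead favor candidates that differ most from what has already been asked, which is one way to discourage repetitive questions.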

Optimization Features

Token Efficiency
fewer evaluation queries reduces token/API cost

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Using GPT-4 or a similar model as examiner/judge can introduce judge bias, and overlap with the judge's training data means some information may still leak.

Requires access to an expensive, high-quality examiner LLM to generate discriminative questions reliably.

When Not To Use

When you lack access to a neutral, high-quality examiner or judge LLM.

When you need fixed, fully reproducible benchmark scores tied to a standard dataset.

Failure Modes

Judge bias (verbosity, style, or positional biases) skews pairwise outcomes.

Examiner generates repetitive or low-quality questions, reducing discrimination.
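
One common mitigation for positional bias is to query the judge twice with the answer order swapped and accept a winner only when both verdicts agree. The sketch below shows that idea; the summary does not state whether or how TreeEval does this, so `debiased_verdict` and the judge interface (returning "first", "second", or "tie") are assumptions, and the judges are toy stubs rather than LLM calls.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # (question, first, second) -> "first" | "second" | "tie"


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie" after judging both answer orders."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    # Inconsistent or tied verdicts are treated as a tie.
    return "tie"


# Toy judges (assumptions for illustration only).
def position_biased_judge(question: str, first: str, second: str) -> str:
    return "first"  # always prefers whichever answer is shown first


def length_judge(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


biased_result = debiased_verdict(position_biased_judge, "q", "short", "a longer answer")
consistent_result = debiased_verdict(length_judge, "q", "short", "a longer answer")
print(biased_result, consistent_result)
```

The swap check catches the purely positional judge, but note that the length-based toy judge still passes it, so verbosity and style biases need separate controls.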

Core Entities

Models

TreeEval (method), GPT-4-0613 (examiner), AlpacaEval2.0 (comparison judge/leaderboard), Mistral-7B-Instruct-v0.2, Yi-34B-Chat, Xwin-LM-13B-V0.1, WizardLM-13B-V1.2, Zephyr-7B-beta, Vicuna-33B-v1.3

Metrics

Spearman rho, Kendall tau, win-rate, Accuracy, average #Q, score variance

Datasets

MMLU (re-implemented), BBH (re-implemented)

Benchmarks

AlpacaEval, AlpacaEval2.0, MT-Bench, MMLU, BBH

Context Entities

Models

Qwen1.5-110B-Chat, Meta-Llama-3-70B-Instruct, Qwen1.5-72B-Chat, Mixtral-8x7B-Instruct, chatglm2-6b, alpaca-13b, Starling-LM-7B-alpha, Vicuna series (7b/13b variants)