Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.6
Citation Count
0
Why It Matters For Business
TreeEval reduces evaluation cost and leak risk by generating session-unique questions and using adaptive deepening, so teams can compare models faster and with fewer paid API calls while lowering the chance of benchmark contamination.
Summary TLDR
TreeEval is a method that uses a high-quality LLM as an examiner to generate adaptive, tree-structured question sessions and another LLM as a judge to compare two target LLMs. The system builds question trees that deepen only where models tie, which cuts the number of questions dramatically. In experiments TreeEval matched AlpacaEval2.0 rankings closely (Spearman ρ=0.83, Kendall τ=0.73) while using about 45 questions on average. Main caveats: it needs a reliable examiner/judge (GPT-4 was used) and judge bias or pretraining overlap can still affect results.
Problem Statement
Standard benchmark-based and LLM-as-judge evaluations rely on fixed test items, which can leak into model training data, and they need many such items to produce a ranking. This makes results easy to overfit and expensive to run. We need an evaluation that (1) avoids reusable benchmarks, (2) adapts question difficulty to tell similar models apart, and (3) uses few questions while remaining reliable.
Main Contribution
A benchmark-free evaluation paradigm that uses an LLM examiner to generate session-unique, tree-structured questions so test items are hard to leak.
A tree-planning controller that deepens questions only when models tie, improving discrimination between closely matched LLMs while keeping question count low.
Empirical validation showing strong agreement with AlpacaEval2.0 rankings using far fewer questions, plus ablations that identify key design components.
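The tie-driven deepening described above can be sketched as a breadth-first loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, `run_session`, and the `ask`, `judge`, and `subtopics` callables are all hypothetical stand-ins for the examiner, the judge, and the topic generator.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    topic: str
    depth: int
    verdict: str = ""  # "A", "B", or "tie" once judged
    children: List["Node"] = field(default_factory=list)


def run_session(
    root_topics: List[str],
    ask: Callable[[str], str],              # hypothetical examiner: topic -> question
    judge: Callable[[str], str],            # hypothetical judge: question -> "A" | "B" | "tie"
    subtopics: Callable[[str], List[str]],  # hypothetical topic generator
    max_depth: int = 3,
) -> List[Node]:
    """Visit topics breadth-first; expand a node only when the judge reports a tie."""
    queue = deque(Node(topic, 0) for topic in root_topics)
    visited: List[Node] = []
    while queue:
        node = queue.popleft()
        node.verdict = judge(ask(node.topic))
        visited.append(node)
        # A decisive verdict ends this branch; a tie asks for finer-grained questions.
        if node.verdict == "tie" and node.depth < max_depth:
            for sub in subtopics(node.topic):
                child = Node(sub, node.depth + 1)
                node.children.append(child)
                queue.append(child)
    return visited


# Toy stubs: the two models tie on "math" only, so only that branch is deepened.
toy_subtopics = {"math": ["algebra", "geometry"]}
visited = run_session(
    root_topics=["math", "history"],
    ask=lambda topic: f"Explain {topic}.",
    judge=lambda question: "tie" if "math" in question else "A",
    subtopics=lambda topic: toy_subtopics.get(topic, []),
)
visited_topics = [node.topic for node in visited]
visited_verdicts = [node.verdict for node in visited]
```

In this toy run, four questions are asked in total: the two root topics plus the two subtopics under the tied branch. Swapping `queue.popleft()` for `queue.pop()` would turn the traversal into a depth-first one.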
Key Findings
TreeEval rankings strongly match AlpacaEval2.0 rankings.
TreeEval reaches results with very few questions.
Tree planning and controller choices matter for efficiency and accuracy.
TreeEval produces stable scores across runs with low variance.
Results
Spearman correlation with AlpacaEval2.0
0.83
Kendall correlation with AlpacaEval2.0
0.73
Average number of questions per session
about 45
Ablation (BFS→DFS)
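For readers who want to check rank agreement on their own models, the two correlation metrics above can be computed as follows. This is a sketch for rankings without ties, using the standard closed-form Spearman formula and pair counting for Kendall; the model names and ranks are made up for illustration and are not results from the paper.

```python
from itertools import combinations
from typing import Dict


def spearman_rho(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """Spearman rho for two tie-free rankings: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """Kendall tau for two tie-free rankings: (concordant - discordant) / total pairs."""
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        sign = (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical ranks (1 = best) from two evaluation methods over four models.
method_1 = {"m1": 1, "m2": 2, "m3": 3, "m4": 4}
method_2 = {"m1": 1, "m2": 3, "m3": 2, "m4": 4}

rho = spearman_rho(method_1, method_2)
tau = kendall_tau(method_1, method_2)
```

Here a single swapped pair out of six gives `rho` of 0.8 and `tau` of about 0.667, which shows why Kendall values tend to be lower than Spearman values for the same pair of rankings.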
Who Should Care
What To Try In 7 Days
Clone the TreeEval repo and run the provided demo with GPT-4-0613 as examiner to reproduce reported runs.
Run pairwise comparisons of two internal models to get a quick leaderboard with ~50 questions each.
Test BFS vs DFS and topic-generation toggles to see how question counts and ranking stability change for your models.
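A quick leaderboard and a stability check from repeated sessions might look like the sketch below. It assumes each session yields a list of per-question verdicts against a shared baseline and that a tie counts as half a win; both the data layout and that scoring rule are illustrative assumptions, not TreeEval's documented scoring.

```python
from statistics import mean, pvariance
from typing import Dict, List


def win_rate(verdicts: List[str]) -> float:
    """Fraction of questions won; a tie counts as half a win (an assumption)."""
    points = {"win": 1.0, "tie": 0.5, "loss": 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)


# Hypothetical verdicts: two sessions per model, four questions per session.
runs: Dict[str, List[List[str]]] = {
    "model_x": [["win", "win", "tie", "loss"], ["win", "tie", "tie", "loss"]],
    "model_y": [["loss", "tie", "loss", "win"], ["loss", "loss", "tie", "win"]],
}

# Per-model mean win-rate and run-to-run variance.
summary = {
    name: {
        "mean": mean(win_rate(s) for s in sessions),
        "variance": pvariance([win_rate(s) for s in sessions]),
    }
    for name, sessions in runs.items()
}

# Leaderboard sorted by mean win-rate, best first.
leaderboard = sorted(summary.items(), key=lambda item: item[1]["mean"], reverse=True)
```

A large variance relative to the gap between two models' mean win-rates suggests the ordering of that pair is not yet stable and more sessions are needed.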
Agent Features
Memory
- session memory of past questions and responses
Planning
- tree planning (breadth-first evaluation)
- topic sampling and question ranking
Tool Use
- NER for topic extraction
- Cosine similarity for ranking
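One way cosine similarity could be used to rank candidate questions is sketched below. The summary does not say what is compared or in which direction, so this example assumes the goal is to prefer candidates least similar to questions already asked, and it uses simple bag-of-words vectors in place of whatever embedding the method actually uses. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import List


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (a stand-in for a real embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_by_novelty(candidates: List[str], asked: List[str]) -> List[str]:
    """Order candidates so the one least similar to any asked question comes first."""
    asked_vecs = [vectorize(q) for q in asked]

    def max_similarity(candidate: str) -> float:
        vec = vectorize(candidate)
        return max((cosine(vec, a) for a in asked_vecs), default=0.0)

    return sorted(candidates, key=max_similarity)


asked = ["what is a binary search tree"]
candidates = [
    "what is a binary heap",
    "how does tcp congestion control work",
    "what is a binary search tree invariant",
]
ranked = rank_by_novelty(candidates, asked)
```

With these inputs the TCP question ranks first because it shares no terms with the question already asked, and the near-duplicate about tree invariants ranks last.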
Optimization Features
Token Efficiency
- fewer evaluation queries reduce token/API cost
Reproducibility
Code Urls
Code Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Using GPT-4 or a similar model as examiner/judge can introduce judge bias, and generated questions may overlap with pretraining data, so some leakage risk remains.
- Requires access to an expensive, high-quality examiner LLM to generate discriminative questions reliably.
- Topic extraction (NER) and examiner prompts can produce low-quality or off-topic questions if not tuned.
When Not To Use
- When you lack access to a neutral, high-quality examiner or judge LLM.
- When you need fixed, fully reproducible benchmark scores tied to a standard dataset.
- When licensing or audit constraints forbid using third-party LLMs as evaluators.
Failure Modes
- Judge bias (verbosity, style, or positional biases) skews pairwise outcomes.
- Examiner generates repetitive or low-quality questions, reducing discrimination.
- Poor baseline selection for pairwise comparisons distorts ranking order.
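Positional bias in particular has a common mitigation: query the judge twice with the answer order swapped and accept a winner only when both passes agree. The sketch below illustrates that general technique; it is not described in this summary as part of TreeEval, and the `judge` callable and its "first"/"second"/"tie" return values are hypothetical.

```python
from typing import Callable

Judge = Callable[[str, str, str], str]  # (question, first, second) -> "first" | "second" | "tie"


def debiased_verdict(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings; return "A" or "B" only if the two passes agree, else "tie"."""
    forward = {"first": "A", "second": "B", "tie": "tie"}[
        judge(question, answer_a, answer_b)
    ]
    swapped = {"first": "B", "second": "A", "tie": "tie"}[
        judge(question, answer_b, answer_a)
    ]
    return forward if forward == swapped else "tie"


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure positional bias: it always picks whatever is shown first."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge with verbosity bias: it always picks the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


positional_result = debiased_verdict(always_first, "q", "short", "a much longer answer")
verbosity_result = debiased_verdict(prefers_longer, "q", "short", "a much longer answer")
```

The swap neutralizes the purely positional judge, which ends in a tie, but the verbosity-biased judge still picks the longer answer in both orderings. Order swapping therefore addresses only one of the biases listed above.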
Core Entities
Models
- TreeEval (method)
- GPT-4-0613 (examiner)
- AlpacaEval2.0 (comparison judge/leaderboard)
- Mistral-7B-Instruct-v0.2
- Yi-34B-Chat
- Xwin-LM-13B-V0.1
- WizardLM-13B-V1.2
- Zephyr-7B-beta
- Vicuna-33B-v1.3
Metrics
- Spearman rho
- Kendall tau
- win-rate
- Accuracy
- average #Q
- score variance
Datasets
- MMLU (re-implemented)
- BBH (re-implemented)
Benchmarks
- AlpacaEval
- AlpacaEval2.0
- MT-Bench
- MMLU
- BBH
Context Entities
Models
- Qwen1.5-110B-Chat
- Meta-Llama-3-70B-Instruct
- Qwen1.5-72B-Chat
- Mixtral-8x7B-Instruct
- chatglm2-6b
- alpaca-13b
- Starling-LM-7B-alpha
- Vicuna series (7b/13b variants)

