AgentRecBench: first public benchmark and simulator for LLM-based agentic recommender systems

Overview

Decision Snapshot: Needs Validation

The benchmark is a usable, public evaluation platform validated by an open challenge. Its limits are a text-only environment and a single-agent focus, which lower its immediate production fit for multimodal or multi-agent products.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li

Why It Matters For Business

Agentic, LLM-based recommender pipelines with platform-aware feature extraction can deliver large gains in top-N accuracy on text-rich platforms, and the public benchmark helps measure real-world improvements quickly.

Who Should Care

CTO, Product Manager, ML Engineer, Engineering Lead, Data Scientist

Summary TLDR

This paper introduces AgentRecBench: a public text-based simulator, a merged dataset (Yelp, Goodreads, Amazon), and a modular agent framework for evaluating LLM-powered recommender agents. It defines three scenarios (classic, evolving-interest, and cold-start) and compares 10+ agents alongside traditional baselines. Strong agent designs (e.g., Baseline666) beat MF/LightGCN on many tasks. The benchmark was exercised in an open challenge (295 teams, 1,400+ submissions) and is available with a public leaderboard.

Problem Statement

There is no standard way to test LLM-driven, agentic recommender systems across practical challenges like time-varying interests and cold-starts. That gap makes it hard to compare designs, reproduce results, and give concrete engineering guidance.

Main Contribution

A textual interaction simulator that merges Yelp, Goodreads, and Amazon into a unified User-Review-Item (U-R-I) network with a standardized query API.

Three evaluation scenarios: classic recommendation, evolving-interest (long/short windows), and user/item cold-start.
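To make the U-R-I idea concrete, here is a minimal sketch of a User-Review-Item store with a small query interface. Every name in it (`Review`, `URINetwork`, `reviews_by_user`, `is_cold_start_user`, the `since` parameter, the integer timestamps) is a hypothetical stand-in chosen for illustration; the benchmark's actual query API may look quite different. The `since` filter hints at how a short history window for the evolving-interest scenario could be selected, and the review-count check hints at how a cold-start user could be identified.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    user_id: str
    item_id: str
    text: str
    timestamp: int  # assumed integer time unit, e.g. Unix seconds


class URINetwork:
    """Toy User-Review-Item store (illustrative, not the benchmark's API)."""

    def __init__(self, reviews):
        self._by_user = defaultdict(list)
        self._by_item = defaultdict(list)
        # Index every review under both its user and its item, in time order.
        for r in sorted(reviews, key=lambda r: r.timestamp):
            self._by_user[r.user_id].append(r)
            self._by_item[r.item_id].append(r)

    def reviews_by_user(self, user_id, since=None):
        """A user's reviews in time order; `since` keeps only recent ones."""
        history = self._by_user.get(user_id, [])
        if since is None:
            return list(history)
        return [r for r in history if r.timestamp >= since]

    def reviews_for_item(self, item_id):
        """All reviews written about one item, in time order."""
        return list(self._by_item.get(item_id, []))

    def is_cold_start_user(self, user_id, max_reviews=0):
        """True when the user has at most `max_reviews` known reviews."""
        return len(self._by_user.get(user_id, [])) <= max_reviews


reviews = [
    Review("u1", "i1", "Great tacos", 100),
    Review("u1", "i2", "Slow service", 300),
    Review("u2", "i1", "Loved it", 200),
]
net = URINetwork(reviews)
long_window = net.reviews_by_user("u1")              # full history
short_window = net.reviews_by_user("u1", since=250)  # recent history only
```

Keeping both a per-user and a per-item index means an agent can walk the network in either direction (user to items, item to reviewers) without scanning all reviews.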

Key Findings

Well-engineered agentic systems substantially outperform classical baselines on the benchmarked tasks.

Numbers: Baseline666 HR@N up to 69.0% vs MF/LightGCN 15.0% (Amazon, classic)

Practical Use: If you switch from MF/LightGCN to a strong agentic workflow with platform-aware feature engineering, expect large gains in top-N hit rates on similar text-rich datasets.

Evidence Ref: Table 2

Agent designs that extract platform-specific item and review features drive the best results.

Numbers: Top agents (Baseline666/DummyAgent/RecHackers) score 44–71% HR@N on many Amazon/Goodreads splits vs 15% baseline

Practical Use: Prioritize item- and review-side feature engineering and platform-adaptive pipelines before changing model family.

Evidence Ref: Tables 2, 5, 6; Appendix case studies

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HR@N (Top-N hit rate)	Baseline666 up to 69.0%	MF / LightGCN 15.0%	+54.0 pp	Amazon (classic, Table 2)	Table 2 reports Baseline666 69.0% vs MF 15.0%	Table 2
HR@N (cold-start user)	Baseline666 48.7%	MF 15.0%	+33.7 pp	Amazon (cold-start, Table 3)	Table 3 shows Baseline666 48.7% vs MF 15.0%	Table 3
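For readers who want to check the table's arithmetic, the sketch below computes a standard top-N hit rate and the percentage-point deltas shown above. It assumes one ground-truth item per test case and the usual definition of HR@N; how the benchmark aggregates over different values of N is not stated here, so treat `hit_rate_at_n` and `delta_pp` as illustrative helpers rather than the official scorer.

```python
def hit_rate_at_n(ranked_lists, ground_truth, n):
    """Fraction of test cases whose true item appears in the top-n ranking."""
    if not ranked_lists:
        raise ValueError("need at least one ranked list")
    if len(ranked_lists) != len(ground_truth):
        raise ValueError("rankings and ground truth must have equal length")
    hits = sum(
        1
        for ranking, target in zip(ranked_lists, ground_truth)
        if target in ranking[:n]
    )
    return hits / len(ranked_lists)


def delta_pp(value_pct, baseline_pct):
    """Absolute gap in percentage points (not a relative improvement)."""
    return round(value_pct - baseline_pct, 1)


# Four toy test cases: three ranked candidates each, one true item each.
rankings = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
truth = ["a", "e", "x", "l"]

hr1 = hit_rate_at_n(rankings, truth, 1)  # only "a" is ranked first
hr3 = hit_rate_at_n(rankings, truth, 3)  # "a", "e", and "l" are in the top 3

classic_delta = delta_pp(69.0, 15.0)     # Amazon classic row
cold_start_delta = delta_pp(48.7, 15.0)  # Amazon cold-start row
```

Note that the deltas are absolute differences in percentage points; the relative improvement over a 15.0% baseline would be a much larger number and should not be confused with them.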

What To Try In 7 Days

Run the AgentRecBench simulator on a small sample and compare a simple agent vs LightGCN using HR@1/3/5.

Prototype item- and review-side feature extraction for your platform and plug features into a BaseAgent workflow.

Submit a baseline to the public leaderboard or mirror its evaluation to track incremental gains.
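As a starting point for the feature-extraction step, here is a small sketch that turns item-side fields and review-side text into a compact profile string an agent could place in its prompt. The field names (`title`, `category`, `avg_rating`), the stopword list, and the helper names `review_keywords` and `item_profile` are all assumptions made for this example; real platforms will expose different fields, and the top-performing agents likely use richer features than simple keyword counts.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = frozenset(
    {"the", "a", "an", "and", "was", "is", "it", "but", "very", "to", "of"}
)


def review_keywords(review_texts, top_k=3):
    """Most frequent non-stopword tokens across reviews (ties broken A-Z)."""
    counts = Counter()
    for text in review_texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered[:top_k]]


def item_profile(item, review_texts):
    """Combine item-side fields and review-side keywords into one line."""
    keywords = ", ".join(review_keywords(review_texts))
    return (
        f"{item['title']} [{item['category']}] "
        f"avg_rating={item['avg_rating']:.1f}; reviewers mention: {keywords}"
    )


item = {"title": "Taco Stand", "category": "Mexican", "avg_rating": 4.5}
texts = [
    "The tacos were great",
    "Great tacos but slow service",
    "Tacos and salsa",
]
profile = item_profile(item, texts)
```

Producing one short line per candidate keeps prompts small, which matters when an agent has to compare many candidates in a single LLM call.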

Agent Features

Memory

short-term / episodic memory module, memory retrieval to inform decisions

Planning

dynamic planning / task decomposition, information-seeking actions

Tool Use

standardized query API, U-R-I retrieval calls

Frameworks

dynamic planning, complex reasoning (CoT), tool utilization, memory management

Is Agentic

Yes

Architectures

LLM agent (instruction-tuned families), modular agent framework
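The memory, planning, and tool-use features listed above can be pictured as a simple loop: plan a set of information-seeking calls, run them through tools, store the observations in memory, then rank. The sketch below is a toy version under loud assumptions: `SketchAgent`, its `plan` and `run` methods, the tool names, and the word-overlap ranker (standing in for an LLM call) are all hypothetical and are not the benchmark's framework classes.

```python
from dataclasses import dataclass, field
from typing import Callable

# Toy data standing in for the simulator's contents.
HISTORY = {"u1": ["loved the spicy tacos", "spicy ramen was great"]}
ITEMS = {
    "i1": "mild vanilla ice cream",
    "i2": "spicy tacos and salsa",
    "i3": "ramen bar",
}


@dataclass
class SketchAgent:
    tools: dict      # tool name -> callable taking one argument
    rank: Callable   # stand-in for the LLM reasoning step
    memory: list = field(default_factory=list)

    def plan(self, task):
        """Decompose the task into information-seeking tool calls."""
        steps = [("user_history", task["user_id"])]
        steps += [("item_info", c) for c in task["candidates"]]
        return steps

    def run(self, task):
        """Execute the plan, record observations, then rank candidates."""
        for tool_name, arg in self.plan(task):
            observation = self.tools[tool_name](arg)
            self.memory.append((tool_name, arg, observation))
        return self.rank(task["candidates"], self.memory)


def overlap_ranker(candidates, memory):
    """Rank candidates by word overlap with the user's past reviews."""
    history_words, item_words = set(), {}
    for tool_name, arg, observation in memory:
        if tool_name == "user_history":
            for text in observation:
                history_words.update(text.split())
        elif tool_name == "item_info":
            item_words[arg] = set(observation.split())
    return sorted(candidates, key=lambda c: -len(item_words[c] & history_words))


tools = {
    "user_history": lambda user_id: HISTORY.get(user_id, []),
    "item_info": lambda item_id: ITEMS[item_id],
}
agent = SketchAgent(tools=tools, rank=overlap_ranker)
ranking = agent.run({"user_id": "u1", "candidates": ["i1", "i2", "i3"]})
```

Because the ranker is injected as a plain callable, swapping the overlap heuristic for a real LLM call would leave the planning and memory code unchanged, which is the appeal of a modular design.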

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://huggingface.co/datasets/SGJQovo/AgentRecBench https://tsinghua-fib-lab.github.io/AgentSocietyChallenge/pages/overview.html

Data URLs

https://www.yelp.com/dataset https://sites.google.com/eng.ucsd.edu/ucsdbookgraph/home https://amazon-reviews-2023.github.io/

Risks & Boundaries

Limitations

Environment is text-only; no images or video support yet.

Current benchmark focuses on single-agent pipelines; multi-agent interactions are not evaluated.

When Not To Use

When your product relies heavily on images or video signals.

When you need to evaluate multi-agent or competitive agent setups.

Failure Modes

Poor performance on domains with sparse textual signals (low Yelp HRs reported).

High variance across datasets and LLM families; top agents may require heavy feature engineering.

Core Entities

Models

Qwen-72B-Instruct, DeepSeekv3, GPT-4o-mini, Baseline666, DummyAgent, RecHackers, Agent4Rec, BaseAgent, CoTAgent, MemoryAgent, CoTMemAgent, MF, LightGCN

Metrics

HR@N (Hit Rate@N), Accuracy

Datasets

Amazon, Goodreads, Yelp

Benchmarks

AgentRecBench
