AgentRecBench: first public benchmark and simulator for LLM-based agentic recommender systems

May 26, 2025 · 7 min

Overview

Decision Snapshot: Needs Validation

The benchmark is a usable public evaluation platform, validated by an open challenge. Its limits are a text-only environment and a single-agent focus, which reduce its immediate production fit for multimodal or multi-agent products.

Citations: 0

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 70%

Authors

Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Agentic, LLM-based recommender pipelines with platform-aware feature extraction can deliver large gains in top-N accuracy on text-rich platforms, and the public benchmark helps measure real-world improvements quickly.

Who Should Care

Summary TLDR

This paper introduces AgentRecBench: a public text-based simulator, a merged dataset (Yelp, Goodreads, Amazon), and a modular agent framework for evaluating LLM-powered recommender agents. It defines three scenarios (classic, evolving-interest, and cold-start) and compares 10+ agents against traditional baselines. Strong agent designs (e.g., Baseline666) beat MF/LightGCN on many tasks. The benchmark was used in an open challenge (295 teams, 1,400+ submissions) and is available with a leaderboard.

Problem Statement

There is no standard way to test LLM-driven, agentic recommender systems across practical challenges like time-varying interests and cold-starts. That gap makes it hard to compare designs, reproduce results, and give concrete engineering guidance.

Main Contribution

A textual interaction simulator that merges Yelp, Goodreads, and Amazon into a unified User-Review-Item (U-R-I) network with a standardized query API.

Three evaluation scenarios: classic recommendation, evolving-interest (long/short windows), and user/item cold-start.
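To make the U-R-I idea concrete, here is a minimal sketch of a User-Review-Item store with a small query interface. It is not the benchmark's actual API: the class `URINetwork`, the `Review` fields, and every method name are assumptions chosen for illustration. The `since` filter hints at how a short or long history window could be selected for the evolving-interest scenario, and `is_cold_start_user` shows one plausible way to flag cold-start users.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    user_id: str
    item_id: str
    text: str
    timestamp: int  # assumed integer time; the real schema may differ


class URINetwork:
    """Toy User-Review-Item store with a query interface (hypothetical)."""

    def __init__(self):
        self._items = {}                    # item_id -> metadata dict
        self._by_user = defaultdict(list)   # user_id -> list of Review
        self._by_item = defaultdict(list)   # item_id -> list of Review

    def add_item(self, item_id, **metadata):
        self._items[item_id] = metadata

    def add_review(self, review):
        self._by_user[review.user_id].append(review)
        self._by_item[review.item_id].append(review)

    def get_item(self, item_id):
        return self._items.get(item_id)

    def reviews_by_user(self, user_id, since=None):
        """Reviews by a user in time order, optionally only those at or after `since`."""
        reviews = self._by_user.get(user_id, [])
        if since is not None:
            reviews = [r for r in reviews if r.timestamp >= since]
        return sorted(reviews, key=lambda r: r.timestamp)

    def reviews_for_item(self, item_id):
        return list(self._by_item.get(item_id, []))

    def is_cold_start_user(self, user_id, max_reviews=0):
        """True when the user has at most `max_reviews` reviews on record."""
        return len(self._by_user.get(user_id, [])) <= max_reviews
```

Indexing reviews by both user and item keeps each lookup cheap, which matters when an agent issues many retrieval calls per recommendation.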

Key Findings

Well-engineered agentic systems substantially outperform classical baselines on the benchmarked tasks.

Numbers: Baseline666 HR@N up to 69.0% vs MF/LightGCN 15.0% (Amazon, classic)

Practical Use: If you switch from MF/LightGCN to a strong agentic workflow with platform-aware feature engineering, expect large gains in top-N hit rates on similar text-rich datasets.

Evidence Ref: Table 2

Agent designs that extract platform-specific item and review features drive the best results.

Numbers: Top agents (Baseline666/DummyAgent/RecHackers) score 44–71% HR@N on many Amazon/Goodreads splits vs 15% baseline

Practical Use: Prioritize item- and review-side feature engineering and platform-adaptive pipelines before changing model family.

Evidence Ref: Tables 2, 5, 6; Appendix case studies
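The second finding can be illustrated with a small platform-aware feature extractor. This is a sketch under stated assumptions, not the method used by any of the top agents: the field lists in `PLATFORM_FIELDS`, the record layout (plain dicts with `text` and `timestamp` keys), and all function names are hypothetical.

```python
# Hypothetical per-platform field lists; the real datasets expose other fields.
PLATFORM_FIELDS = {
    "amazon": ["title", "category", "price"],
    "goodreads": ["title", "author", "genres"],
    "yelp": ["name", "categories", "city"],
}


def item_features(platform, item):
    """Keep only the fields assumed useful for this platform, skipping empty ones."""
    fields = PLATFORM_FIELDS[platform]
    return {f: item[f] for f in fields if item.get(f) not in (None, "")}


def review_features(reviews, max_reviews=3, max_chars=200):
    """Take the most recent reviews and truncate each to a character budget."""
    recent = sorted(reviews, key=lambda r: r["timestamp"], reverse=True)
    return [r["text"][:max_chars] for r in recent[:max_reviews]]


def build_prompt_context(platform, item, reviews):
    """Render item and review features as plain text for an LLM prompt."""
    lines = [f"{k}: {v}" for k, v in item_features(platform, item).items()]
    lines += [f"review: {t}" for t in review_features(reviews)]
    return "\n".join(lines)
```

Keeping the platform-specific part in a single lookup table means a new platform can be supported by adding one entry rather than changing the pipeline.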

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HR@N (Top-N hit rate) | Baseline666 up to 69.0% | MF / LightGCN 15.0% | +54.0 pp | Amazon (classic, Table 2) | Table 2 reports Baseline666 69.0% vs MF 15.0% | Table 2
HR@N (cold-start user) | Baseline666 48.7% | MF 15.0% | +33.7 pp | Amazon (cold-start, Table 3) | Table 3 shows Baseline666 48.7% vs MF 15.0% | Table 3
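The deltas above are absolute differences in percentage points (pp), not relative improvements. The short sketch below recomputes them from the reported values and also derives the ratio to the baseline, which is simple arithmetic on the same numbers rather than a figure stated in the paper. The helper names are illustrative.

```python
def delta_pp(value_pct, baseline_pct):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value_pct - baseline_pct, 1)


def ratio_to_baseline(value_pct, baseline_pct):
    """How many times larger the value is than the baseline, to two decimals."""
    return round(value_pct / baseline_pct, 2)


# (split, Baseline666 HR@N in %, MF HR@N in %) as reported in the rows above
rows = [
    ("Amazon, classic", 69.0, 15.0),
    ("Amazon, cold-start user", 48.7, 15.0),
]

for split, value, baseline in rows:
    print(split, delta_pp(value, baseline), "pp,",
          ratio_to_baseline(value, baseline), "x baseline")
```

Reporting both views guards against a common misreading: a +54.0 pp gain over a 15.0% baseline is a 4.6x hit rate, not a 54% relative improvement.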

What To Try In 7 Days

Run the AgentRecBench simulator on a small sample and compare a simple agent vs LightGCN using HR@1/3/5.

Prototype item- and review-side feature extraction for your platform and plug features into a BaseAgent workflow.

Submit a baseline to the public leaderboard or mirror its evaluation to track incremental gains.
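For the first step, a comparison at HR@1/3/5 can be mirrored locally with a few lines. The sketch assumes the standard definition of hit rate (the share of evaluation cases whose held-out item appears in the top N of the ranked list) and a simple dict-based data layout. Neither the function names nor the layout come from the benchmark's code.

```python
def hit_at_n(ranked_items, target, n):
    """1 if the held-out target is among the top-n ranked items, else 0."""
    return int(target in ranked_items[:n])


def hit_rate(rankings, targets, n):
    """Mean hit@n over all evaluation cases, as a fraction in [0, 1]."""
    if not targets:
        raise ValueError("no evaluation cases")
    hits = sum(hit_at_n(rankings[case], target, n)
               for case, target in targets.items())
    return hits / len(targets)


def compare(systems, targets, cutoffs=(1, 3, 5)):
    """Return {system_name: {n: HR@n}} for each system's rankings."""
    return {
        name: {n: hit_rate(rankings, targets, n) for n in cutoffs}
        for name, rankings in systems.items()
    }
```

Here `targets` maps each evaluation case (for example a user ID) to its held-out item, and `systems` maps a system name to that system's ranked list per case. Evaluating every system on the same `targets` keeps the comparison between an agent and a baseline such as LightGCN like for like.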

Agent Features

Memory
short-term / episodic memory module, memory retrieval to inform decisions
Planning
dynamic planning / task decomposition, information-seeking actions
Tool Use
standardized query API, U-R-I retrieval calls
Frameworks
dynamic planning, complex reasoning (CoT), tool utilization, memory management
Is Agentic

Yes

Architectures
LLM agent (instruction-tuned families), modular agent framework
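The features listed above (memory, planning, tool use) can be wired together in a few dozen lines. The following is a minimal sketch, not the benchmark's BaseAgent: `SketchAgent`, `ShortTermMemory`, the tool names, and the task layout are all hypothetical, and `llm` stands for any callable that maps a prompt string to a reply string.

```python
from collections import deque


class ShortTermMemory:
    """Keeps the last `capacity` observations (a stand-in for a memory module)."""

    def __init__(self, capacity=20):
        self._events = deque(maxlen=capacity)

    def add(self, event):
        self._events.append(event)

    def retrieve(self, keyword=""):
        """Events containing `keyword`; an empty keyword matches everything."""
        return [e for e in self._events if keyword.lower() in e.lower()]


class SketchAgent:
    """Plan, query tools, remember, then rank. `llm` is a callable str -> str."""

    def __init__(self, llm, tools, memory=None):
        self.llm = llm
        self.tools = tools  # tool name -> callable, e.g. U-R-I retrieval calls
        self.memory = memory if memory is not None else ShortTermMemory()

    def plan(self, task):
        # Fixed decomposition; a real agent could ask the LLM to write the plan.
        steps = [("user_reviews", task["user_id"])]
        steps += [("item_info", item_id) for item_id in task["candidates"]]
        return steps

    def recommend(self, task):
        for tool_name, arg in self.plan(task):
            observation = self.tools[tool_name](arg)
            self.memory.add(f"{tool_name}({arg}): {observation}")
        prompt = "Rank candidates, best first, comma-separated.\n"
        prompt += "\n".join(self.memory.retrieve())
        ranked = [s.strip() for s in self.llm(prompt).split(",")]
        # Drop duplicates and unknown IDs, then append candidates the LLM omitted.
        ranked = [c for c in dict.fromkeys(ranked) if c in task["candidates"]]
        return ranked + [c for c in task["candidates"] if c not in ranked]
```

Passing the LLM and tools in as plain callables keeps each module swappable, so a stub model can be used in tests. Validating the reply against the candidate list guards against malformed or hallucinated item IDs.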

Risks & Boundaries

Limitations

Environment is text-only; no images or video support yet.

Current benchmark focuses on single-agent pipelines; multi-agent interactions are not evaluated.

When Not To Use

When your product relies heavily on images or video signals.

When you need to evaluate multi-agent or competitive agent setups.

Failure Modes

Poor performance on domains with sparse textual signals (low Yelp HRs reported).

High variance across datasets and LLM families; top agents may require heavy feature engineering.

Core Entities

Models

Qwen-72B-Instruct, DeepSeekv3, GPT-4o-mini, Baseline666, DummyAgent, RecHackers, Agent4Rec, BaseAgent, CoTAgent, MemoryAgent, CoTMemAgent, MF, LightGCN

Metrics

HR@N (Hit Rate@N), Accuracy

Datasets

Amazon, Goodreads, Yelp

Benchmarks

AgentRecBench