AgentRecBench: first public benchmark and simulator for LLM-based agentic recommender systems

May 26, 2025 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

0

Authors

Yu Shang, Peijie Liu, Yuwei Yan, Zijing Wu, Leheng Sheng, Yuanqing Yu, Chumeng Jiang, An Zhang, Fengli Xu, Yu Wang, Min Zhang, Yong Li

Links

Abstract / PDF

Why It Matters For Business

Agentic, LLM-based recommender pipelines with platform-aware feature extraction can deliver large gains in top-N accuracy on text-rich platforms, and the public benchmark helps measure real-world improvements quickly.

Summary TLDR

This paper introduces AgentRecBench: a public text-based simulator, a merged dataset (Yelp, Goodreads, Amazon), and a modular agent framework for evaluating LLM-powered recommender agents. It defines three scenarios (classic, evolving-interest, and cold-start) and compares 10+ agents against traditional baselines. Strong agent designs (e.g., Baseline666) beat MF/LightGCN on many tasks. The benchmark was exercised in an open challenge (295 teams, 1,400+ submissions) and is available with a public leaderboard.

Problem Statement

There is no standard way to test LLM-driven, agentic recommender systems across practical challenges like time-varying interests and cold-starts. That gap makes it hard to compare designs, reproduce results, and give concrete engineering guidance.

Main Contribution

A textual interaction simulator that merges Yelp, Goodreads, and Amazon into a unified User-Review-Item (U-R-I) network with a standardized query API.

Three evaluation scenarios: classic recommendation, evolving-interest (long/short windows), and user/item cold-start.

A modular agent framework with four core modules: planning, reasoning, tool use, and memory management.

A benchmark comparison of 10+ methods and a public leaderboard validated through the AgentSociety Challenge.
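As a rough illustration of how the evolving-interest and cold-start scenarios could be carved out of an interaction log, here is a minimal sketch. It is not the benchmark's own code: the `Interaction` record, the helper names, the 7-day and 90-day windows, and the `max_history` threshold are all assumptions chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record of one user-item interaction."""
    user_id: str
    item_id: str
    timestamp: datetime


def recent_window(interactions, now, days):
    """Keep only interactions from the last `days` days up to `now`."""
    cutoff = now - timedelta(days=days)
    return [x for x in interactions if cutoff <= x.timestamp <= now]


def cold_start_users(interactions, max_history=1):
    """Users with at most `max_history` interactions (threshold is an assumption)."""
    counts = Counter(x.user_id for x in interactions)
    return {user for user, n in counts.items() if n <= max_history}


now = datetime(2025, 1, 31)
log = [
    Interaction("u1", "i1", datetime(2024, 11, 15)),
    Interaction("u1", "i2", datetime(2025, 1, 10)),
    Interaction("u1", "i3", datetime(2025, 1, 29)),
    Interaction("u2", "i1", datetime(2025, 1, 30)),
]

short_view = recent_window(log, now, days=7)   # short-window history (assumed length)
long_view = recent_window(log, now, days=90)   # long-window history (assumed length)
new_users = cold_start_users(log)              # candidates for the user cold-start split
```

The same filtering idea applies to item cold-start by counting on `item_id` instead of `user_id`.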

Key Findings

Well-engineered agentic systems substantially outperform classical baselines on the benchmarked tasks.

Numbers: Baseline666 HR@N up to 69.0% vs. MF/LightGCN 15.0% (Amazon, classic)

Agent designs that extract platform-specific item and review features drive the best results.

Numbers: Top agents (Baseline666/DummyAgent/RecHackers) score 44–71% HR@N on many Amazon/Goodreads splits vs. 15% baseline

Agentic systems remain stronger than classical baselines under evolving interests and cold-start conditions, but performance drops overall in data-sparse settings.

Numbers: Cold-start: Baseline666 ~48.7% (Amazon user) vs. MF 15.0%; evolving-interest short/long windows: Baseline666 50.6%/55.3% vs. MF 38.4%

Community validation shows practical utility: the benchmark spurred measurable improvements in solutions.

Numbers: AgentSociety Challenge: 295 teams, 1,400+ submissions; 20.3% dev improvement and 15.9% further final gain

Results

HR@N (Top-N hit rate)

Value: Baseline666 up to 69.0%

Baseline: MF / LightGCN 15.0%

HR@N (cold-start user)

Value: Baseline666 48.7%

Baseline: MF 15.0%

Community improvement (challenge)

Value: 20.3% (Dev) then +15.9% (Final)

Baseline: initial challenge baselines

Who Should Care

What To Try In 7 Days

Run the AgentRecBench simulator on a small sample and compare a simple agent vs LightGCN using HR@1/3/5.

Prototype item- and review-side feature extraction for your platform and plug features into a BaseAgent workflow.

Submit a baseline to the public leaderboard or mirror its evaluation to track incremental gains.
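To mirror the HR@1/3/5 comparison locally, a hit-rate helper is enough to get started. The sketch below assumes one held-out target item per user and a ranked candidate list per user. The function name and toy data are illustrative and are not taken from the benchmark's code.

```python
def hit_rate_at_n(ranked_lists, targets, n):
    """Fraction of users whose held-out target appears in their top-n ranking."""
    if not targets:
        return 0.0
    hits = sum(
        1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:n]
    )
    return hits / len(targets)


# Toy data: one ranked list and one held-out target per user.
rankings = [
    ["a", "b", "c", "d", "e"],
    ["b", "a", "c", "d", "e"],
    ["c", "d", "e", "a", "b"],
    ["d", "e", "b", "c", "a"],
]
targets = ["a", "a", "a", "z"]

scores = {n: hit_rate_at_n(rankings, targets, n) for n in (1, 3, 5)}
print(scores)  # {1: 0.25, 3: 0.5, 5: 0.75}
```

Running the same function over the outputs of a simple agent and of LightGCN on an identical candidate set keeps the comparison like-for-like.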

Agent Features

Memory

  • short-term / episodic memory module
  • memory retrieval to inform decisions

Planning

  • dynamic planning / task decomposition
  • information-seeking actions

Tool Use

  • standardized query API
  • U-R-I retrieval calls
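
One way to picture a User-Review-Item network behind a query API is a review store indexed from both the user side and the item side. This is a minimal in-memory sketch. The class `URINetwork`, the `Review` record, and all method names are hypothetical and should not be read as the simulator's actual interface.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    """Hypothetical review record linking one user to one item."""
    review_id: str
    user_id: str
    item_id: str
    text: str


class URINetwork:
    """Toy User-Review-Item store with lookups from either side."""

    def __init__(self):
        self._reviews = {}
        self._by_user = defaultdict(list)
        self._by_item = defaultdict(list)

    def add_review(self, review):
        self._reviews[review.review_id] = review
        self._by_user[review.user_id].append(review.review_id)
        self._by_item[review.item_id].append(review.review_id)

    def reviews_by_user(self, user_id):
        return [self._reviews[r] for r in self._by_user.get(user_id, [])]

    def reviews_for_item(self, item_id):
        return [self._reviews[r] for r in self._by_item.get(item_id, [])]

    def items_of_user(self, user_id):
        return [r.item_id for r in self.reviews_by_user(user_id)]


net = URINetwork()
net.add_review(Review("r1", "u1", "i1", "Great pacing, loved the ending."))
net.add_review(Review("r2", "u1", "i2", "Too long for my taste."))
net.add_review(Review("r3", "u2", "i1", "Solid read."))

history = net.items_of_user("u1")          # items u1 has reviewed
item_reviews = net.reviews_for_item("i1")  # reviews attached to item i1
```

An agent's tool-use step would call lookups like these to gather user history and item evidence before ranking.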

Frameworks

  • dynamic planning
  • complex reasoning (CoT)
  • tool utilization
  • memory management
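
The four modules can be wired together in a few dozen lines. The skeleton below is a sketch under stated assumptions, not the paper's framework: `SketchAgent`, `plan`, `Memory`, and the tool names are hypothetical, planning is a fixed step list, and the reasoning step is a keyword-overlap stand-in where a real agent would call an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Memory management: append-only notes with recency-based recall."""
    notes: list = field(default_factory=list)

    def write(self, note):
        self.notes.append(note)

    def read(self, k=3):
        return self.notes[-k:]


def plan(task):
    """Planning: a fixed decomposition (a real planner could be LLM-driven)."""
    return ["fetch_user_history", "fetch_candidates", "rank"]


class SketchAgent:
    def __init__(self, tools, reason):
        self.tools = tools      # tool use: step name -> callable(context)
        self.reason = reason    # reasoning: callable(context, recalled_notes)
        self.memory = Memory()

    def run(self, user_id, candidate_ids):
        context = {"user_id": user_id, "candidates": candidate_ids}
        for step in plan(context):
            if step == "rank":
                ranking = self.reason(context, self.memory.read())
                self.memory.write(f"ranked {len(ranking)} items for {user_id}")
                return ranking
            context[step] = self.tools[step](context)
            self.memory.write(f"{step}: done")
        return []


def overlap_reasoner(context, recalled_notes):
    """Stand-in for LLM reasoning: rank by word overlap with the user's reviews."""
    liked = set(" ".join(context["fetch_user_history"]).lower().split())
    descriptions = context["fetch_candidates"]

    def score(item_id):
        return len(liked & set(descriptions[item_id].lower().split()))

    return sorted(context["candidates"], key=score, reverse=True)


# Stub tools returning canned data; a real setup would query the simulator.
tools = {
    "fetch_user_history": lambda ctx: ["loved this fantasy epic", "great dragon story"],
    "fetch_candidates": lambda ctx: {
        "i1": "a cookbook of pasta",
        "i2": "fantasy dragon saga",
        "i3": "epic space opera",
    },
}

agent = SketchAgent(tools, overlap_reasoner)
ranking = agent.run("u1", ["i1", "i2", "i3"])
print(ranking)  # ['i2', 'i3', 'i1']
```

Keeping each module behind a plain callable makes it easy to swap the overlap heuristic for an LLM prompt, or the fixed plan for a dynamic one, without touching the control loop.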

Is Agentic

true

Architectures

  • LLM agent (instruction-tuned families)
  • modular agent framework

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Environment is text-only; no images or video support yet.
  • Current benchmark focuses on single-agent pipelines; multi-agent interactions are not evaluated.
  • Paper emphasizes agentic methods; more traditional deep-learning baselines could be expanded.

When Not To Use

  • When your product relies heavily on images or video signals.
  • When you need to evaluate multi-agent or competitive agent setups.
  • For offline-only systems where agentic information-seeking is unavailable.

Failure Modes

  • Poor performance on domains with sparse textual signals (low HR@N reported on Yelp).
  • High variance across datasets and LLM families; top agents may require heavy feature engineering.
  • Agentic systems still degrade under extreme cold-start or extremely short interaction windows.

Core Entities

Models

  • Qwen-72B-Instruct
  • DeepSeek-V3
  • GPT-4o-mini
  • Baseline666
  • DummyAgent
  • RecHackers
  • Agent4Rec
  • BaseAgent
  • CoTAgent
  • MemoryAgent
  • CoTMemAgent
  • MF
  • LightGCN

Metrics

  • HR@N (Hit Rate@N)
  • Accuracy

Datasets

  • Amazon
  • Goodreads
  • Yelp

Benchmarks

  • AgentRecBench