A conversational LLM agent that automates buyer and seller workflows on a C2C marketplace, cutting interaction time and automating multi‑tap

September 4, 2025 | 8 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Yineng Yan, Xidong Wang, Jin Seng Cheng, Ran Hu, Wentao Guan, Nahid Farahmand, Hengte Lin, Yue Li

Why It Matters For Business

FaMA can reduce seller and buyer time on common tasks, improve scalability of messaging and search, and provide a safer conversational interface that reduces user errors. Measured gains are promising but come from synthetic tests and short timing studies, so expect differences in production.

Summary TLDR

FaMA is a conversational assistant built on Llama-4 that turns marketplace GUI workflows into natural-language commands. It uses a short-term 'scratchpad' memory, tool calling (listings, search, messaging), and an optional RAG tool for policy/help lookup. In synthetic tests FaMA solved tasks with ~98% success and halved interaction time for bulk replies. The evaluation is synthetic and uses an LLM-based simulator, so real-world gains may vary.

Problem Statement

C2C marketplaces are full of repetitive, multi-step GUI tasks (listing creation/renewal, bulk replies, filtered search). These tasks are slow and error-prone on mobile UIs. Users need a simpler, conversational entry point that can understand natural requests and operate platform tools safely.

Main Contribution

Design and implementation of FaMA: an LLM-based conversational assistant with tool calling, scratchpad short-term memory, and a RAG help tool.

A single-step interactive ReAct-style loop that asks users to confirm each state-changing action for safety.

Automated evaluation on a synthetic 100-listing dataset showing high task success and a timing study showing up to 2x speedup on common tasks.

Key Findings

High automated task success on the synthetic evaluation.

Numbers: 98% task success rate (synthetic 100-listing eval)

Bulk replies can be much faster with the agent.

Numbers: Bulk messages: 25 s with FaMA vs 50 s manual (2x speedup)

Inventory search and renew workflows complete efficiently and often in minimal steps.

Numbers: Inventory Search: 98% success, 100% optimal; Renew Listing: 100% success; multi-step optimality >84%

Design trades off autonomy for safety via explicit confirmations.

Results

Overall Task Success Rate (automated eval)

Value: 98% success on evaluated tasks

Inventory Search success

Value: 98% success; 100% of successful attempts completed in a single optimal step

Renew Listing success

Value: 100% success in evaluation

Bulk Reply success

Value: over 96% success; >84% optimality for multi-step tasks

Interaction time (Bulk Messages Reply)

Value: 25 sec with FaMA vs 50 sec manual

Baseline: manual mobile app

Interaction time (Inventory Search)

Value: 15 sec with FaMA vs 25 sec manual

Baseline: manual mobile app
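
The speedup figures follow directly from the two timing pairs reported above. A small sketch of the arithmetic, where the function and dictionary names are illustrative and only the four timings come from the summary:

```python
def speedup(manual_seconds: float, agent_seconds: float) -> float:
    """Return how many times faster the agent is than the manual baseline."""
    if agent_seconds <= 0:
        raise ValueError("agent_seconds must be positive")
    return manual_seconds / agent_seconds


# Timings reported in the Results section, in seconds: (manual, FaMA).
timings = {
    "bulk_messages_reply": (50, 25),
    "inventory_search": (25, 15),
}

speedups = {task: speedup(manual, agent) for task, (manual, agent) in timings.items()}

for task, value in speedups.items():
    print(f"{task}: {value:.2f}x")
# bulk_messages_reply: 2.00x
# inventory_search: 1.67x
```

Only the bulk reply task reaches the headline 2x; inventory search comes out at roughly 1.67x, which is consistent with the "up to 2x" wording.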

Who Should Care

What To Try In 7 Days

Prototype a conversational entry point for one seller workflow (e.g., bulk replies) and measure time saved.

Add a scratchpad-style short-term memory to preserve multi-step state across confirmations.

Wrap three essential platform APIs (search, update listing, messaging) as callable tools for the LLM and test with synthetic scenarios.
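
One way to approach the tool-wrapping step is a small registry that maps tool names to plain functions. This is a minimal sketch under stated assumptions: the three functions, the `LISTINGS` and `OUTBOX` stores, and the `call_tool` dispatcher are hypothetical in-memory stand-ins, not the platform's actual APIs or the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Tool:
    name: str
    description: str  # text a model could read to decide when to use the tool
    func: Callable[..., Any]


# Hypothetical in-memory stand-ins for real platform data.
LISTINGS: Dict[str, Dict[str, Any]] = {
    "L1": {"title": "Road bike", "price": 200},
    "L2": {"title": "Desk lamp", "price": 15},
}
OUTBOX: List[Tuple[str, str]] = []


def search_listings(query: str) -> List[str]:
    """Return IDs of listings whose title contains the query (case-insensitive)."""
    return [lid for lid, item in LISTINGS.items() if query.lower() in item["title"].lower()]


def update_listing(listing_id: str, **fields: Any) -> Dict[str, Any]:
    """Apply field updates to one listing and return the updated record."""
    LISTINGS[listing_id].update(fields)
    return LISTINGS[listing_id]


def send_message(buyer_id: str, text: str) -> bool:
    """Queue a message to a buyer; here it only appends to an in-memory outbox."""
    OUTBOX.append((buyer_id, text))
    return True


TOOLS: Dict[str, Tool] = {
    tool.name: tool
    for tool in [
        Tool("search", "Find listings by title keyword.", search_listings),
        Tool("update_listing", "Change fields on a listing.", update_listing),
        Tool("send_message", "Send a message to a buyer.", send_message),
    ]
}


def call_tool(name: str, **kwargs: Any) -> Any:
    """Dispatch a tool call by name, rejecting names the registry does not know."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name].func(**kwargs)
```

For synthetic scenarios, a test can call `call_tool("search", query="bike")` and check the returned IDs before any model is connected.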

Agent Features

Memory

  • Scratchpad chronological Thought-Action-Observation log (short-term)
  • Ephemeral dialog history (session-based purge)
  • Listings Information Memory (title, desc, ID per session)
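
The memory items above could be sketched as a chronological log that is rendered into the prompt and cleared when the session ends. The class and method names here are assumptions made for illustration; the summary does not describe the actual data structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    thought: str
    action: str
    observation: str


@dataclass
class Scratchpad:
    """Short-term, session-scoped log of Thought-Action-Observation steps."""

    steps: List[Step] = field(default_factory=list)

    def add(self, thought: str, action: str, observation: str) -> None:
        # Append in arrival order so the log stays chronological.
        self.steps.append(Step(thought, action, observation))

    def render(self) -> str:
        """Format the log as numbered lines suitable for inclusion in a prompt."""
        lines: List[str] = []
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"Thought {i}: {step.thought}")
            lines.append(f"Action {i}: {step.action}")
            lines.append(f"Observation {i}: {step.observation}")
        return "\n".join(lines)

    def purge(self) -> None:
        """Drop all steps, as a session-based purge would."""
        self.steps.clear()


pad = Scratchpad()
pad.add("User wants to renew the bike listing.", "renew_listing(L1)", "renewed")
print(pad.render())
pad.purge()
```

Keeping the log in a single ordered list means the state needed to resume after a user confirmation is simply the rendered text of earlier steps.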

Planning

  • ReAct Thought-Action-Observation planning
  • Chain-of-Thought prompting for reasoning

Tool Use

  • Listing operation tools (create/update/renew)
  • Inventory search tool (marketplace search API)
  • Messaging tools (single and bulk)
  • RAG-as-Tool for help articles
  • ASR front-end for voice

Frameworks

  • ReAct
  • RAG
  • Tool calling
  • ASR

Is Agentic

true

Architectures

  • Single-step interactive ReAct loop with user confirmation
  • LLM core: Llama-4-Maverick-17B-128E-Instruct
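
A single-step loop with user confirmation can be illustrated as follows. This is a sketch, not the paper's implementation: `scripted_policy` is a hypothetical stand-in for the LLM, and `renew_listing`, `TOOLS`, and `run_turns` are invented names for the example.

```python
from typing import Any, Callable, Dict, List, Optional

Action = Dict[str, Any]


def scripted_policy(history: List[str]) -> Optional[Action]:
    """Hypothetical stand-in for the LLM: propose one renewal, then stop."""
    if not history:
        return {"tool": "renew_listing", "args": {"listing_id": "L1"}, "changes_state": True}
    return None


def renew_listing(listing_id: str) -> str:
    """Hypothetical state-changing tool."""
    return f"renewed {listing_id}"


TOOLS: Dict[str, Callable[..., str]] = {"renew_listing": renew_listing}


def run_turns(
    policy: Callable[[List[str]], Optional[Action]],
    confirm: Callable[[Action], bool],
    max_steps: int = 5,
) -> List[str]:
    """Run one action per step, asking for confirmation before any state change."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(history)
        if action is None:  # the policy has nothing more to do
            break
        if action["changes_state"] and not confirm(action):
            # The user declined, so record that and skip the tool call.
            history.append(f"declined {action['tool']}")
            continue
        history.append(TOOLS[action["tool"]](**action["args"]))
    return history


print(run_turns(scripted_policy, confirm=lambda action: True))   # ['renewed L1']
print(run_turns(scripted_policy, confirm=lambda action: False))  # ['declined renew_listing']
```

Passing `confirm` as a callback keeps the safety gate outside the policy, so the same loop can be driven by a real prompt in production and by an automatic approver in tests.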

Reproducibility

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Evaluation uses a synthetic 100-listing dataset and an LLM-based user simulator, which can overestimate real-world performance.
  • Both agent and simulator use the same LLM, creating potential evaluation bias.
  • Session-based ephemeral memory limits long-term personalization and persistent workflows.
  • Single-step confirmation improves safety but increases interaction overhead for users who want full automation.

When Not To Use

  • Workflows that require persistent long-term memory across sessions.
  • High-risk operations needing strict audit trails without human confirmation.
  • Environments with low API reliability or where tool calls are restricted.

Failure Modes

  • Misidentifying the target listing from ambiguous user text, especially outside session-stored listings.
  • LLM hallucinations when calling tools or synthesizing policy answers without reliable RAG grounding.
  • Degraded performance in real-world, noisy user conversations compared to synthetic simulator.

Core Entities

Models

  • Llama-4-Maverick-17B-128E-Instruct

Metrics

  • Task Success Rate
  • Task Optimality Rate
  • Interaction Time
  • Speedup
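
The first two metrics could be computed from per-episode records along these lines. The definitions are assumptions: success rate is taken over all episodes, and optimality is taken over successful episodes only (matching the "100% of successful attempts" wording in Results). The `Episode` fields and the sample data are illustrative and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    success: bool
    steps_taken: int
    optimal_steps: int  # fewest steps in which the task could be completed


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of all episodes that ended in success."""
    if not episodes:
        return 0.0
    return sum(e.success for e in episodes) / len(episodes)


def task_optimality_rate(episodes: List[Episode]) -> float:
    """Fraction of successful episodes that used no more than the optimal step count."""
    successes = [e for e in episodes if e.success]
    if not successes:
        return 0.0
    return sum(e.steps_taken <= e.optimal_steps for e in successes) / len(successes)


# Illustrative episodes only, not data from the evaluation.
episodes = [
    Episode(success=True, steps_taken=1, optimal_steps=1),
    Episode(success=True, steps_taken=3, optimal_steps=2),
    Episode(success=True, steps_taken=2, optimal_steps=2),
    Episode(success=False, steps_taken=4, optimal_steps=2),
]

print(task_success_rate(episodes))     # 0.75
print(task_optimality_rate(episodes))  # 2 of the 3 successes were optimal
```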

Datasets

  • synthetic_100_listings_dataset (LLM-generated)