A conversational LLM agent that automates buyer and seller workflows on a C2C marketplace, cutting interaction time and automating multi‑tap

Overview

Decision Snapshot: Needs Validation

Results are promising but come from synthetic datasets and an LLM-based simulator that used the same model as the agent. Field testing is required for production claims.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.80

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 3/4

Findings with evidence refs: 4/4

Results with explicit delta: 2/6

Reproducibility

Status: No open assets linked

Open source: Unknown

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Yineng Yan, Xidong Wang, Jin Seng Cheng, Ran Hu, Wentao Guan, Nahid Farahmand, Hengte Lin, Yue Li

Links

Abstract / PDF

Why It Matters For Business

FaMA can reduce seller and buyer time on common tasks, improve scalability of messaging and search, and provide a safer conversational interface that reduces user errors. Measured gains are promising but come from synthetic tests and short timing studies, so expect differences in production.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, CTO, Founder

Summary TLDR

FaMA is a conversational assistant built on Llama-4 that turns marketplace GUI workflows into natural-language commands. It uses a short-term 'scratchpad' memory, tool calling (listings, search, messaging), and an optional RAG tool for policy/help lookup. In synthetic tests FaMA solved tasks with ~98% success and halved interaction time for bulk replies. The evaluation is synthetic and uses an LLM-based simulator, so real-world gains may vary.

Problem Statement

C2C marketplaces are full of repetitive, multi-step GUI tasks (listing creation/renewal, bulk replies, filtered search). These tasks are slow and error-prone on mobile UIs. Users need a simpler, conversational entry point that can understand natural requests and operate platform tools safely.

Main Contribution

Design and implementation of FaMA: an LLM-based conversational assistant with tool calling, scratchpad short-term memory, and a RAG help tool.

A single-step interactive ReAct-style loop that asks users to confirm each state-changing action for safety.
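The confirmation gate described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Action` type, the `STATE_CHANGING` set, the stub tools, and the `confirm` callback are all hypothetical names, and the LLM that would propose each action is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical set of tools that modify marketplace state and therefore need confirmation.
STATE_CHANGING = {"update_listing", "send_message"}


@dataclass
class Action:
    tool: str
    args: dict


def run_step(action: Action,
             tools: Dict[str, Callable[..., str]],
             confirm: Callable[[Action], bool]) -> str:
    """Execute one proposed action; ask the user first if it changes state."""
    if action.tool not in tools:
        return f"error: unknown tool {action.tool!r}"
    if action.tool in STATE_CHANGING and not confirm(action):
        return "cancelled by user"
    return tools[action.tool](**action.args)


# Stub tools standing in for real platform APIs (assumed, for illustration only).
tools = {
    "search_inventory": lambda query: f"3 listings match {query!r}",
    "update_listing": lambda listing_id, price: f"listing {listing_id} now ${price}",
}

# A read-only search runs without confirmation; a declined update is not executed.
print(run_step(Action("search_inventory", {"query": "bike"}), tools, lambda a: False))
print(run_step(Action("update_listing", {"listing_id": "A1", "price": 40}), tools, lambda a: False))
```

Keeping the gate outside the model means a state-changing call cannot run unless the user approves it, whatever the model proposes.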

Key Findings

High automated task success on the synthetic evaluation.

Numbers: 98% task success rate (synthetic 100-listing eval)

Practical Use: Expect strong automation for typical marketplace workflows in controlled settings; validate on real users before rollout because the test used synthetic data and an LLM-based simulator.

Evidence Ref: Section 4.1; Figure 3

Bulk replies can be much faster with the agent.

Numbers: Bulk messages: 25s with FaMA vs 50s manual (2x speedup)

Practical Use: Deploying a messaging tool can roughly halve time for batch replies, reducing seller time on repetitive communication.

Evidence Ref: Section 4.2; Table 1
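The 2x figure follows directly from the two reported timings. The short sketch below shows the arithmetic, assuming speedup is defined as manual time divided by agent time; the function names are illustrative.

```python
def speedup(manual_seconds: float, agent_seconds: float) -> float:
    """Ratio of manual time to agent time; values above 1 mean the agent is faster."""
    return manual_seconds / agent_seconds


def time_saved_fraction(manual_seconds: float, agent_seconds: float) -> float:
    """Share of the manual time that the agent removes."""
    return 1 - agent_seconds / manual_seconds


# Reported bulk-message timings: 50s manual, 25s with FaMA.
print(speedup(50, 25))              # 2.0
print(time_saved_fraction(50, 25))  # 0.5
```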

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall Task Success Rate (automated eval)	98% success on evaluated tasks	—	—	synthetic 100-listing dataset	FaMA achieved ~98%+ overall success in automated evaluation	Section 4.1; Figure 3
Inventory Search success	98% success; 100% of successful attempts in single optimal step	—	—	synthetic 100-listing dataset	Single-step Inventory Search: 98% success and 100% optimality	Section 4.1; Figure 3
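The sketch below shows one way the two metrics in the table could be computed from per-task records. It assumes that optimality is measured only over successful tasks, which matches the "100% of successful attempts" wording above. The `Episode` fields and the toy records are hypothetical and are not the paper's data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    success: bool       # did the agent complete the task?
    steps_taken: int    # tool-calling steps the agent used
    optimal_steps: int  # fewest steps that could solve the task


def task_success_rate(episodes: List[Episode]) -> float:
    """Fraction of all tasks that were completed."""
    return sum(e.success for e in episodes) / len(episodes)


def task_optimality_rate(episodes: List[Episode]) -> float:
    """Fraction of successful tasks solved in the minimum number of steps."""
    successes = [e for e in episodes if e.success]
    if not successes:
        return 0.0
    return sum(e.steps_taken == e.optimal_steps for e in successes) / len(successes)


# Toy records for illustration only.
toy = [
    Episode(True, 1, 1),
    Episode(True, 2, 1),
    Episode(True, 1, 1),
    Episode(False, 3, 1),
]
print(task_success_rate(toy))     # 0.75
print(task_optimality_rate(toy))  # 2 of 3 successes were optimal
```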

What To Try In 7 Days

Prototype a conversational entry point for one seller workflow (e.g., bulk replies) and measure time saved.

Add a scratchpad-style short-term memory to preserve multi-step state across confirmations.

Wrap three essential platform APIs (search, update listing, messaging) as callable tools for the LLM and test with synthetic scenarios.
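As a starting point for the last item, the three APIs could be wrapped in a small registry that the model's tool calls are dispatched through. This is a sketch under stated assumptions: the `tool` decorator, the `TOOLS` registry, the function signatures, and the in-memory catalog are hypothetical stand-ins for real platform endpoints.

```python
from typing import Any, Callable, Dict

# Registry mapping tool names to callables and descriptions (hypothetical design).
TOOLS: Dict[str, Dict[str, Any]] = {}


def tool(name: str, description: str) -> Callable:
    """Decorator that registers a function as a callable tool."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"fn": fn, "description": description}
        return fn
    return register


@tool("search_inventory", "Search marketplace listings by keyword.")
def search_inventory(query: str) -> list:
    catalog = [{"id": "A1", "title": "Road bike"}, {"id": "B2", "title": "Desk lamp"}]
    return [item for item in catalog if query.lower() in item["title"].lower()]


@tool("update_listing", "Change the price of an existing listing.")
def update_listing(listing_id: str, price: float) -> dict:
    return {"id": listing_id, "price": price, "status": "updated"}


@tool("send_message", "Send the same text to one or more buyers.")
def send_message(recipient_ids: list, text: str) -> dict:
    return {"sent": len(recipient_ids), "text": text}


def call_tool(name: str, **kwargs: Any) -> Any:
    """Dispatch a tool call by name, rejecting names that were never registered."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name]["fn"](**kwargs)


print(call_tool("search_inventory", query="bike"))
```

The descriptions stored in the registry can be rendered into the prompt so the model knows which tools exist; synthetic scenarios can then exercise `call_tool` without touching production systems.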

Agent Features

Memory

Scratchpad chronological Thought-Action-Observation log (short-term)

Ephemeral dialog history (session-based purge)

Listings Information Memory (title, desc, ID per session)
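A scratchpad of this kind can be modeled as an append-only log that is rendered into the prompt and cleared when the session ends. The sketch below is illustrative only; the `Scratchpad` class and its method names are assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

ALLOWED_KINDS = ("Thought", "Action", "Observation")


@dataclass
class Scratchpad:
    """Chronological short-term log of Thought/Action/Observation entries."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, kind: str, text: str) -> None:
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown entry kind: {kind}")
        self.entries.append((kind, text))

    def render(self) -> str:
        """Serialize the log in order, ready to be appended to the next prompt."""
        return "\n".join(f"{kind}: {text}" for kind, text in self.entries)

    def purge(self) -> None:
        """Drop everything at session end so no state persists across sessions."""
        self.entries.clear()


pad = Scratchpad()
pad.add("Thought", "The user wants to renew listing A1.")
pad.add("Action", "renew_listing(listing_id='A1')")
pad.add("Observation", "Listing A1 renewed.")
print(pad.render())
pad.purge()
```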

Planning

ReAct Thought-Action-Observation planning

Chain-of-Thought prompting for reasoning

Tool Use

Listing operation tools (create/update/renew)

Inventory search tool (marketplace search API)

Messaging tools (single and bulk)

RAG-as-Tool for help articles

ASR front-end for voice

Frameworks

ReAct, RAG, Tool calling, ASR

Is Agentic

Yes

Architectures

Single-step interactive ReAct loop with user confirmation

LLM core: Llama-4-Maverick-17B-128E-Instruct

Reproducibility

Code Available: No

Data Available: No

Open Source Status: Unknown

License: Unknown

Risks & Boundaries

Limitations

Evaluation uses a synthetic 100-listing dataset and an LLM-based user simulator, which can overestimate real-world performance.

Both agent and simulator use the same LLM, creating potential evaluation bias.

When Not To Use

Workflows that require persistent long-term memory across sessions.

High-risk operations needing strict audit trails without human confirmation.

Failure Modes

Misidentifying the target listing from ambiguous user text, especially outside session-stored listings.

LLM hallucinations when calling tools or synthesizing policy answers without reliable RAG grounding.
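One common mitigation for hallucinated tool calls is to check each proposed call against the registered tool signatures before executing it. The sketch below illustrates that idea and is not a technique the paper is confirmed to use; `validate_call`, `renew_listing`, and the registry are hypothetical.

```python
import inspect
from typing import Callable, Dict, List


def validate_call(name: str, args: dict, registry: Dict[str, Callable]) -> List[str]:
    """Return a list of problems with a proposed tool call; an empty list means it is safe to run."""
    if name not in registry:
        return [f"unknown tool: {name}"]
    params = inspect.signature(registry[name]).parameters
    problems = []
    # Required parameters are those declared without a default value.
    for pname, param in params.items():
        if param.default is inspect.Parameter.empty and pname not in args:
            problems.append(f"missing argument: {pname}")
    for extra in args:
        if extra not in params:
            problems.append(f"unexpected argument: {extra}")
    return problems


# Hypothetical tool standing in for a real listing API.
def renew_listing(listing_id: str, days: int = 30) -> str:
    return f"listing {listing_id} renewed for {days} days"


registry = {"renew_listing": renew_listing}

print(validate_call("renew_listing", {"listing_id": "A1"}, registry))
print(validate_call("renew_listing", {"id": "A1"}, registry))
print(validate_call("delete_account", {}, registry))
```

Problems returned by the check can be fed back to the model as an observation so it can retry, rather than letting a malformed call reach the platform.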

Core Entities

Models

Llama-4-Maverick-17B-128E-Instruct

Metrics

Task Success Rate, Task Optimality Rate, Interaction Time, Speedup

Datasets

synthetic_100_listings_dataset (LLM-generated)
