Overview
Results are promising but come from synthetic datasets and an LLM-based simulator that used the same model as the agent. Field testing is required for production claims.
Citations: 0
Evidence Strength: 0.60
Confidence: 0.80
Risk Signals: 10
Trust Signals
Findings with numeric evidence: 3/4
Findings with evidence refs: 4/4
Results with explicit delta: 2/6
Reproducibility
Status: No open assets linked
Open source: Unknown
At A Glance
Cost impact: 60%
Production readiness: 60%
Novelty: 50%
Why It Matters For Business
FaMA can reduce seller and buyer time on common tasks, improve scalability of messaging and search, and provide a safer conversational interface that reduces user errors. Measured gains are promising but come from synthetic tests and short timing studies, so expect differences in production.
Who Should Care
Summary TLDR
FaMA is a conversational assistant built on Llama-4 that turns marketplace GUI workflows into natural-language commands. It combines a short-term "scratchpad" memory, tool calling (listings, search, messaging), and an optional RAG tool for policy and help lookup. In synthetic tests, FaMA solved tasks with ~98% success and halved interaction time for bulk replies. Because the evaluation is synthetic and relies on an LLM-based user simulator, real-world gains may differ.
Problem Statement
C2C marketplaces are full of repetitive, multi-step GUI tasks (listing creation/renewal, bulk replies, filtered search). These tasks are slow and error-prone on mobile UIs. Users need a simpler, conversational entry point that can understand natural requests and operate platform tools safely.
Main Contribution
Design and implementation of FaMA: an LLM-based conversational assistant with tool calling, scratchpad short-term memory, and a RAG help tool.
A single-step interactive ReAct-style loop that asks users to confirm each state-changing action for safety.
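The confirmation loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not FaMA's implementation: the `Action` type, `run_turn`, the toy tools, and the scripted `scripted_propose` callback (standing in for the LLM's next-step choice) are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    tool: str            # hypothetical tool name, e.g. "renew_listing"
    args: dict
    changes_state: bool  # True if the call would modify marketplace data


def run_turn(
    propose: Callable[[list], Optional[Action]],
    tools: dict,
    confirm: Callable[[Action], bool],
    max_steps: int = 5,
) -> list:
    """Execute one action per step; ask the user before any state change."""
    history = []
    for _ in range(max_steps):
        action = propose(history)  # stand-in for the LLM choosing the next step
        if action is None:         # planner signals that the task is done
            break
        if action.changes_state and not confirm(action):
            history.append((action.tool, "declined by user"))
            break
        result = tools[action.tool](**action.args)
        history.append((action.tool, result))
    return history


# Toy tools and a scripted planner, for illustration only.
def search_listings(query):
    return [f"{query}-1", f"{query}-2"]


def renew_listing(listing_id):
    return f"renewed {listing_id}"


def scripted_propose(history):
    steps = [
        Action("search_listings", {"query": "bike"}, changes_state=False),
        Action("renew_listing", {"listing_id": "bike-1"}, changes_state=True),
    ]
    return steps[len(history)] if len(history) < len(steps) else None


TOOLS = {"search_listings": search_listings, "renew_listing": renew_listing}

approved = run_turn(scripted_propose, TOOLS, confirm=lambda a: True)
declined = run_turn(scripted_propose, TOOLS, confirm=lambda a: False)
```

Read-only calls such as the search run without interruption, while the renewal only executes when `confirm` returns true. In a real assistant, `confirm` would surface the proposed action to the user rather than being a lambda.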
Key Findings
Automated task success reached about 98% on the synthetic 100-listing evaluation.
The agent roughly halved interaction time for bulk replies in short timing studies.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| Overall Task Success Rate (automated eval) | 98% success on evaluated tasks | — | — | synthetic 100-listing dataset | FaMA achieved ~98%+ overall success in automated evaluation | Section 4.1; Figure 3 |
| Inventory Search success | 98% success; 100% of successful attempts in single optimal step | — | — | synthetic 100-listing dataset | Single-step Inventory Search: 98% success and 100% optimality | Section 4.1; Figure 3 |
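To make the two metrics in the table concrete, here is a small sketch of how success rate and step optimality could be computed from per-task logs. It assumes that "optimality" means the share of successful attempts that used the minimum number of steps, which is how the table row reads. The `Episode` record and the sample episodes are hypothetical and are not the paper's data.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    succeeded: bool     # did the agent complete the task?
    steps_taken: int    # tool calls the agent actually made
    optimal_steps: int  # fewest tool calls that could solve the task


def success_rate(episodes):
    """Fraction of all episodes that ended in success."""
    return sum(e.succeeded for e in episodes) / len(episodes)


def optimality(episodes):
    """Fraction of successful episodes that used the minimum number of steps."""
    successes = [e for e in episodes if e.succeeded]
    if not successes:
        return 0.0
    optimal = sum(e.steps_taken == e.optimal_steps for e in successes)
    return optimal / len(successes)


# Made-up episodes for illustration only; not results from the paper.
episodes = [
    Episode(succeeded=True, steps_taken=1, optimal_steps=1),
    Episode(succeeded=True, steps_taken=1, optimal_steps=1),
    Episode(succeeded=True, steps_taken=2, optimal_steps=1),
    Episode(succeeded=False, steps_taken=3, optimal_steps=1),
]

rate = success_rate(episodes)
opt = optimality(episodes)
```

Because optimality is computed over successful attempts only, a system can report 100% optimality alongside a success rate below 100%, as in the Inventory Search row.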
What To Try In 7 Days
Prototype a conversational entry point for one seller workflow (e.g., bulk replies) and measure time saved.
Add a scratchpad-style short-term memory to preserve multi-step state across confirmations.
Wrap three essential platform APIs (search, update listing, messaging) as callable tools for the LLM and test with synthetic scenarios.
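A starting point for the last two items might look like the sketch below: a tiny tool registry plus a session-scoped scratchpad. Everything here is an assumption made for illustration. The `tool` decorator, `Scratchpad`, `call_tool`, and the in-memory `LISTINGS` and `OUTBOX` stores are hypothetical stand-ins for real platform APIs, and no LLM is involved.

```python
from dataclasses import dataclass, field

TOOLS = {}  # tool name -> callable; a model would pick from these names


def tool(fn):
    """Register a function so it can be called by name."""
    TOOLS[fn.__name__] = fn
    return fn


@dataclass
class Scratchpad:
    """Short-term, per-session notes; nothing persists across sessions."""
    notes: dict = field(default_factory=dict)

    def remember(self, key, value):
        self.notes[key] = value

    def recall(self, key, default=None):
        return self.notes.get(key, default)


# In-memory stand-ins for platform data (hypothetical).
LISTINGS = {
    "L1": {"title": "Road bike", "price": 200},
    "L2": {"title": "Desk lamp", "price": 15},
}
OUTBOX = []


@tool
def search(query):
    """Return IDs of listings whose title contains the query."""
    return [lid for lid, item in LISTINGS.items()
            if query.lower() in item["title"].lower()]


@tool
def update_listing(listing_id, **changes):
    """Apply field changes to a listing and return the updated record."""
    LISTINGS[listing_id].update(changes)
    return LISTINGS[listing_id]


@tool
def send_message(buyer, text):
    """Queue a message and return the outbox size."""
    OUTBOX.append((buyer, text))
    return len(OUTBOX)


def call_tool(name, **kwargs):
    """Dispatch a tool call by name, rejecting unknown tools."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)


# Synthetic scenario: find a listing, keep it in the scratchpad, then update it.
pad = Scratchpad()
pad.remember("matches", call_tool("search", query="bike"))
target = pad.recall("matches")[0]
updated = call_tool("update_listing", listing_id=target, price=180)
```

Keeping the search result in the scratchpad lets a later step, such as a price update issued after a user confirmation, refer back to the same listing without repeating the search.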
Agent Features
Memory
Planning
Tool Use
Frameworks
Is Agentic
Yes
Architectures
Risks & Boundaries
Limitations
Evaluation uses a synthetic 100-listing dataset and an LLM-based user simulator, which can overestimate real-world performance.
Both agent and simulator use the same LLM, creating potential evaluation bias.
When Not To Use
Workflows that require persistent long-term memory across sessions.
High-risk operations needing strict audit trails without human confirmation.
Failure Modes
Misidentifying the target listing from ambiguous user text, especially outside session-stored listings.
LLM hallucinations when calling tools or synthesizing policy answers without reliable RAG grounding.

