Overview
Production Readiness
0.7
Novelty Score
0.5
Cost Impact Score
0.8
Citation Count
0
Why It Matters For Business
You can host multi-step, tool-using LLM agents on serverless platforms to cut latency, token bills, and operational overhead, provided you add durable memory, S3 caching, and a careful deployment of MCP tools.
Summary TLDR
FAME is a serverless architecture that runs multi-agent LLM workflows (ReAct pattern) as AWS Lambda functions orchestrated with Step Functions. It stores agent state in DynamoDB, wraps MCP tool servers as Lambda functions, caches deterministic tool outputs in S3, and supports function fusion. On two applications (paper summarization and log analytics), FAME achieves up to 13× lower latency, up to 88% fewer input tokens, and up to 66% lower monetary cost versus stateless baselines, while improving completion rates. The results come with caveats: they were obtained only on AWS, with GPT-4o-mini, and on a limited workload set.
Problem Statement
Hosting multi-step, tool-using agentic workflows on VMs is costly and hard to scale. FaaS gives autoscaling and billing per request but is stateless, which breaks multi-turn agent memory and increases redundant LLM calls and tool invocations. MCP tool servers also need special deployment and can cause cold-start and billing overheads when naively hosted.
Main Contribution
FAME: a FaaS architecture that decomposes ReAct agents into Planner, Actor, Evaluator Lambda functions orchestrated with AWS Step Functions.
Automated, durable agent memory injection using DynamoDB plus an automated wrapper to deploy FastMCP servers as Lambda functions.
Optimizations: S3-backed MCP invocation cache, S3 file handling to avoid token bloat, and function fusion (consolidation) strategies; validated on two reference applications.
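The Planner, Actor, Evaluator decomposition can be sketched in plain Python. This is a minimal illustration, not FAME's code: each role is an ordinary function standing in for a Lambda function, the `for` loop stands in for the Step Functions orchestration, and the `AgentState` fields, the `step:argument` plan format, and the two toy tools are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """Shared state handed from one role to the next (hypothetical shape)."""
    goal: str
    plan: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)
    done: bool = False


def planner(state: AgentState) -> AgentState:
    # Stand-in for an LLM call that produces a multi-step plan.
    if not state.plan:
        state.plan = [f"fetch:{state.goal}", f"summarize:{state.goal}"]
    return state


def actor(state: AgentState, tools: Dict[str, Callable[[str], str]]) -> AgentState:
    # Execute only the next step of the plan by calling the named tool.
    step = state.plan.pop(0)
    tool_name, _, argument = step.partition(":")
    state.observations.append(tools[tool_name](argument))
    return state


def evaluator(state: AgentState) -> AgentState:
    # Stand-in for an LLM judgment; here the run is done when the plan is empty.
    state.done = not state.plan
    return state


def run_agent(goal: str, tools: Dict[str, Callable[[str], str]],
              max_iterations: int = 5) -> AgentState:
    state = AgentState(goal=goal)
    for _ in range(max_iterations):  # bounded loop instead of open-ended re-planning
        state = evaluator(actor(planner(state), tools))
        if state.done:
            break
    return state


# Toy tools, purely illustrative.
TOOLS = {
    "fetch": lambda name: f"text of {name}",
    "summarize": lambda name: f"summary of {name}",
}
result = run_agent("paper-1", TOOLS)
```

Keeping the roles as separate functions is what allows each one to receive its own timeout and memory configuration once deployed.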
Key Findings
Combined agent memory plus MCP caching reduced end-to-end latency by up to 13×
Input tokens to the LLM dropped by as much as 88%
Monetary cost fell by up to about 66% across evaluated runs
MCP caching lowered MCP invocation latency by ≈28% and reduced tokens by ≈51%
Consolidating MCP servers reduced cold-start spikes and slightly improved stable MCP latency
Results
End-to-end latency
LLM input tokens
Monetary cost
MCP invocation latency
Cold-start & stable MCP latency
Who Should Care
What To Try In 7 Days
Break a ReAct agent into Planner/Actor/Evaluator as separate functions and orchestrate with Step Functions.
Persist session agent state in DynamoDB and inject it into Planner prompts for multi-turn continuity.
Wrap deterministic MCP tools to return S3 handles and enable TTL-based caching to reduce token and latency costs.
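The third step, returning handles from a TTL-based cache, might look like the sketch below. It is an assumption-laden illustration: an in-memory dictionary stands in for the S3 bucket, and the `ToolCache` class, the `mcp-cache/` key prefix, and the injectable `clock` are hypothetical names, not part of FAME or any MCP library. The point is that the caller receives a short handle rather than the full tool output, and that an identical call within the TTL does not rerun the tool.

```python
import hashlib
import json
import time


class ToolCache:
    """In-memory stand-in for an S3-backed cache of deterministic tool outputs."""

    def __init__(self, ttl_seconds, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._objects = {}          # key -> (stored_at, payload)

    @staticmethod
    def key_for(tool, args):
        # Canonical JSON so that argument order does not change the key.
        canonical = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return "mcp-cache/" + hashlib.sha256(canonical.encode()).hexdigest()

    def call(self, tool, args, fn):
        """Run fn(**args) unless a fresh entry exists; return a handle, not the payload."""
        key = self.key_for(tool, args)
        now = self.clock()
        entry = self._objects.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return {"handle": key, "cache_hit": True}
        self._objects[key] = (now, fn(**args))
        return {"handle": key, "cache_hit": False}

    def read(self, handle):
        # A downstream tool resolves the handle only when it needs the content.
        return self._objects[handle][1]
```

Only the handle needs to pass through the LLM context; the payload is read by whichever tool consumes it. The TTL should be chosen per tool, since an entry that outlives its source data will return stale output.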
Agent Features
Memory
- Client-side cumulative messages (naive)
- Durable agent memory in DynamoDB (session keyed)
- Configurable TTLs and memory pre-processing (e.g., summarization) are suggested
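Session-keyed memory and its injection into the Planner prompt can be sketched as follows. This is an illustration under stated assumptions: a dictionary stands in for the DynamoDB table, and `SessionMemory`, `build_planner_prompt`, the `last_n` window, and the prompt wording are hypothetical, not taken from FAME.

```python
from collections import defaultdict


class SessionMemory:
    """In-memory stand-in for a DynamoDB table keyed by session id."""

    def __init__(self):
        self._items = defaultdict(list)

    def append(self, session_id, role, content):
        self._items[session_id].append({"role": role, "content": content})

    def load(self, session_id, last_n=None):
        items = self._items.get(session_id, [])
        # Return only the most recent turns when a window is requested.
        return list(items[-last_n:]) if last_n else list(items)


def build_planner_prompt(memory, session_id, user_message, last_n=10):
    """Prepend prior turns and tool results so the Planner can reuse them."""
    history = memory.load(session_id, last_n)
    lines = [f"{item['role']}: {item['content']}" for item in history]
    memory_block = "\n".join(lines) if lines else "(no prior turns)"
    return (
        "Previous turns and tool results (reuse them instead of re-running tools):\n"
        f"{memory_block}\n\n"
        f"Current request: {user_message}"
    )
```

Because the history lives on the server side and is keyed by session, the client does not have to resend the cumulative message list on every turn. Bounding the window with `last_n`, or summarizing older turns, keeps the injected block from growing without limit.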
Planning
- LLM-generated multi-step plans (Planner role)
- Iterative re-planning via Evaluator feedback
Tool Use
- MCP servers accessed via HTTP Lambda URLs
- FastMCP interface for tool definitions
Frameworks
- LangGraph
- AWS Lambda / Step Functions
- Anthropic MCP / FastMCP
Is Agentic
true
Architectures
- ReAct decomposed into Planner-Actor-Evaluator
- LangGraph-based agent graphs
Collaboration
- Orchestration via AWS Step Functions message passing
- Shared state passed between agent functions
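The orchestration can be expressed as an Amazon States Language definition, built here as a Python dictionary. This is a sketch, not FAME's actual state machine: the state names, the `$.done` flag in the shared state, and the placeholder ARNs are assumptions.

```python
import json


def build_state_machine(planner_arn, actor_arn, evaluator_arn):
    """Return an Amazon States Language definition for a Planner-Actor-Evaluator loop."""
    return {
        "Comment": "Planner -> Actor -> Evaluator, repeated until the Evaluator sets done",
        "StartAt": "Planner",
        "States": {
            "Planner": {"Type": "Task", "Resource": planner_arn, "Next": "Actor"},
            "Actor": {"Type": "Task", "Resource": actor_arn, "Next": "Evaluator"},
            "Evaluator": {"Type": "Task", "Resource": evaluator_arn, "Next": "IsDone"},
            "IsDone": {
                "Type": "Choice",
                # Assumes the Evaluator writes a boolean `done` field into the state.
                "Choices": [
                    {"Variable": "$.done", "BooleanEquals": True, "Next": "Finished"}
                ],
                "Default": "Planner",
            },
            "Finished": {"Type": "Succeed"},
        },
    }


# Placeholder ARNs; substitute real function ARNs before deploying.
definition = build_state_machine(
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:planner",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:actor",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:evaluator",
)
definition_json = json.dumps(definition, indent=2)
```

Each Task state receives the previous state's output as its input, which is how the shared state travels between roles. As written, the loop has no iteration cap; a real deployment would likely track a counter in the state and add a second Choice rule to stop after a fixed number of rounds.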
Optimization Features
Token Efficiency
- Agent memory + MCP cache led to up to 88% input token reduction
- S3 file handling prevents passing large documents to LLM
Infra Optimization
- Singleton vs consolidated MCP deployments examined
- Tuning Lambda memory sizes per MCP to trade cost vs cold-starts
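The memory-size trade-off follows from how Lambda bills compute: duration multiplied by configured memory, in GB-seconds. The arithmetic below is a rough sketch with made-up workload numbers. The price constant is an assumption to be checked against current AWS pricing, and per-request fees and cold-start effects are deliberately left out.

```python
def lambda_compute_cost(memory_mb, duration_ms, invocations, price_per_gb_second):
    """Compute cost = configured GB x billed seconds x invocations x unit price."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000) * invocations
    return gb_seconds * price_per_gb_second


# Assumed unit price in USD per GB-second; verify against current pricing.
PRICE = 0.0000166667

# Hypothetical workload: a light tool (256 MB) and a heavy tool (2048 MB),
# each invoked 1000 times at 200 ms per call.
singleton_cost = (
    lambda_compute_cost(256, 200, 1000, PRICE)
    + lambda_compute_cost(2048, 200, 1000, PRICE)
)

# Consolidated: one function sized for the heavy tool serves all 2000 calls.
consolidated_cost = lambda_compute_cost(2048, 200, 2000, PRICE)
```

In this hypothetical, consolidation costs more in compute because the light tool's calls are billed at the heavy tool's memory size. Whether that is worth paying depends on how much cold-start latency the consolidation removes, which this arithmetic does not capture.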
System Optimization
- Automated Lambda wrapper generation for MCP servers
- Separate configuration per agent role to avoid timeouts and overprovisioning
Inference Optimization
- S3-backed MCP invocation cache to avoid reruns
- Return S3 URLs instead of inlining large files
- Function fusion (consolidate MCPs) to reduce cold starts
- Prompt tweaks to encourage memory reuse
Reproducibility
Data Urls
- ArXiv papers used (references in §4.1)
- LogHub-style log files (Apache, Hadoop, OpenSSH; §4.1.2)
Data Available
Open Source Status
- unknown
Risks & Boundaries
Limitations
- Experiments limited to two reference apps and a small set of MCP tools.
- Evaluation was run on AWS Lambda/Step Functions with GPT-4o-mini; results may differ on other clouds or LLMs.
- LLM non-determinism caused occasional failed runs despite optimizations.
- Cache validity depends on TTL choices; stale entries can produce incorrect results.
When Not To Use
- Workloads with long-running tools that exceed FaaS timeouts without async patterns.
- Very high-throughput, GPU-bound inference where specialized inference infra is required.
- Environments with strict data residency or no S3/DynamoDB access.
Failure Modes
- LLM ignores injected memory and reissues redundant tool calls.
- Misconfigured cache TTL returns stale or incorrect tool outputs.
- Cold-start spikes for many small singleton MCP Lambdas increase latency.
- Consolidation increases per-invocation memory cost if tools need large memory.
Core Entities
Models
- GPT-4o-mini
Metrics
- End-to-End latency
- Input tokens
- LLM cost (¢)
- FaaS execution cost
- MCP latency
- Tool call count
- Completion / DNF rates
Datasets
- ArXiv papers (P1-P3 examples)
- LogHub-style logs (Apache, Hadoop, OpenSSH sample files)

