Serverless FaaS for agentic workflows cuts latency 13×, tokens 88%, and cost 66%

January 21, 2026 (9 min read)

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

0

Authors

Varad Kulkarni, Vaibhav Jha, Nikhil Reddy, Anand Eswaran, Praveen Jayachandran, Yogesh Simmhan

Why It Matters For Business

You can host multi-step, tool-using LLM agents on serverless platforms and cut latency, token bills, and operational overhead, provided you add durable memory, S3 caching, and a careful deployment of MCP tools.

Summary TLDR

FAME is a serverless architecture that runs multi-agent LLM workflows (ReAct pattern) as AWS Lambda functions orchestrated with Step Functions. It stores agent state in DynamoDB, wraps MCP tool servers as Lambda functions, caches deterministic tool outputs in S3, and supports function fusion. On two apps (paper summarization, log analytics) FAME achieves up to 13× lower latency, up to 88% fewer input tokens, and up to 66% lower monetary cost versus stateless baselines, while improving completion rates. Results come with caveats: tested on AWS, GPT-4o-mini, and a limited workload set.

Problem Statement

Hosting multi-step, tool-using agentic workflows on VMs is costly and hard to scale. FaaS gives autoscaling and billing per request but is stateless, which breaks multi-turn agent memory and increases redundant LLM calls and tool invocations. MCP tool servers also need special deployment and can cause cold-start and billing overheads when naively hosted.

Main Contribution

FAME: a FaaS architecture that decomposes ReAct agents into Planner, Actor, Evaluator Lambda functions orchestrated with AWS Step Functions.

Automated, durable agent memory injection using DynamoDB plus an automated wrapper to deploy FastMCP servers as Lambda functions.

Optimizations: S3-backed MCP invocation cache, S3 file handling to avoid token bloat, and function fusion (consolidation) strategies; validated on two reference applications.

Key Findings

Combined agent memory plus MCP caching reduced end-to-end latency by up to 13×

Numbers: up to 13× speedup (§5.2.1 Fig.4; M+C vs baselines)

Input tokens to the LLM dropped by as much as 88%

Numbers: input tokens reduced ≈88% (§5.2.2 Fig.5, P1 example 35,646→4,536)

Monetary costs fell by up to ≈66% across evaluated runs

Numbers: costs reduced ≈66% (§5.2.3 Fig.6; M/C/M+C vs E/N)

MCP caching lowered MCP invocation latency by ≈28% and reduced tokens by ≈51%

Numbers: MCP latency ≈28% lower; token drop ≈51% (§5.3.1 Fig.7a)

Consolidating MCP servers reduced cold-start spikes and slightly improved stable MCP latency

Numbers: RS stable MCP time 10.2s (singular) → 7.9s (consolidated) (§5.3.2 Fig.7b)
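The percentages in the findings above can be checked against the raw figures quoted alongside them. The short calculation below is only a sanity check on two of those pairs (the P1 token counts and the RS stable MCP times); the helper name `percent_reduction` is introduced here for illustration. The P1 example works out to about 87.3%, which sits just under the "up to 88%" headline; the summary does not say which run produces the maximum.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


# P1 input tokens quoted in the findings: 35,646 -> 4,536
p1_tokens = percent_reduction(35_646, 4_536)
p1_ratio = 35_646 / 4_536

# RS stable MCP time quoted in the findings: 10.2 s -> 7.9 s
rs_latency = percent_reduction(10.2, 7.9)

print(f"P1 input tokens: {p1_tokens:.1f}% fewer ({p1_ratio:.1f}x smaller prompt)")
print(f"RS stable MCP latency: {rs_latency:.1f}% lower after consolidation")
```

The second figure shows why consolidation is described as a slight improvement in stable latency: roughly a fifth, compared with the order-of-magnitude end-to-end gain from memory plus caching.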

Results

End-to-end latency

Value: up to 13× reduction

Baseline: Empty/naive memory configs

LLM input tokens

Value: up to 88% reduction

Baseline: Empty/naive memory configs

Monetary cost

Value: up to ≈66% reduction

Baseline: Empty/naive memory configs

MCP invocation latency

Value: ≈28% improvement with caching

Baseline: No MCP caching

Cold-start & stable MCP latency

Value: reduced cold-start spikes; improved stable latency

Baseline: singular MCP Lambdas

Who Should Care

What To Try In 7 Days

Break a ReAct agent into Planner/Actor/Evaluator as separate functions and orchestrate with Step Functions.

Persist session agent state in DynamoDB and inject it into Planner prompts for multi-turn continuity.

Wrap deterministic MCP tools to return S3 handles and enable TTL-based caching to reduce token and latency costs.
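For the first step above, one possible starting point is a Step Functions definition that chains the three roles and loops until the Evaluator reports completion. The sketch below builds such a definition in Amazon States Language as a Python dict. The ARNs are placeholders, and the `$.done` field is an assumption about what the Evaluator function returns; neither comes from the paper. The sketch has no iteration cap, so a real deployment would likely add a counter or timeout.

```python
import json

# Placeholder ARNs (assumptions): substitute your own Lambda function ARNs.
PLANNER_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:planner"
ACTOR_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:actor"
EVALUATOR_ARN = "arn:aws:lambda:REGION:ACCOUNT_ID:function:evaluator"

state_machine = {
    "Comment": "Planner -> Actor -> Evaluator loop (illustrative sketch)",
    "StartAt": "Planner",
    "States": {
        "Planner": {"Type": "Task", "Resource": PLANNER_ARN, "Next": "Actor"},
        "Actor": {"Type": "Task", "Resource": ACTOR_ARN, "Next": "Evaluator"},
        "Evaluator": {"Type": "Task", "Resource": EVALUATOR_ARN, "Next": "IsDone"},
        "IsDone": {
            "Type": "Choice",
            # Assumes the Evaluator output carries a boolean `done` field.
            "Choices": [
                {"Variable": "$.done", "BooleanEquals": True, "Next": "Finished"}
            ],
            "Default": "Planner",  # not done yet: re-plan
        },
        "Finished": {"Type": "Succeed"},
    },
}


def transition_targets(states: dict) -> set:
    """Collect every state name referenced by Next, Default, or a Choice rule."""
    targets = set()
    for state in states.values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for rule in state.get("Choices", []):
            targets.add(rule["Next"])
    return targets


# Basic consistency check before deploying: every transition must resolve.
undefined = transition_targets(state_machine["States"]) - set(state_machine["States"])
definition = json.dumps(state_machine, indent=2)
print(definition)
```

Keeping each role as its own Task state is what allows separate memory and timeout settings per role.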

Agent Features

Memory

  • Client-side cumulative messages (naive)
  • Durable agent memory in DynamoDB (session keyed)
  • Configurable TTLs and pre-processing (summarize) suggested
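A minimal sketch of session-keyed memory with a TTL, assuming a simple list-of-messages layout. The `SessionMemory` class is a dict-backed stand-in for a DynamoDB table so that the example runs locally; the class, its method names, and the prompt template are all hypothetical and not taken from the paper.

```python
import time


class SessionMemory:
    """In-memory stand-in for a session-keyed DynamoDB table (assumption)."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self._items = {}  # session_id -> list of message records
        self._ttl = ttl_seconds

    def append(self, session_id, role, content, now=None):
        now = time.time() if now is None else now
        record = {"role": role, "content": content, "expires_at": now + self._ttl}
        self._items.setdefault(session_id, []).append(record)

    def load(self, session_id, now=None):
        # Only return records whose TTL has not elapsed.
        now = time.time() if now is None else now
        return [m for m in self._items.get(session_id, []) if m["expires_at"] > now]


def build_planner_prompt(memory, session_id, user_message, now=None):
    """Inject stored session context ahead of the new request."""
    history = memory.load(session_id, now=now)
    lines = [f"{m['role']}: {m['content']}" for m in history]
    context = "\n".join(lines) if lines else "(no prior context)"
    return f"Prior context:\n{context}\n\nNew request:\n{user_message}"


memory = SessionMemory(ttl_seconds=60)
memory.append("s1", "tool", "downloaded paper, 12 pages", now=0)

fresh = build_planner_prompt(memory, "s1", "Summarize section 2", now=10)
expired = build_planner_prompt(memory, "s1", "Summarize section 2", now=120)
print(fresh)
```

In the second call the record has outlived its TTL, so the Planner starts from an empty context. That is the trade-off behind the TTL setting: short TTLs limit stale context, long TTLs avoid repeated work.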

Planning

  • LLM-generated multi-step plans (Planner role)
  • Iterative re-planning via Evaluator feedback
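The plan, act, evaluate cycle described above can be simulated locally before any functions are deployed. In this sketch the three roles are plain Python functions and the LLM calls are replaced by fixed rules; the `AgentState` shape, the `tool:argument` step format, and the tool names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentState:
    """Shared state handed from one role to the next (hypothetical shape)."""
    goal: str
    plan: list = field(default_factory=list)
    observations: list = field(default_factory=list)
    done: bool = False
    iterations: int = 0


def planner(state: AgentState) -> AgentState:
    # Stand-in for an LLM call that produces the remaining steps.
    if not state.plan:
        state.plan = [f"lookup:{state.goal}", f"summarize:{state.goal}"]
    return state


def actor(state: AgentState, tools: dict) -> AgentState:
    # Execute the next planned step by dispatching to the named tool.
    tool_name, _, arg = state.plan.pop(0).partition(":")
    tool: Callable[[str], str] = tools[tool_name]
    state.observations.append(tool(arg))
    return state


def evaluator(state: AgentState) -> AgentState:
    # Stand-in for an LLM judgment: finished once the plan is exhausted.
    state.done = not state.plan
    return state


def run(goal: str, tools: dict, max_iterations: int = 5) -> AgentState:
    state = AgentState(goal=goal)
    while not state.done and state.iterations < max_iterations:
        state = evaluator(actor(planner(state), tools))
        state.iterations += 1
    return state


TOOLS = {
    "lookup": lambda q: f"found notes on {q}",
    "summarize": lambda q: f"summary of {q}",
}

result = run("serverless agents", TOOLS)
print(result.iterations, result.observations)
```

The `max_iterations` guard matters more once real LLM calls replace the stubs, since a model that never declares completion would otherwise loop indefinitely.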

Tool Use

  • MCP servers accessed via HTTP Lambda URLs
  • FastMCP interface for tool definitions

Frameworks

  • LangGraph
  • AWS Lambda / Step Functions
  • Anthropic MCP / FastMCP

Is Agentic

true

Architectures

  • ReAct decomposed into Planner-Actor-Evaluator
  • LangGraph-based agent graphs

Collaboration

  • Orchestration via AWS Step Functions message passing
  • Shared state passed between agent functions

Optimization Features

Token Efficiency

  • Agent memory + MCP cache led to up to 88% input token reduction
  • S3 file handling prevents passing large documents to LLM

Infra Optimization

  • Singleton vs consolidated MCP deployments examined
  • Tuning Lambda memory sizes per MCP to trade cost vs cold-starts

System Optimization

  • Automated Lambda wrapper generation for MCP servers
  • Separate configuration per agent role to avoid timeouts and overprovisioning

Inference Optimization

  • S3-backed MCP invocation cache to avoid reruns
  • Return S3 URLs instead of inlining large files
  • Function fusion (consolidate MCPs) to reduce cold starts
  • Prompt tweaks to encourage memory reuse
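The caching and handle-passing ideas in the list above can be sketched together: hash the tool name and arguments into a key, reuse the stored output while it is younger than the TTL, and hand the agent a short handle instead of the payload. `ToolCache` below is a dict-backed stand-in for an S3 bucket so the example runs locally; the key layout, class, and `fetch_paper` tool are hypothetical rather than the paper's implementation.

```python
import hashlib
import json
import time


class ToolCache:
    """Dict-backed stand-in for an S3 bucket of cached tool outputs (assumption)."""

    def __init__(self, ttl_seconds: float):
        self._objects = {}  # key -> (stored_at, payload)
        self._ttl = ttl_seconds
        self.misses = 0

    @staticmethod
    def key_for(tool: str, args: dict) -> str:
        # Canonical JSON so that argument order does not change the key.
        blob = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return "cache/" + hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def call(self, tool, args, fn, now=None):
        now = time.time() if now is None else now
        key = self.key_for(tool, args)
        entry = self._objects.get(key)
        if entry is None or now - entry[0] >= self._ttl:
            # Miss or expired entry: run the tool and store the result.
            self.misses += 1
            self._objects[key] = (now, fn(**args))
        # Return a small handle so the payload stays out of the LLM prompt.
        return {"handle": key}

    def read(self, handle: dict) -> str:
        return self._objects[handle["handle"]][1]


def fetch_paper(paper_id: str) -> str:
    # Hypothetical deterministic tool with a large output.
    return f"full text of {paper_id} " * 1000


cache = ToolCache(ttl_seconds=60)
h1 = cache.call("fetch_paper", {"paper_id": "P1"}, fetch_paper, now=0)
h2 = cache.call("fetch_paper", {"paper_id": "P1"}, fetch_paper, now=30)  # hit
h3 = cache.call("fetch_paper", {"paper_id": "P1"}, fetch_paper, now=90)  # expired

print(cache.misses, len(json.dumps(h1)), len(cache.read(h1)))
```

This only makes sense for deterministic tools. For anything whose output changes over time, the TTL becomes a correctness parameter rather than a performance knob.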

Reproducibility

Data Urls

  • ArXiv papers used (references in §4.1)
  • LogHub-style log files (Apache, Hadoop, OpenSSH) §4.1.2

Data Available

Open Source Status

  • unknown

Risks & Boundaries

Limitations

  • Experiments limited to two reference apps and a small set of MCP tools.
  • Evaluation was done on AWS Lambda/Step Functions and GPT-4o-mini; results may differ on other clouds or LLMs.
  • LLM non-determinism caused occasional failed runs despite optimizations.
  • Cache correctness depends on TTL choices; stale caches can break correctness.

When Not To Use

  • Workloads with long-running tools that exceed FaaS timeouts without async patterns.
  • Very high-throughput, GPU-bound inference where specialized inference infra is required.
  • Environments with strict data residency or no S3/DynamoDB access.

Failure Modes

  • LLM ignores injected memory and reissues redundant tool calls.
  • Misconfigured cache TTL returns stale or incorrect tool outputs.
  • Cold-start spikes for many small singleton MCP Lambdas increase latency.
  • Consolidation increases per-invocation memory cost if tools need large memory.
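The last failure mode follows from how Lambda bills compute: memory size multiplied by duration. A consolidated function must be sized for its hungriest tool, so lightweight tools pay for memory they do not need. The sketch below illustrates this with made-up tool profiles. The per-GB-second rate is an assumed figure to be checked against current AWS pricing, and the model ignores request fees and the fact that duration can itself change with memory size.

```python
# Assumed compute rate in USD per GB-second; verify against current AWS pricing.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb: float, duration_s: float,
                    price: float = PRICE_PER_GB_SECOND) -> float:
    """Compute-only cost of one invocation (request fee not included)."""
    return (memory_mb / 1024) * duration_s * price


# Hypothetical tool profiles: name -> (memory needed in MB, duration in seconds)
tools = {
    "pdf_parser": (2048, 4.0),
    "log_grep": (256, 1.0),
    "web_fetch": (512, 2.0),
}

# Singleton deployment: each tool gets a function sized to its own needs.
singleton = {name: invocation_cost(mem, dur) for name, (mem, dur) in tools.items()}

# Consolidated deployment: one function sized for the largest tool.
consolidated_mb = max(mem for mem, _ in tools.values())
consolidated = {
    name: invocation_cost(consolidated_mb, dur) for name, (_, dur) in tools.items()
}

for name in tools:
    factor = consolidated[name] / singleton[name]
    print(f"{name}: {factor:.0f}x compute cost when consolidated")
```

Under these assumptions the smallest tool costs eight times more per call after consolidation, while the largest is unaffected. Whether fewer cold starts justify that depends on how often each tool is invoked.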

Core Entities

Models

  • GPT-4o-mini

Metrics

  • End-to-End latency
  • Input tokens
  • LLM cost (¢)
  • FaaS execution cost
  • MCP latency
  • Tool call count
  • Completion / DNF rates

Datasets

  • ArXiv papers (P1-P3 examples)
  • LogHub-style logs (Apache, Hadoop, OpenSSH sample files)