An LLM trading agent that uses working + layered long-term memory and a dynamic trader profile to beat standard baselines on backtests

Overview

Decision Snapshot: Needs Validation

The paper backs its claims with multi-stock backtests and ablations, but the results are limited to historical backtests on specific tickers and rely on commercial LLMs; live-trading safety and transaction costs are not evaluated.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Denghui Zhang, Rong Liu, Jordan W. Suchow, Khaldoun Khashanah

Links

Abstract / PDF / Code

Why It Matters For Business

FINMEM shows LLM agents with structured, time-aware memory can produce better risk-adjusted returns in backtests while using shorter training histories—helpful for trading newer stocks or fast deployment.

Who Should Care

CTO, Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

FINMEM is an LLM-driven single-stock trading agent that adds a human-like memory system (working memory plus shallow/intermediate/deep long-term layers) and a dynamic character (three risk profiles, including self-adaptive). On historical backtests across five stocks, FINMEM (with GPT-4 / GPT-4-Turbo) produced substantially higher cumulative returns and Sharpe Ratios than Buy-and-Hold, several DRL agents, and two other LLM agents. Key knobs that change results are the backbone LLM, the working-memory retrieval size (TopK), and the risk-inclination profile.
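The self-adaptive profile can be pictured as a rule that swaps the risk text injected into the prompt based on recent performance. The sketch below is illustrative only: the three-day lookback, the zero threshold, the prompt strings, and the name `select_risk_profile` are all assumptions, not details given in this summary.

```python
from typing import Sequence

# Hypothetical prompt fragments; the wording FINMEM actually uses is not given here.
RISK_PROMPTS = {
    "risk_seeking": "You are a risk-seeking trader who accepts volatility for upside.",
    "risk_averse": "You are a risk-averse trader who prioritizes capital preservation.",
}


def select_risk_profile(daily_returns: Sequence[float], lookback: int = 3) -> str:
    """Pick a profile from the compounded return over the last `lookback` days.

    Assumption: a negative recent return switches the agent to risk-averse;
    otherwise it stays risk-seeking.
    """
    growth = 1.0
    for r in daily_returns[-lookback:]:
        growth *= 1.0 + r
    return "risk_averse" if growth - 1.0 < 0 else "risk_seeking"


# The selected key would index into RISK_PROMPTS when building the daily prompt.
profile = select_risk_profile([0.01, -0.02, -0.03])
```

A fixed-profile agent would simply skip the selection step and always use one of the two prompt fragments, which makes the two variants easy to compare in a backtest.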

Problem Statement

Existing trading agents either lack interpretability (many DRL systems) or lack structured memory and time-aware handling of financial signals (existing LLM agents treat incoming data indiscriminately). Traders and automated systems need an agent that (1) remembers and weights events by their time-sensitivity and importance, (2) adapts risk stance dynamically, and (3) can learn from a short, multi-source historical window.

Main Contribution

FINMEM architecture: profiling (dynamic character + risk profiles), working memory, and layered long-term memory (shallow/intermediate/deep) tailored for finance.

Memory scoring and promotion: recency, relevancy (embeddings), and importance with layer-specific decay and promotion rules.
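A minimal sketch of how recency, relevancy, and importance might be combined, and how an event might be promoted between layers. The decay constants, thresholds, equal weighting, and the reset of recency on promotion are assumptions chosen for illustration; the paper's actual values and rules may differ.

```python
import math
from dataclasses import dataclass

# Assumed layer-specific decay constants in days (illustrative values only).
DECAY_DAYS = {"shallow": 14.0, "intermediate": 90.0, "deep": 365.0}
# Assumed promotion path and importance thresholds.
NEXT_LAYER = {"shallow": "intermediate", "intermediate": "deep"}
PROMOTE_AT = {"shallow": 60.0, "intermediate": 80.0}


@dataclass
class MemoryEvent:
    text: str
    layer: str
    age_days: float
    importance: float  # assumed scale 0..100
    relevancy: float   # cosine similarity to the query, assumed in [0, 1]


def recency(event: MemoryEvent) -> float:
    """Exponential decay; deeper layers decay more slowly."""
    return math.exp(-event.age_days / DECAY_DAYS[event.layer])


def score(event: MemoryEvent) -> float:
    """Equal-weight sum of the three components (the weighting is an assumption)."""
    return recency(event) + event.relevancy + min(event.importance / 100.0, 1.0)


def maybe_promote(event: MemoryEvent) -> MemoryEvent:
    """Move an event one layer deeper once its importance crosses the threshold."""
    threshold = PROMOTE_AT.get(event.layer)
    if threshold is not None and event.importance >= threshold:
        event.layer = NEXT_LAYER[event.layer]
        event.age_days = 0.0  # assumption: promotion resets recency
    return event
```

Ranking each layer's events by `score` and keeping the top K is one way to fill working memory for a given day.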

Key Findings

FINMEM achieved the highest backtested cumulative return and risk-adjusted performance across tested stocks.

Numbers: TSLA cumulative return = 61.78%, Sharpe = 2.6789 (Table 2)

Practical Use: Use FINMEM-style memory + LLM backbone to improve single-stock trading returns in backtests; expect higher returns per unit risk on similar historical data when using GPT-4-family models.

Evidence Ref: Table 2

FINMEM needs substantially less historical training time than DRL agents to reach strong performance.

Numbers: With 6 months of training data (Aug 2021–Feb 2022), FINMEM produced the top cumulative returns during the test period (Figure 7)

Practical Use: For new IPOs or stocks with short histories, consider FINMEM-style LLM agents over DRL, since they can learn from shorter multi-source windows.

Evidence Ref: Section 5.2; Figure 7

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Cumulative Return (TSLA, testing)	61.7758%	Buy-and-Hold -18.6312%	+80.4070 pp	TSLA (Oct 06, 2022 – Apr 10, 2023)	Table 2; reported average over 5 trials	Table 2
Sharpe Ratio (TSLA, testing)	2.6789	Buy-and-Hold -0.5410	+3.2199	TSLA	Table 2 (average over trials)	Table 2
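The Delta column is the FINMEM value minus the Buy-and-Hold baseline. The sketch below reproduces that arithmetic from the reported figures and shows one common way to compute the two metrics from daily returns. The compounding convention, the 252-day annualization, and the zero risk-free rate are assumptions; the paper's exact definitions may differ.

```python
import math
from statistics import mean, stdev


def cumulative_return(daily_returns):
    """Compounded return over the period as a fraction (assumes simple daily returns)."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio; needs at least two returns with nonzero spread."""
    excess = [r - risk_free_daily for r in daily_returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


# Reproduce the Delta column from the reported TSLA figures.
finmem_cr, buy_hold_cr = 61.7758, -18.6312  # cumulative return, percent
finmem_sr, buy_hold_sr = 2.6789, -0.5410    # Sharpe ratio
delta_cr = round(finmem_cr - buy_hold_cr, 4)  # difference in percentage points
delta_sr = round(finmem_sr - buy_hold_sr, 4)
```

Because the cumulative-return delta is a difference of two percentages, it is best read as percentage points rather than as a relative change.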

What To Try In 7 Days

Prototype a memory+LLM workflow: store summaries in a vector DB (FAISS) and retrieve topK per time-layer.

Run backtests comparing a self-adaptive risk prompt vs fixed risk text prompts on one ticker.

Tune TopK retrieval (start with K=5) and compare cumulative return and drawdown on historical data.
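As a starting point for the prototype above, here is a dependency-free sketch of per-layer top-K retrieval. It uses brute-force cosine similarity in plain Python as a stand-in for a FAISS index, and toy two-dimensional vectors in place of real embeddings; the tuple layout and the name `retrieve_top_k` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(query_vec, memories, k=5):
    """Return the k most similar memory texts for each layer.

    `memories` is a list of (layer, text, embedding) tuples (hypothetical layout).
    """
    by_layer = defaultdict(list)
    for layer, text, emb in memories:
        by_layer[layer].append((cosine(query_vec, emb), text))
    return {
        layer: [text for _, text in sorted(items, key=lambda it: it[0], reverse=True)[:k]]
        for layer, items in by_layer.items()
    }


# Toy data: placeholder summaries with 2-d vectors standing in for embeddings.
memories = [
    ("shallow", "news summary A", [1.0, 0.0]),
    ("shallow", "news summary B", [0.0, 1.0]),
    ("deep", "annual filing summary", [0.8, 0.6]),
]
top = retrieve_top_k([1.0, 0.1], memories, k=1)
```

Sweeping `k` over a small range and rerunning the same backtest is a simple way to see how sensitive returns and drawdown are to the retrieval size.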

Agent Features

Memory

working memory (summarize, observe, reflect)

layered long-term memory (shallow/intermediate/deep)

promotion of important events between layers

Planning

immediate reflection (daily decision)

extended reflection (M-day retrospection)

Tool Use

FAISS vector DB for retrieval

OpenAI embeddings for relevancy scoring

Guardrails AI for action validation
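To illustrate what action validation guards against, the sketch below checks an LLM response before it reaches the trading loop. It is plain Python standing in for the idea and does not use the Guardrails AI API; the JSON schema with `action` and `reason` fields and the fallback to `hold` are assumptions.

```python
import json

ALLOWED_ACTIONS = {"buy", "sell", "hold"}


def validate_action(raw_llm_output: str, fallback: str = "hold") -> dict:
    """Parse an LLM response and fall back to a safe action if it is malformed."""
    try:
        parsed = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return {"action": fallback, "reason": "unparseable output", "valid": False}
    if not isinstance(parsed, dict):
        return {"action": fallback, "reason": "output is not a JSON object", "valid": False}
    action = str(parsed.get("action", "")).strip().lower()
    if action not in ALLOWED_ACTIONS:
        return {"action": fallback, "reason": f"unknown action {action!r}", "valid": False}
    return {"action": action, "reason": str(parsed.get("reason", "")), "valid": True}
```

Logging every response with `valid` set to `False` gives a rough measure of how often the backbone model strays from the expected output format.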

Frameworks

Generative Agents (inspired)

RAG via vector retrieval into LLM prompts

Is Agentic

Yes

Architectures

LLM backbone with modular agent structure

profiling + working memory + layered long-term memory

decision-making (immediate and extended reflection)

Optimization Features

Token Efficiency

summaries fed into memory reduce raw prompt size

Training Optimization

short training window (6 months) shown sufficient in experiments

Reproducibility

Code Available: Yes

Data Available: No

Open Source Status: Partial

License: Unknown

Code URLs

Source code referenced in paper: 'FINMEM LLM Trading'

Risks & Boundaries

Limitations

Results come from historical backtests; no live trading or transaction-cost analysis.

Experiments use general-purpose LLMs and data of limited quality; performance may change with market regimes.

When Not To Use

For high-frequency or tick-level trading where millisecond latency matters.

When transaction costs and slippage are critical and not simulated.
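Since the reported backtests do not include costs, a quick sensitivity check is to net a proportional cost out of the strategy's daily returns. The sketch below assumes positions in {-1, 0, 1}, a cost charged per unit of position change, and an illustrative 10 basis points; none of these come from the paper.

```python
def net_returns(asset_returns, positions, cost_bps=10.0):
    """Daily strategy returns after proportional trading costs.

    positions[t] is the position held during day t (-1, 0, or 1). A cost of
    `cost_bps` basis points is charged on each unit of position change,
    starting from a flat (zero) position.
    """
    if len(asset_returns) != len(positions):
        raise ValueError("asset_returns and positions must have equal length")
    cost = cost_bps / 10_000.0
    out = []
    prev = 0
    for r, pos in zip(asset_returns, positions):
        turnover = abs(pos - prev)  # 2 when flipping long to short
        out.append(pos * r - turnover * cost)
        prev = pos
    return out
```

Rerunning the same backtest at several `cost_bps` values shows how much of the reported edge would survive realistic trading frictions.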

Failure Modes

LLM hallucinations or incorrect summarizations leading to wrong trades.

Memory-promotion rules elevating misleading events and causing persistent bias.

Core Entities

Models

GPT-4, GPT-4-Turbo, GPT-3.5-Turbo, davinci-003, Llama2-70b-chat, FinGPT, Generative Agents, PPO, DQN, A2C, Buy-and-Hold

Metrics

Cumulative Return, Sharpe Ratio, Daily Volatility, Annualized Volatility, Max Drawdown
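For readers reimplementing the risk metrics, here is a minimal sketch of the volatility and drawdown calculations from a series of daily returns. The sample standard deviation, the 252-day annualization factor, and the compounded equity curve are assumptions; the paper may define these metrics differently.

```python
import math
from statistics import stdev


def daily_volatility(daily_returns):
    """Sample standard deviation of daily returns (needs at least two values)."""
    return stdev(daily_returns)


def annualized_volatility(daily_returns, periods_per_year=252):
    """Daily volatility scaled by the square root of trading days per year."""
    return daily_volatility(daily_returns) * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough decline of the compounded equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst
```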

Datasets

Yahoo Finance OHLCV, Alpaca News API (Benzinga), Company 10-Q and 10-K filings, FINMEM Layered Long-term Memory (FAISS)
