An LLM trading agent that uses working + layered long-term memory and a dynamic trader profile to beat standard baselines on backtests

November 23, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The paper backs its claims with multi-stock backtests and ablations, but the results are limited to historical backtests on specific tickers and depend on commercial LLMs; live-trading safety and transaction costs are not evaluated.

Citations: 7

Evidence Strength: 0.80

Confidence: 0.82

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 6/6

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 50%

Novelty: 70%

Authors

Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Denghui Zhang, Rong Liu, Jordan W. Suchow, Khaldoun Khashanah

Links

Abstract / PDF / Code

Why It Matters For Business

FINMEM shows that LLM agents with structured, time-aware memory can produce better risk-adjusted returns in backtests while using shorter training histories, which helps when trading newer stocks or deploying quickly.

Who Should Care

Summary TLDR

FINMEM is an LLM-driven single-stock trading agent that adds a human-like memory system (working memory plus shallow/intermediate/deep long-term layers) and a dynamic character (three risk profiles, including self-adaptive). On historical backtests across five stocks, FINMEM (with GPT-4 / GPT-4-Turbo) produced substantially higher cumulative returns and Sharpe Ratios than Buy-and-Hold, several DRL agents, and two other LLM agents. Key knobs that change results are the backbone LLM, the working-memory retrieval size (TopK), and the risk-inclination profile.

Problem Statement

Existing trading agents either lack interpretability (many DRL systems) or lack structured memory and time-aware handling of financial signals (existing LLM agents treat incoming data indiscriminately). Traders and automated systems need an agent that (1) remembers and weights events by their time-sensitivity and importance, (2) adapts risk stance dynamically, and (3) can learn from a short, multi-source historical window.

Main Contribution

FINMEM architecture: profiling (dynamic character + risk profiles), working memory, and layered long-term memory (shallow/intermediate/deep) tailored for finance.

Memory scoring and promotion: recency, relevancy (embeddings), and importance with layer-specific decay and promotion rules.
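The scoring and promotion idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exponential decay form, the time constants in `LAYER_DECAY_DAYS`, the unweighted sum of the three terms, the thresholds in `PROMOTE_ABOVE`, and the recency reset on promotion are all assumptions chosen for readability.

```python
import math
from dataclasses import dataclass

# Hypothetical recency time constants in days; deeper layers forget more slowly.
LAYER_DECAY_DAYS = {"shallow": 14.0, "intermediate": 90.0, "deep": 365.0}
LAYER_ORDER = ["shallow", "intermediate", "deep"]
# Hypothetical importance thresholds for moving an event one layer deeper.
PROMOTE_ABOVE = {"shallow": 0.6, "intermediate": 0.8}


@dataclass
class MemoryEvent:
    text: str
    embedding: list      # embedding vector of the summarized event
    layer: str           # "shallow", "intermediate", or "deep"
    age_days: float      # time since the event was stored or last promoted
    importance: float    # assumed to lie in [0, 1]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recency(event):
    """Exponential decay with a layer-specific time constant (assumed form)."""
    return math.exp(-event.age_days / LAYER_DECAY_DAYS[event.layer])


def score(event, query_embedding):
    """Unweighted sum of recency, relevancy, and importance (assumed weighting)."""
    return recency(event) + cosine(event.embedding, query_embedding) + event.importance


def maybe_promote(event):
    """Move an event one layer deeper when its importance clears the layer threshold."""
    threshold = PROMOTE_ABOVE.get(event.layer)  # None for the deepest layer
    if threshold is not None and event.importance >= threshold:
        event.layer = LAYER_ORDER[LAYER_ORDER.index(event.layer) + 1]
        event.age_days = 0.0  # assumed: recency resets on promotion
    return event
```

Ranking candidate events by `score` and calling `maybe_promote` after each update keeps long-lived information, such as annual filings, retrievable for longer than daily news.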

Key Findings

FINMEM achieved the highest backtested cumulative return and risk-adjusted performance across tested stocks.

Numbers: TSLA cumulative return = 61.78%, Sharpe = 2.6789 (Table 2)

Practical Use: Use FINMEM-style memory + LLM backbone to improve single-stock trading returns in backtests; expect higher returns per unit risk on similar historical data when using GPT-4-family models.

Evidence Ref: Table 2

FINMEM needs substantially less historical training time than DRL agents to reach strong performance.

Numbers: 6 months of training (Aug 2021–Feb 2022) was enough to produce the top cumulative returns during the test period (Figure 7)

Practical Use: For recent IPOs or other stocks with short histories, prefer FINMEM-style LLM agents over DRL, since they can learn from shorter multi-source windows.

Evidence Ref: Section 5.2; Figure 7

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Cumulative Return (TSLA, testing) | 61.7758% | Buy-and-Hold: -18.6312% | +80.4070 percentage points | TSLA (Oct 06, 2022 – Apr 10, 2023) | Table 2; reported average over 5 trials | Table 2
Sharpe Ratio (TSLA, testing) | 2.6789 | Buy-and-Hold: -0.5410 | +3.2199 | TSLA | Table 2 (average over trials) | Table 2
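Each delta in the results is a plain difference between the agent and the baseline, for example 61.7758 - (-18.6312) = 80.4070 percentage points of cumulative return. The two metrics themselves can be computed along the following lines. This is a sketch under stated assumptions: cumulative return is taken as the sum of position-weighted daily log returns, the Sharpe Ratio is annualized with 252 trading days, and the risk-free rate defaults to zero. The paper's exact conventions may differ, and the function names are illustrative.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed annualization factor


def daily_log_returns(prices, positions):
    """Position-weighted log returns.

    positions[t] is the stance held from day t to day t+1:
    +1 for long, -1 for short, 0 for flat.
    """
    return [
        positions[t] * math.log(prices[t + 1] / prices[t])
        for t in range(len(prices) - 1)
    ]


def cumulative_return(returns):
    """Sum of daily log returns over the test window."""
    return sum(returns)


def sharpe_ratio(returns, risk_free_daily=0.0):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free_daily for r in returns]
    sd = statistics.stdev(excess)
    if sd == 0:
        return float("nan")  # undefined when returns never vary
    return statistics.mean(excess) / sd * math.sqrt(TRADING_DAYS)


# Buy-and-Hold is the special case where every position is +1.
prices = [100.0, 110.0, 99.0, 99.0]
buy_and_hold = daily_log_returns(prices, [1, 1, 1])
agent = daily_log_returns(prices, [1, -1, 0])
delta = cumulative_return(agent) - cumulative_return(buy_and_hold)
```

Running the same functions on the agent's positions and on an all-long position vector gives the two columns whose difference is reported as the delta.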

What To Try In 7 Days

Prototype a memory+LLM workflow: store summaries in a vector DB (FAISS) and retrieve topK per time-layer.

Run backtests comparing a self-adaptive risk prompt vs fixed risk text prompts on one ticker.

Tune TopK retrieval (start with K=5) and compare cumulative return and drawdown on historical data.
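The first and third steps can be prototyped before any vector database is installed. In the sketch below, a brute-force cosine search stands in for a FAISS index so that it runs on the standard library alone. The `LayeredStore` class, its method names, and `build_prompt_context` are hypothetical, and the two-dimensional embeddings in the usage lines are toy values rather than real model outputs.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class LayeredStore:
    """Toy in-memory store; a real prototype would back each layer with a vector index."""

    def __init__(self, layers=("shallow", "intermediate", "deep")):
        self.layers = {name: [] for name in layers}

    def add(self, layer, summary, embedding):
        self.layers[layer].append((summary, embedding))

    def top_k(self, query_embedding, k=5):
        """Return up to k (similarity, summary) pairs per layer, best first."""
        result = {}
        for name, items in self.layers.items():
            scored = [(cosine(emb, query_embedding), text) for text, emb in items]
            result[name] = heapq.nlargest(k, scored, key=lambda pair: pair[0])
        return result


def build_prompt_context(retrieved):
    """Flatten per-layer hits into lines that can be pasted into an LLM prompt."""
    lines = []
    for layer, hits in retrieved.items():
        for _similarity, text in hits:
            lines.append(f"[{layer}] {text}")
    return "\n".join(lines)


store = LayeredStore()
store.add("shallow", "Daily news: delivery numbers beat estimates", [1.0, 0.0])
store.add("shallow", "Daily news: unrelated sector downgrade", [0.0, 1.0])
store.add("deep", "10-K: long-term capacity expansion plan", [0.6, 0.8])
context = build_prompt_context(store.top_k([1.0, 0.0], k=1))
```

Sweeping `k` over a small range (for instance 1, 3, 5, 10) and rerunning the same backtest for each value gives the TopK comparison described above.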

Agent Features

Memory
working memory (summarize, observe, reflect); layered long-term memory (shallow/intermediate/deep); promotion of important events between layers
Planning
immediate reflection (daily decision); extended reflection (M-day retrospection)
Tool Use
FAISS vector DB for retrieval; OpenAI embeddings for relevancy scoring; Guardrails AI for action validation
Frameworks
Generative Agents (inspired); RAG via vector retrieval into LLM prompts
Is Agentic

Yes

Architectures
LLM backbone with modular agent structure; profiling + working memory + layered long-term memory; decision-making (immediate and extended reflection)

Optimization Features

Token Efficiency
summaries fed into memory reduce raw prompt size
Training Optimization
short training window (6 months) shown sufficient in experiments

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Unknown

Code URLs

Source code referenced in paper: 'FINMEM LLM Trading'

Risks & Boundaries

Limitations

Results come from historical backtests; no live trading or transaction-cost analysis.

Experiments rely on general-purpose LLMs and data of limited quality; performance may change with market regimes.

When Not To Use

For high-frequency or tick-level trading where millisecond latency matters.

When transaction costs and slippage are critical and not simulated.

Failure Modes

LLM hallucinations or incorrect summarizations leading to wrong trades.

Memory-promotion rules elevating misleading events and causing persistent bias.

Core Entities

Models

GPT-4, GPT-4-Turbo, GPT-3.5-Turbo, davinci-003, Llama2-70b-chat, FinGPT, Generative Agents, PPO, DQN, A2C, Buy-and-Hold

Metrics

Cumulative Return, Sharpe Ratio, Daily Volatility, Annualized Volatility, Max Drawdown
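For the risk metrics in this list, a minimal sketch using common textbook definitions is given below. It assumes daily volatility is the sample standard deviation of daily returns, annualization uses 252 trading days, and max drawdown is measured on an equity curve as a fraction of the running peak. Whether the paper uses exactly these conventions is an assumption.

```python
import math
import statistics


def daily_volatility(returns):
    """Sample standard deviation of daily returns."""
    return statistics.stdev(returns)


def annualized_volatility(returns, trading_days=252):
    """Daily volatility scaled by the square root of trading days per year (assumed 252)."""
    return daily_volatility(returns) * math.sqrt(trading_days)


def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Toy equity curve: peak of 120 followed by a trough of 90.
drawdown = max_drawdown([100.0, 120.0, 90.0, 130.0, 117.0])
```

On the toy curve the deepest decline runs from 120 to 90, so the later dip from 130 to 117 does not change the result.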

Datasets

Yahoo Finance OHLCV, Alpaca News API (Benzinga), Company 10-Q and 10-K filings, FINMEM Layered Long-term Memory (FAISS)