An LLM trading agent that uses working + layered long-term memory and a dynamic trader profile to beat standard baselines on backtests

November 23, 2023 · 8 min

Overview

Production Readiness

0.5

Novelty Score

0.7

Cost Impact Score

0.6

Citation Count

7

Authors

Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Denghui Zhang, Rong Liu, Jordan W. Suchow, Khaldoun Khashanah

Why It Matters For Business

FINMEM shows that LLM agents with structured, time-aware memory can produce better risk-adjusted returns in backtests while using shorter training histories, which helps when trading newer stocks or deploying quickly.

Summary TLDR

FINMEM is an LLM-driven single-stock trading agent that adds a human-like memory system (working memory plus shallow/intermediate/deep long-term layers) and a dynamic character (three risk profiles, including self-adaptive). On historical backtests across five stocks, FINMEM (with GPT-4 / GPT-4-Turbo) produced substantially higher cumulative returns and Sharpe Ratios than Buy-and-Hold, several DRL agents, and two other LLM agents. Key knobs that change results are the backbone LLM, the working-memory retrieval size (TopK), and the risk-inclination profile.

Problem Statement

Existing trading agents either lack interpretability (many DRL systems) or lack structured memory and time-aware handling of financial signals (existing LLM agents treat incoming data indiscriminately). Traders and automated systems need an agent that (1) remembers and weights events by their time-sensitivity and importance, (2) adapts risk stance dynamically, and (3) can learn from a short, multi-source historical window.

Main Contribution

FINMEM architecture: profiling (dynamic character + risk profiles), working memory, and layered long-term memory (shallow/intermediate/deep) tailored for finance.

Memory scoring and promotion: recency, relevancy (embeddings), and importance with layer-specific decay and promotion rules.

Empirical result: FINMEM (GPT-4 / GPT-4-Turbo) outperforms Buy-and-Hold, three DRL agents (PPO, DQN, A2C) and two LLM agents on five-stock backtests, while needing much shorter training windows.
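
The memory scoring described above can be sketched as follows. This is a minimal illustration, not the paper's formula: the exponential decay form, the per-layer time constants in `LAYER_STABILITY_DAYS`, the equal weighting of the three terms, and all function names are assumptions made for the example.

```python
import math

# Hypothetical per-layer time constants (in days); deeper layers forget more slowly.
LAYER_STABILITY_DAYS = {"shallow": 14.0, "intermediate": 90.0, "deep": 365.0}


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recency_score(age_days, layer):
    """Exponential decay of an event's recency, slower in deeper layers (assumed form)."""
    return math.exp(-age_days / LAYER_STABILITY_DAYS[layer])


def memory_score(query_vec, event_vec, age_days, importance, layer):
    """Sum of recency, relevancy, and importance; importance is assumed to lie in [0, 1]."""
    relevancy = cosine_similarity(query_vec, event_vec)
    return recency_score(age_days, layer) + relevancy + importance
```

With these assumed constants, a 30-day-old event scores higher on recency in the deep layer than in the shallow layer, which is the qualitative behavior that layer-specific decay is meant to give.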

Key Findings

FINMEM achieved the highest backtested cumulative return and risk-adjusted performance across tested stocks.

Numbers: TSLA cumulative return = 61.78%, Sharpe = 2.6789 (Table 2)

FINMEM needs substantially less historical training time than DRL agents to reach strong performance.

Numbers: A 6-month training window (Aug 2021–Feb 2022) was enough to produce the top cumulative returns during the test period (Figure 7)

Backbone LLM matters: GPT-4 and GPT-4-Turbo gave the best results for FINMEM.

Numbers: Cumulative Return with GPT-4 = 62.62%; GPT-4-Turbo = 54.70% (Table 3)

Working-memory retrieval size (TopK) changes performance; a moderate increase helps, but too large a K mixes in noise.

Numbers: TopK=5 gave Cumulative Return 54.70% and Sharpe 2.496; TopK=10 raised return to 79.44% but increased volatility and drawdown

Self-adaptive risk profile produced the best overall test results among risk settings.

Numbers: Self-adaptive Cumulative Return = 54.70%, Sharpe = 2.496 vs risk-seeking -19.41% (Table 4)
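
This summary does not spell out how the self-adaptive profile switches stance. One plausible sketch is shown here, assuming the agent turns risk-averse when the cumulative return over a short lookback window is negative and risk-seeking otherwise. The 3-day window, the rule itself, and the function names are assumptions for illustration.

```python
def cumulative_return(daily_returns):
    """Compound a sequence of simple daily returns into one cumulative return."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def select_risk_profile(daily_returns, lookback_days=3):
    """Pick a risk stance from recent performance (assumed switching rule).

    Returns "risk_averse" if the last `lookback_days` returns compound to a
    loss, otherwise "risk_seeking". With no history, defaults to "risk_seeking".
    """
    if lookback_days <= 0:
        raise ValueError("lookback_days must be positive")
    window = daily_returns[-lookback_days:]
    if window and cumulative_return(window) < 0.0:
        return "risk_averse"
    return "risk_seeking"
```

The returned label would then select which risk description is placed in the agent's profile prompt for the next trading day.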

Results

Cumulative Return (TSLA, testing)

Value: 61.7758%

Baseline: Buy-and-Hold -18.6312%

Sharpe Ratio (TSLA, testing)

Value: 2.6789

Baseline: Buy-and-Hold -0.5410

Cumulative Return (AMZN, testing)

Value: 23.2613%

Baseline: Buy-and-Hold 14.6949%

SFT

Value: 34.9832%

Baseline: Buy-and-Hold -30.0071%

Backbone comparison (Cumulative Return)

Value: GPT-4 = 62.618%, GPT-4-Turbo = 54.6958%

Baseline: GPT-3.5-Turbo = 16.1501%

Working memory TopK effect (Cumulative Return)

Value: TopK=1 -> 52.0936%; TopK=5 -> 54.6958%; TopK=10 -> 79.4448%

Baseline: Buy-and-Hold -66.9497%

Who Should Care

What To Try In 7 Days

Prototype a memory+LLM workflow: store summaries in a vector DB (FAISS) and retrieve topK per time-layer.

Run backtests comparing a self-adaptive risk prompt vs fixed risk text prompts on one ticker.

Tune TopK retrieval (start with K=5) and compare cumulative return and drawdown on historical data.
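
As a starting point for the first step, the per-layer top-K retrieval could look like the sketch below. To stay dependency-free it uses a plain-Python cosine-similarity scan as a stand-in for a FAISS index; in a real prototype, one FAISS index per layer would replace the linear scan. `MemoryEvent` and `retrieve_top_k_per_layer` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryEvent:
    text: str        # summary stored in memory
    embedding: list  # embedding vector of the summary
    layer: str       # "shallow", "intermediate", or "deep"


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k_per_layer(events, query_embedding, k=5):
    """Return {layer: [up to k events]} ranked by similarity to the query, best first."""
    by_layer = {}
    for event in events:
        by_layer.setdefault(event.layer, []).append(event)
    results = {}
    for layer, layer_events in by_layer.items():
        ranked = sorted(
            layer_events,
            key=lambda e: cosine_similarity(e.embedding, query_embedding),
            reverse=True,
        )
        results[layer] = ranked[:k]
    return results
```

Sweeping `k` over a few values (for example 1, 5, 10) and rerunning the same backtest is then a direct way to compare cumulative return and drawdown across retrieval sizes.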

Agent Features

Memory

  • working memory (summarize, observe, reflect)
  • layered long-term memory (shallow/intermediate/deep)
  • promotion of important events between layers
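
The promotion of events between layers can be illustrated with a small daily update step. This is a sketch under stated assumptions: the decay rates, the promotion and drop thresholds, and the idea of a per-event importance "boost" (for example, after an event contributed to a good decision) are hypothetical values chosen for the example, not parameters from the paper.

```python
from dataclasses import dataclass

LAYERS = ["shallow", "intermediate", "deep"]
# Hypothetical daily decay multipliers; deeper layers retain importance longer.
DAILY_DECAY = {"shallow": 0.90, "intermediate": 0.97, "deep": 0.99}
PROMOTION_THRESHOLD = 0.8  # assumed: at or above this, move one layer deeper
DROP_THRESHOLD = 0.05      # assumed: below this, the event is forgotten


@dataclass
class MemoryEvent:
    text: str
    layer: str
    importance: float


def step_one_day(events, boosts=None):
    """Decay every event by one day, apply optional boosts, then promote or drop."""
    boosts = boosts or {}
    survivors = []
    for event in events:
        importance = event.importance * DAILY_DECAY[event.layer]
        importance += boosts.get(event.text, 0.0)
        layer = event.layer
        if importance >= PROMOTION_THRESHOLD and layer != LAYERS[-1]:
            layer = LAYERS[LAYERS.index(layer) + 1]
        if importance >= DROP_THRESHOLD:
            survivors.append(MemoryEvent(event.text, layer, importance))
    return survivors
```

For example, a shallow event with importance 0.5 that receives a 0.4 boost ends the day at 0.85 and moves to the intermediate layer, while an unboosted shallow event at 0.05 decays below the drop threshold and is removed.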

Planning

  • immediate reflection (daily decision)
  • extended reflection (M-day retrospection)

Tool Use

  • FAISS vector DB for retrieval
  • OpenAI embeddings for relevancy scoring
  • Guardrails AI for action validation
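
The action-validation step can be approximated without any library. The sketch below is a plain-Python stand-in and does not use the Guardrails AI API. It assumes the LLM is asked to reply with a JSON object containing an `action` field, which is a hypothetical schema, and it falls back to `hold` whenever the reply cannot be parsed or names an unknown action.

```python
import json

ALLOWED_ACTIONS = {"buy", "sell", "hold"}


def parse_trading_action(raw_response, fallback="hold"):
    """Extract a validated action from an LLM reply, or return the fallback.

    Expects a JSON object such as {"action": "buy", "reason": "..."}; this
    schema is an assumption made for the example.
    """
    try:
        payload = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return fallback
    if not isinstance(payload, dict):
        return fallback
    action = str(payload.get("action", "")).strip().lower()
    return action if action in ALLOWED_ACTIONS else fallback
```

Defaulting to `hold` is a conservative design choice: a malformed reply then changes no position instead of triggering an unintended trade.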

Frameworks

  • Generative Agents (inspired)
  • RAG via vector retrieval into LLM prompts

Is Agentic

true

Architectures

  • LLM backbone with modular agent structure
  • profiling + working memory + layered long-term memory
  • decision-making (immediate and extended reflection)

Optimization Features

Token Efficiency

  • summaries fed into memory reduce raw prompt size

Training Optimization

  • short training window (6 months) shown sufficient in experiments

Reproducibility

Code Urls

  • Source code referenced in paper: 'FINMEM LLM Trading'

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Results come from historical backtests; no live trading or transaction-cost analysis.
  • Experiments use general-purpose LLMs and data of limited quality; performance may change across market regimes.
  • Dependence on commercial LLMs (GPT-4 family) affects cost and reproducibility.
  • Memory scoring and K tuning may overfit to tested tickers and time windows.

When Not To Use

  • For high-frequency or tick-level trading where millisecond latency matters.
  • When transaction costs and slippage are critical and not simulated.
  • If you cannot afford the cost of high-capability LLM API calls.

Failure Modes

  • LLM hallucinations or incorrect summarizations leading to wrong trades.
  • Memory-promotion rules elevating misleading events and causing persistent bias.
  • TopK and risk rules overfit to historical windows and fail to generalize forward.
  • Backtest-to-live gap due to missing transaction cost and market impact modeling.
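
To get a rough sense of the backtest-to-live gap, a simple cost model can be added to a backtest. The sketch below assumes a proportional cost in basis points charged on each change of position, with positions expressed as -1 (short), 0 (flat), or 1 (long). This cost model and the function name are illustrative assumptions, and market impact is not modeled.

```python
def backtest_cumulative_return(positions, asset_returns, cost_bps=0.0):
    """Cumulative strategy return with a proportional cost on position changes.

    positions[i] is the position held during day i; asset_returns[i] is the
    asset's simple return on day i. The cost is cost_bps basis points of
    equity per unit of position change (assumed model).
    """
    if len(positions) != len(asset_returns):
        raise ValueError("positions and asset_returns must have the same length")
    cost_rate = cost_bps / 10_000.0
    growth = 1.0
    previous_position = 0.0
    for position, asset_return in zip(positions, asset_returns):
        turnover = abs(position - previous_position)
        growth *= 1.0 - cost_rate * turnover      # pay for the trade first
        growth *= 1.0 + position * asset_return   # then earn the day's return
        previous_position = position
    return growth - 1.0
```

Running the same position sequence with `cost_bps=0` and with a realistic value shows how much of a backtested return survives trading frictions; strategies that change position often are affected most.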

Core Entities

Models

  • GPT-4
  • GPT-4-Turbo
  • GPT-3.5-Turbo
  • davinci-003
  • Llama2-70b-chat
  • FinGPT
  • Generative Agents
  • PPO
  • DQN
  • A2C
  • Buy-and-Hold

Metrics

  • Cumulative Return
  • Sharpe Ratio
  • Daily Volatility
  • Annualized Volatility
  • Max Drawdown
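
For reference, the metrics listed above can be computed from a series of simple daily returns as sketched below. These are common textbook definitions; the 252-day annualization factor, the zero default risk-free rate, and the use of simple rather than log returns are assumptions, and the paper's exact conventions may differ.

```python
import math
import statistics

TRADING_DAYS_PER_YEAR = 252  # assumed annualization factor


def cumulative_return(daily_returns):
    """Compounded return over the whole period."""
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth - 1.0


def daily_volatility(daily_returns):
    """Sample standard deviation of daily returns (needs at least two values)."""
    return statistics.stdev(daily_returns)


def annualized_volatility(daily_returns):
    """Daily volatility scaled by the square root of trading days per year."""
    return daily_volatility(daily_returns) * math.sqrt(TRADING_DAYS_PER_YEAR)


def sharpe_ratio(daily_returns, risk_free_daily=0.0):
    """Annualized Sharpe ratio; returns NaN when volatility is zero."""
    excess = [r - risk_free_daily for r in daily_returns]
    vol = statistics.stdev(excess)
    if vol == 0.0:
        return math.nan
    return statistics.mean(excess) / vol * math.sqrt(TRADING_DAYS_PER_YEAR)


def max_drawdown(daily_returns):
    """Largest peak-to-trough decline of the equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```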

Datasets

  • Yahoo Finance OHLCV
  • Alpaca News API (Benzinga)
  • Company 10-Q and 10-K filings
  • FINMEM Layered Long-term Memory (FAISS)