A manager–analyst LLM multi-agent system that uses verbalized, episode-level belief updates (CVRF) plus daily CVaR alerts to improve trading and (small…

July 9, 2024 · 9 min

Overview

Production Readiness

0.45

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

11

Authors

Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yupeng Cao, Zhi Chen, Jordan W. Suchow, Rong Liu, Zhenyu Cui, Zhaozhuo Xu, Denghui Zhang, Koduvayur Subbalakshmi, Guojun Xiong, Yueru He, Jimin Huang, Dong Li, Qianqian Xie

Links

Abstract / PDF

Why It Matters For Business

FINCON shows that structuring LLMs like a small investment team plus two-tiered risk controls can raise backtested returns and Sharpe ratios while reducing chatter. This suggests a practical path for building LLM-based decision pipelines for small active portfolios and research prototypes.

Summary TLDR

FINCON is a multi-agent system that organizes specialist LLM analysts under a single manager. It adds dual-level risk control: a daily CVaR alert that forces risk-averse actions within an episode, and Conceptual Verbal Reinforcement (CVRF), which updates manager and analyst prompts across episodes using overlap-based, textual gradient-style steps. In backtests (Jan 2022–Jun 2023, GPT-4-Turbo backbone), FINCON outperforms several LLM and DRL baselines on cumulative return and Sharpe ratio for single-stock trading and small-portfolio tasks. The system reduces peer-to-peer chatter through selective belief propagation and uses procedural and episodic memory for retrieval.

Problem Statement

LLM agents can trade, but they struggle to (1) integrate many data sources cheaply, (2) control risk over time, and (3) refine policies across episodes using natural-language updates. Existing systems either overload a single agent or require heavy inter-agent communication, and they lack long-term belief updates.

Main Contribution

A Manager–Analyst hierarchy that assigns uni-modal analyst agents (news, filings, audio, tabular) to distill signals and a single manager that makes trades.

Dual-level risk control: within-episode CVaR alerts that trigger same-day risk aversion, and over-episode Conceptual Verbal Reinforcement (CVRF) that updates prompts using episode comparisons and an overlap-based learning rate.

A modular agent design with working/procedural/episodic memory and selective propagation of conceptual beliefs to limit unnecessary agent-to-agent communication.

Empirical evaluation on multi-modal backtests showing higher cumulative returns and Sharpe ratios versus LLM and DRL baselines for single-stock and small portfolio tasks.

Key Findings

FINCON produces much higher cumulative returns on tested stocks than baselines.

Numbers: TSLA CR 82.871% vs buy-and-hold 6.425% (Table 2)

FINCON yields strong risk-adjusted returns on portfolios in tests.

Numbers: Portfolio1 CR 113.836%, SR 3.269 vs Markowitz CR 12.636%, SR 0.614 (Table 3)

Within-episode CVaR alerts improve stability and returns.

Numbers: Portfolio CR rose 14.699% → 113.836% with CVaR (Table 4)

Over-episode belief updates (CVRF) drive faster learning across episodes.

Numbers: Trading-action overlap grew 46.94% → 81.63% across 4 episodes; CR improvement vs no belief updates (Table 5)

System is more robust in high-volatility tests than baselines.

Numbers: Under the high-VIX scenario (TSLA), FINCON CR 22.46% vs Buy-and-Hold −56.738% (Table 7)

Results

TSLA Cumulative Return (CR%)

Value: 82.871%

Baseline: Buy-and-Hold 6.425%

Portfolio1 Cumulative Return (CR%)

Value: 113.836%

Baseline: Markowitz MV 12.636%

Portfolio1 Sharpe Ratio (SR)

Value: 3.269

Baseline: Markowitz MV 0.614

Within-episode CVaR impact (Portfolio CR%)

Value: 113.836% (with CVaR)

Baseline: 14.699% (without CVaR)

Belief-update overlap increase

Value: Overlap 46.94% → 81.63% across four episodes

Who Should Care

What To Try In 7 Days

Build a manager–analyst prompt layout: assign one LLM agent per data source (news, filings, prices, audio) and a manager to consolidate outputs.

Add a daily CVaR check (tail-average of worst 1% PnL). If CVaR drops, force conservative manager actions that day.

Implement simple episodic belief updates: compare two recent trajectories, distill the top winning and losing reasons, and edit the manager prompt by a small, overlap-based step.
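The daily CVaR check in the second step could be sketched roughly as below. This is a minimal illustration, not the paper's implementation: the function names, the "alert when CVaR falls after adding today's PnL" rule, and the buy-to-hold override are all assumptions made for the example.

```python
import math


def cvar(pnl, alpha=0.01):
    """Average of the worst alpha-fraction of daily PnL values (at least one value)."""
    if not pnl:
        raise ValueError("pnl history is empty")
    k = max(1, math.ceil(alpha * len(pnl)))
    worst = sorted(pnl)[:k]
    return sum(worst) / k


def cvar_alert(pnl_history, today_pnl, alpha=0.01):
    """Assumed rule: alert when adding today's PnL lowers the CVaR estimate."""
    if not pnl_history:
        return False
    before = cvar(pnl_history, alpha)
    after = cvar(pnl_history + [today_pnl], alpha)
    return after < before


def apply_risk_override(proposed_action, alert):
    """Hypothetical conservative override: block new long exposure while alerted."""
    if alert and proposed_action == "buy":
        return "hold"
    return proposed_action


history = [0.01, -0.02, 0.005, -0.01]
alert = cvar_alert(history, today_pnl=-0.05)
action = apply_risk_override("buy", alert)
```

With short histories, the worst 1% collapses to the single worst day, so a prototype may want a larger `alpha` until enough PnL observations accumulate.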

Agent Features

Memory

  • Working memory (short-term summarization)
  • Procedural memory (per-step records, decay by agent)
  • Episodic memory (manager-level trajectories and beliefs)

Planning

  • Manager consolidates analyst outputs into sequential trading actions
  • Episode-level self-reflection (manager)

Tool Use

  • Audio transcription (Whisper)
  • External convex optimizer for portfolio weights
  • Memory retrieval and Guardrails AI

Frameworks

  • Textual gradient-descent style prompt optimizer (CVRF)
  • CVaR-based within-episode risk control

Is Agentic

true

Architectures

  • Manager-Analyst hierarchy
  • Uni-modal specialist analyst agents
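
A minimal sketch of how the manager-analyst layout might be wired, under stated assumptions: the class and field names are hypothetical, and plain Python callables stand in for the LLM calls so the example runs without any API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Analyst:
    """One specialist per data source; `summarize` stands in for an LLM call."""
    source: str
    summarize: Callable[[str], str]


@dataclass
class Manager:
    """Sole decision maker; `decide` stands in for the manager LLM call."""
    analysts: List[Analyst]
    decide: Callable[[Dict[str, str]], str]

    def step(self, raw_inputs: Dict[str, str]) -> str:
        # Each analyst only sees its own modality and returns a short brief.
        briefs = {
            a.source: a.summarize(raw_inputs[a.source])
            for a in self.analysts
            if a.source in raw_inputs
        }
        # Analysts never talk to each other; only the manager sees all briefs.
        return self.decide(briefs)


# Stub callables used purely for illustration.
news = Analyst("news", lambda text: "positive" if "beat" in text else "neutral")
filings = Analyst("filings", lambda text: "negative" if "loss" in text else "neutral")


def toy_decide(briefs: Dict[str, str]) -> str:
    score = sum(1 for v in briefs.values() if v == "positive")
    score -= sum(1 for v in briefs.values() if v == "negative")
    return "buy" if score > 0 else "sell" if score < 0 else "hold"


manager = Manager([news, filings], toy_decide)
decision = manager.step({"news": "earnings beat estimates", "filings": "no change"})
```

Routing every brief through the manager is what keeps peer-to-peer messages out of the loop in this sketch.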

Collaboration

  • Hierarchical synthesis (manager as sole decision maker)
  • Selective back-propagation of conceptual beliefs to relevant analysts

Optimization Features

Token Efficiency

  • Reduced peer-to-peer communication to save tokens and cut latency

System Optimization

  • Selective propagation of conceptual beliefs to limit messages

Training Optimization

  • Textual prompt updates using overlap-based learning rate
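
One way to picture the overlap-based learning rate is the sketch below. It computes the fraction of time steps on which two episodes took the same trading action. How that overlap maps to an edit size is an assumption here (less overlap gives a larger step), and both function names are hypothetical; the source only states that the rate is overlap-based.

```python
def action_overlap(actions_a, actions_b):
    """Fraction of aligned time steps where both episodes chose the same action."""
    if len(actions_a) != len(actions_b):
        raise ValueError("episodes must cover the same number of steps")
    if not actions_a:
        raise ValueError("episodes are empty")
    same = sum(a == b for a, b in zip(actions_a, actions_b))
    return same / len(actions_a)


def step_size(overlap):
    """Assumed mapping: the more two episodes differ, the larger the prompt edit."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    return 1.0 - overlap


episode_1 = ["buy", "sell", "hold", "buy"]
episode_2 = ["buy", "hold", "hold", "buy"]
overlap = action_overlap(episode_1, episode_2)
tau = step_size(overlap)
```

In a prompt-update loop, `tau` would be passed to the LLM as an instruction about how strongly to revise the current beliefs rather than applied numerically.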

Reproducibility

Code Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Occasional hallucinations (e.g., non-existent memory indices) noted in portfolio tasks.
  • Demonstrated only on small portfolios (3 assets); scaling to tens of assets not validated.
  • Relies on GPT-4-Turbo (cost and latency concerns for production).
  • Backtests confined to one historical window; live-market behavior uncertain.

When Not To Use

  • As-is for large-scale portfolios (tens of assets) without redesign for context limits.
  • Direct live deployment with real capital before careful out-of-sample and paper trading.
  • Settings that require low-cost, low-latency inference at scale using heavy LLMs.

Failure Modes

  • Hallucinated factual outputs leading to erroneous trades in long-context, multi-asset cases.
  • Outdated or misleading memory retrieval if procedural memory ranking is wrong.
  • Overfitting to the training window causing poor generalization to unseen regimes.
  • High cost or latency from repeated GPT-4-Turbo calls causing throttling or stale decisions.

Core Entities

Models

  • GPT-4-Turbo (backbone for all LLM agents)
  • Whisper API (audio transcription used by ECC agent)

Metrics

  • Cumulative Return (CR%)
  • Sharpe Ratio (SR)
  • Max Drawdown (MDD%)
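
For reference, the three metrics can be computed from a series of daily returns as sketched below. The conventions are assumptions for illustration (simple compounding, sample standard deviation, 252 trading days per year, zero risk-free rate by default); the paper may define them differently, for example with log returns.

```python
import math


def cumulative_return(returns):
    """Compounded return over the period, as a fraction (0.10 means 10%)."""
    growth = 1.0
    for r in returns:
        growth *= 1.0 + r
    return growth - 1.0


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    n = len(excess)
    if n < 2:
        raise ValueError("need at least two returns")
    mean = sum(excess) / n
    var = sum((x - mean) ** 2 for x in excess) / (n - 1)
    if var == 0:
        raise ValueError("returns have zero variance")
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve, as a positive fraction."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```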

Datasets

  • Multi-modal dataset Jan 3 2022 – Jun 10 2023 (Yahoo Finance prices, Refinitiv/Alpaca news, SEC 10-Q/
  • Training period: Jan 3 2022 – Oct 4 2022; Testing: Oct 5 2022 – Jun 10 2023

Benchmarks

  • Buy-and-Hold
  • GENERATIVE AGENT (GA)
  • FINGPT
  • FINMEM
  • FINAGENT
  • A2C
  • PPO
  • DQN
  • MarkowitzMV
  • FinRL-A2C
  • Equal-Weighted ETF