Reprogram frozen LLMs to forecast time series using text prototypes and Prompt-as-Prefix

October 3, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

The method is practical: it adds only a small trainable module, the code is public, and the broad benchmark evaluation supports applying it in production-like settings where an LLM is already available.

Citations: 127

Evidence Strength: 0.70

Confidence: 0.85

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 70%

Production readiness: 70%

Novelty: 60%

Authors

Ming Jin, Shiyu Wang, Lintao Ma, Zhixuan Chu, James Y. Zhang, Xiaoming Shi, Pin-Yu Chen, Yuxuan Liang, Yuan-Fang Li, Shirui Pan, Qingsong Wen

Links

Abstract / PDF / Code / Data

Why It Matters For Business

You can add time series forecasting to an existing LLM deployment with little extra training and often better accuracy in low-data and cross-domain cases.

Who Should Care

Summary TLDR

This paper shows you can reuse a frozen large language model (LLM) for time series forecasting by converting local time windows into a small set of learned "text prototypes" and adding a structured natural-language prefix (Prompt-as-Prefix). The lightweight reprogramming network (<6.6M trainable params) trains fast and keeps the backbone LLM frozen. On standard benchmarks (ETT, Weather, Electricity, Traffic, M4) TIME-LLM matches or outperforms specialized forecasting models and recent LLM-based baselines, particularly in few-shot and zero-shot settings.

Problem Statement

Time series models are usually task-specific and need lots of data. LLMs are powerful but operate on discrete tokens and were not designed for continuous time series. The challenge is aligning continuous time series with LLMs without expensive fine-tuning of the backbone.

Main Contribution

A reprogramming method that maps time series patches into a small set of learned text prototypes so a frozen LLM can process them.

Prompt-as-Prefix (PaP): add structured natural-language context (dataset caption, task instruction, input statistics) to guide the LLM’s transformation of patches.
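As a rough illustration of the reprogramming idea described above, the sketch below lets patch embeddings attend over a small set of prototype vectors with single-head cross-attention. It is a minimal sketch, not the authors' implementation: the weights are random stand-ins for learned parameters, all dimensions are arbitrary, and the helper names (`matmul`, `rand_matrix`, `reprogram`) are hypothetical. It also omits the final projection to the backbone's hidden width.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random stand-in for weights that would normally be learned."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def reprogram(patch_emb, prototypes, w_q, w_k, w_v):
    """Single-head cross-attention: each patch queries the text prototypes."""
    q = matmul(patch_emb, w_q)             # (num_patches, d)
    k = matmul(prototypes, w_k)            # (num_prototypes, d)
    v = matmul(prototypes, w_v)            # (num_prototypes, d)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]   # (d, num_prototypes)
    scores = matmul(q, k_t)                # (num_patches, num_prototypes)
    attn = [softmax([s / scale for s in row]) for row in scores]
    return matmul(attn, v), attn


rng = random.Random(0)
num_patches, d_patch, num_prototypes, d_text, d = 4, 8, 6, 8, 8  # arbitrary sizes
patch_emb = rand_matrix(num_patches, d_patch, rng)     # embedded time series patches
prototypes = rand_matrix(num_prototypes, d_text, rng)  # stand-in text prototypes
w_q = rand_matrix(d_patch, d, rng)
w_k = rand_matrix(d_text, d, rng)
w_v = rand_matrix(d_text, d, rng)

out, attn = reprogram(patch_emb, prototypes, w_q, w_k, w_v)
print(len(out), len(out[0]))  # one reprogrammed vector per patch
```

Each row of `attn` is a distribution over prototypes, so every patch is expressed as a weighted mix of text-space vectors that a frozen LLM can consume.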

Key Findings

TIME-LLM improves average long-term MSE over a fine-tuned LLM baseline (GPT4TS).

Numbers: ≈12% average MSE reduction vs GPT4TS on evaluated long-term benchmarks

Practical Use: If you already consider LLM-based forecasting, reprogramming with PaP can give noticeably better accuracy without fine-tuning the backbone.

Evidence Ref: Sec. 4.1; Table 1

TIME-LLM yields state-of-the-art results in most long-term cases.

Numbers: SOTA in 36 out of 40 long-term instances (across eight benchmarks)

Practical Use: For many standard forecasting benchmarks, you can replace task-specific models with a reprogrammed LLM and expect top-tier accuracy.

Evidence Ref: Sec. D.1; Table 10

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
Long-term MSE (vs GPT4TS) | ≈12% average reduction | GPT4TS | ≈12% | ETT / Weather / Electricity / Traffic / ILI (aggregated) | Sec. 4.1; Table 1 | Table 1
Short-term SMAPE (M4 benchmark) | 11.983 (TIME-LLM) | GPT4TS 12.690 | ≈5.5% relative SMAPE improvement | M4 (aggregated) | Sec. 4.2; Table 2 | Table 2
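The relative delta for the M4 row can be reproduced from the two SMAPE values quoted above. The sketch below shows that arithmetic and, for reference, the M4-style SMAPE definition. The function names are hypothetical and the sample inputs passed to `smape` are made up for illustration.

```python
def smape(actual, forecast):
    """Symmetric MAPE in percent (M4-style); terms with a zero denominator count as 0."""
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        if denom > 0:
            total += 2.0 * abs(a - f) / denom
    return 100.0 * total / len(actual)


def relative_improvement(baseline, candidate):
    """Percent reduction of an error metric relative to the baseline."""
    return 100.0 * (baseline - candidate) / baseline


# SMAPE values as quoted in the results row (GPT4TS vs TIME-LLM on M4).
gain = relative_improvement(12.690, 11.983)
print(f"relative SMAPE improvement: {gain:.2f}%")

# Illustrative inputs only, not benchmark data.
print(f"toy SMAPE: {smape([100.0, 200.0], [110.0, 180.0]):.3f}")
```

With these rounded inputs the improvement comes out at about 5.6%, consistent with the approximate figure in the table.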

What To Try In 7 Days

Run the TIME-LLM code on one of your standard forecasting datasets (e.g., hourly electricity) using a frozen Llama-7B backbone.

Replace cluster-dependent fine-tuning with the small reprogramming module (<6.6M trainable params) to test accuracy and training speed.

Craft Prompt-as-Prefix entries: dataset caption, task instruction, and simple input statistics (trend, top lags) and measure lift versus no-prompt.
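A Prompt-as-Prefix entry of the kind described above could be assembled as follows. This is a minimal sketch under stated assumptions: the template wording, the `build_prefix` and `top_lags` helpers, the trend rule (last value versus first value), and the sample caption are all hypothetical and are not taken from the released code.

```python
import statistics


def top_lags(series, k=3):
    """Return the k lags (>= 1) with the highest autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    var = sum(c * c for c in centered)
    if var == 0:
        return []  # constant series: no meaningful lags
    scores = []
    for lag in range(1, n // 2 + 1):
        acf = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / var
        scores.append((acf, lag))
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest correlation first
    return [lag for _, lag in scores[:k]]


def build_prefix(series, caption, horizon, k=3):
    """Compose dataset caption, task instruction, and input statistics as text."""
    # Assumed trend rule: compare the last observation with the first.
    trend = "upward" if series[-1] > series[0] else "downward"
    lags = ", ".join(str(lag) for lag in top_lags(series, k))
    return (
        f"Dataset: {caption}\n"
        f"Task: forecast the next {horizon} steps given the previous "
        f"{len(series)} steps.\n"
        f"Input statistics: min {min(series):.2f}, max {max(series):.2f}, "
        f"median {statistics.median(series):.2f}, trend {trend}, "
        f"top lags {lags}."
    )


window = [0.0, 1.0, 0.0, -1.0] * 6  # toy series with a period of 4 steps
prefix = build_prefix(window, "Hourly electricity load (hypothetical caption)", 96)
print(prefix)
```

To measure lift versus no prompt, one option is to train and evaluate twice with the same seed, once with this prefix and once with an empty string, and compare the resulting error metrics.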

Optimization Features

Token Efficiency: Input patching reduces token sequence length compared with raw time steps
Infra Optimization: LoRA
Model Optimization: Keep backbone frozen; only small adapter-like module trained
System Optimization: Compatible with off-the-shelf quantization and PEFT techniques
Training Optimization: Train <6.6M params; batch sizes and epochs small (see Tab. 9)
Inference Optimization: Inference cost dominated by backbone LLM; can apply quantization
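
The token-efficiency point can be made concrete with a small patching sketch. The patch length of 16, stride of 8, and end-padding by repeating the last value are assumptions chosen for illustration, and `make_patches` is a hypothetical helper, not a function from the released code.

```python
def make_patches(series, patch_len=16, stride=8):
    """Split a 1-D series into overlapping patches after padding the end."""
    # Assumed padding: repeat the last value `stride` times so the tail is covered.
    padded = list(series) + [series[-1]] * stride
    return [
        padded[start:start + patch_len]
        for start in range(0, len(padded) - patch_len + 1, stride)
    ]


series = [float(t) for t in range(512)]  # toy input window of 512 time steps
patches = make_patches(series)

# With this padding the count equals floor((T - patch_len) / stride) + 2.
print(len(series), "time steps ->", len(patches), "patch tokens")
```

Under these assumptions a 512-step window becomes 64 patch tokens, an 8x shorter sequence for the backbone to process.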

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Data URLs

ETT, M4 and other public benchmarks referenced in paper (see Sec.B.2)

Risks & Boundaries

Limitations

Depends on access to large pretrained LLMs; backbone inference cost remains high.

Numeric precision and tokenization quirks of LLMs can complicate direct numeric outputs. The method sidesteps this by projecting the LLM's output representations to forecast values instead of decoding numbers as text, but some risk remains.

When Not To Use

When strict low-latency or very low-cost inference is required and you cannot host a large LLM.

If you cannot access a suitable backbone LLM or must avoid proprietary/model-license constraints.

Failure Modes

Backbone LLM generates representations insensitive to fine numeric detail, hurting long-horizon precision if projection is misaligned.

Domain shift: prototypes learned on one domain may not generalize to very different series without additional data.

Core Entities

Models

Llama-7B, GPT-2, GPT4TS, LLMTime, PatchTST, TimesNet, DLinear, N-HiTS, N-BEATS, FEDformer, Autoformer

Metrics

MSE, MAE, SMAPE, MASE, OWA

Datasets

ETTh1, ETTh2, ETTm1, ETTm2, Weather, Electricity, Traffic, ILI, M4, M3-Quarterly

Benchmarks

long-term forecasting, short-term forecasting, few-shot forecasting, zero-shot forecasting