One-click evaluation, automated ensembles, and LLM-powered Q&A for time series forecasting

Overview

Decision Snapshot: Needs Validation

System demo built on an existing open benchmark with concrete modules, but no end-to-end production deployment evidence is provided.

Citations: 0

Evidence Strength: 0.60

Confidence: 0.70

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 0/3

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 50%

Authors

Xiangfei Qiu, Xiuwen Li, Ruiyang Pang, Zhicheng Pan, Xingjian Wu, Liu Yang, Jilin Hu, Yang Shu, Xuesong Lu, Chengcheng Yang, Chenjuan Guo, Aoying Zhou, Christian S. Jensen, Bin Yang

Why It Matters For Business

EasyTime speeds method evaluation and selection by reusing a large benchmark and automating ensembles, reducing experiment time and guesswork for forecasting projects.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO

Summary TLDR

EasyTime is a demo system that makes time-series forecasting easier for researchers and practitioners. It builds on the TFB benchmark (25 multivariate and 8,068 univariate datasets) and 30+ methods to provide: one-click evaluation for new algorithms, an Automated Ensemble that recommends and combines top methods using TS2Vec + a method-ranking classifier, and a natural-language Q&A module that converts questions to SQL, retrieves benchmark results, and returns answers, charts, and SQL. The system is aimed at faster evaluation, easier model selection, and interactive exploration of historical benchmarking evidence.

Problem Statement

Researchers and practitioners face three pain points: (1) evaluating forecasting methods comprehensively is time-consuming and error-prone; (2) picking suitable methods for a new dataset is hard because no single method wins everywhere; (3) querying benchmarking results or getting practical guidance requires technical effort or expert knowledge.

Main Contribution

A one-click evaluation layer that runs a method across TFB’s diverse datasets and standardized pipelines.

An Automated Ensemble module that uses TS2Vec features and a pre-trained classifier (soft-label loss) to recommend top-k methods and learn ensemble weights on the target data.
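The recommend-then-combine flow described above can be sketched roughly as follows. This is a minimal illustration, not EasyTime's implementation: the classifier scores are made-up numbers standing in for the pre-trained classifier's output on TS2Vec features, the function names are hypothetical, and the weight fit uses clipped, normalized least squares as an assumed stand-in for however the system actually learns ensemble weights.

```python
import numpy as np

def top_k_methods(scores, names, k):
    """Return the k method names with the highest classifier scores."""
    order = np.argsort(scores)[::-1][:k]
    return [names[i] for i in order]

def fit_ensemble_weights(preds, y):
    """Fit combination weights on held-out target data.

    preds: (n_samples, k) forecasts, one column per candidate method.
    y:     (n_samples,) ground-truth values.
    Assumption: least squares, then clip to non-negative and normalize to sum to 1.
    """
    w, *_ = np.linalg.lstsq(preds, y, rcond=None)
    w = np.clip(w, 0.0, None)
    if w.sum() == 0:
        # Fall back to a plain average if every weight was clipped away.
        return np.full(preds.shape[1], 1.0 / preds.shape[1])
    return w / w.sum()

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical classifier output for four candidate methods.
names = ["DLinear", "PatchTST", "TimesNet", "NLinear"]
scores = np.array([0.10, 0.50, 0.30, 0.10])
candidates = top_k_methods(scores, names, k=2)
```

In use, each recommended candidate would be run on a validation split of the target series, the forecasts stacked column-wise into `preds`, and `mae(y, preds @ w)` compared against each single method's MAE.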

Key Findings

TFB contains broad data and precomputed results.

Numbers: 25 multivariate datasets; 8,068 univariate datasets; 8,000+ series with results

Practical Use: You can evaluate methods across many domains without building datasets from scratch.

Evidence Ref: Section II.A

Benchmark includes many methods and accumulated results.

Numbers: 30+ TSF methods evaluated on 8,000+ time series

Practical Use: Use the benchmark results as a knowledge base to guide method choice and form ensembles.

Evidence Ref: Abstract, Section II.A

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Coverage	30+ methods evaluated	—	—	TFB (8,000+ series)	Benchmark results collected from evaluating 30+ methods on 8,000+ time series	II.A
Dataset counts	25 multivariate; 8,068 univariate	—	—	TFB	Data layer includes 25 multivariate and 8,068 univariate datasets	II.A

What To Try In 7 Days

Embed one of your models into the TFB pipeline and run one-click evaluation.

Upload a real dataset and click 'Recommend Method' to see top-k candidates.

Try the 'AutoML' ensemble flow to compare ensemble vs single methods on your data.

Agent Features

Memory

benchmark knowledge base

Tool Use

LLM for NL2SQL, SQL database for retrieval, TS2Vec for feature extraction

Frameworks

TFB, Darts, TSLib

Architectures

offline pretraining + online inference

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://github.com/decisionintelligence/TFB

Risks & Boundaries

Limitations

Quality of recommendations depends on TFB’s dataset coverage; out-of-distribution datasets may not be well served.

The Automated Ensemble relies on a classifier trained on historical benchmark results, so it may misrank methods whose behavior is not reflected in those results.

When Not To Use

When your dataset has patterns not represented in TFB and bespoke modeling is required.

When strict, validated probabilistic forecasting guarantees are required beyond empirical ensembles.

Failure Modes

Classifier recommends poor models for truly novel datasets.

Generated SQL is incorrect and returns misleading data if verification fails.
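One way to reduce the SQL failure mode is to check generated queries before running them. The sketch below is an assumption-laden illustration, not EasyTime's verification step: it uses Python's built-in `sqlite3` module, a hypothetical `results` table, and a hypothetical `validate_select` helper that accepts only a single `SELECT` statement that SQLite can plan against the schema.

```python
import sqlite3

def validate_select(conn, sql):
    """Return (ok, reason) for a generated query.

    Accepts only one read-only SELECT that SQLite can compile.
    Conservative: any ';' inside the statement is rejected, even in a string literal.
    """
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        return False, "multiple statements"
    if not stmt.lower().startswith("select"):
        return False, "not a SELECT"
    try:
        # Planning the query surfaces unknown tables and columns without fetching rows.
        conn.execute("EXPLAIN QUERY PLAN " + stmt)
    except sqlite3.Error as exc:
        return False, f"invalid SQL: {exc}"
    return True, "ok"

# Hypothetical benchmark-results schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (method TEXT, dataset TEXT, mae REAL)")
conn.execute("INSERT INTO results VALUES ('DLinear', 'traffic', 0.42)")
```

A check like this catches malformed or schema-mismatched queries only. A query that is valid but answers a different question than the user asked would still pass, which is why returning the SQL alongside the answer remains useful.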

Core Entities

Models

TS2Vec, NLinear, PatchTST, FiLM, TimesNet, DLinear, Linear, MICN

Metrics

MAE, custom metrics (supported)

Datasets

TFB, traffic, electricity, energy, environment, nature, economic, stock, banking, health, web

Benchmarks

TFB
