Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.7
Citation Count
5
Why It Matters For Business
CEBench helps teams choose LLM deployments that meet accuracy needs while controlling real costs. It compares local and online models, compares RAG with few-shot prompting, and recommends Pareto-efficient plans.
Summary TLDR
CEBench is an open-source, configuration-driven toolkit for benchmarking LLM pipelines across both effectiveness (accuracy, F1, MAE) and operational cost (latency, memory, estimated $/prompt). It automates data loading, RAG integration, and metric logging, and it outputs Pareto-efficient deployment plans. Two demos, mental-health scoring (local LLMs + RAG) and contract review (online LLMs + RAG), show how RAG and model size affect accuracy, latency, and cost, and how the tool recommends trade-offs.
Problem Statement
Practitioners must pick LLM deployment plans that balance model quality and real-world costs. Existing toolkits focus on accuracy but ignore deployment cost, RAG integration, and multi-objective trade-offs, forcing repeated coding and ad-hoc cost estimates.
Main Contribution
An open-source, zero-code toolkit (CEBench) to run batch, multi-objective LLM benchmarks from configuration files.
End-to-end benchmarking that includes local LLMs, RAG pipelines, prompt variants, resource monitoring, and estimated monetary costs.
A plan recommender that builds Pareto fronts over cost, latency, and effectiveness to suggest deployment options.
Two concrete use cases (mental-health scoring and contract review) that demonstrate cost-effectiveness trade-offs between local and online LLMs.
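The plan recommender's Pareto-front idea can be illustrated with a minimal sketch. This is not CEBench's implementation: the `Plan` type, the plan names, and all numbers below are hypothetical, and effectiveness is expressed as an error value so that every objective is "lower is better".

```python
from typing import NamedTuple


class Plan(NamedTuple):
    name: str
    cost: float     # estimated $ per 1k prompts (lower is better)
    latency: float  # end-to-end seconds (lower is better)
    error: float    # e.g. MAE or 1 - F1 (lower is better)


def dominates(a: Plan, b: Plan) -> bool:
    """True if a is no worse than b on every objective and strictly better on at least one."""
    a_obj, b_obj = a[1:], b[1:]
    no_worse = all(x <= y for x, y in zip(a_obj, b_obj))
    strictly_better = any(x < y for x, y in zip(a_obj, b_obj))
    return no_worse and strictly_better


def pareto_front(plans: list[Plan]) -> list[Plan]:
    """Keep only the plans that no other plan dominates."""
    return [p for p in plans if not any(dominates(q, p) for q in plans)]


# Hypothetical values for illustration only; these are not results from the paper.
plans = [
    Plan("small-local", cost=1.0, latency=2.0, error=0.30),
    Plan("large-local", cost=8.0, latency=9.0, error=0.20),
    Plan("online-rag", cost=2.0, latency=1.5, error=0.22),
    Plan("online-fewshot", cost=3.0, latency=2.5, error=0.25),
]
front = pareto_front(plans)
print([p.name for p in front])
```

In this toy data, "online-fewshot" is dropped because "online-rag" is better on all three objectives; the remaining plans each win on at least one objective, so choosing among them is a business decision rather than a benchmarking one.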
Key Findings
Lightweight online model plus RAG yields very high accuracy at minimal cost.
RAG often reduces end-to-end latency vs few-shot prompting by shortening input size.
The large local model gives the best accuracy but costs more; smaller local models are far cheaper with a small quality loss.
Embedding quantization had little effect on pipeline effectiveness in the mental-health demo.
Latency-to-cost estimates are coarse when based only on GPU TFLOPs ratios.
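To make the last finding concrete, a TFLOPs-ratio estimate might look like the sketch below. The function names, the linear scaling rule, and the cost formula (one prompt at a time on a fully billed instance) are assumptions for illustration, not CEBench's documented method; the numbers are made up.

```python
def scale_latency(measured_s: float, bench_tflops: float, target_tflops: float) -> float:
    """Estimate latency on target hardware by scaling with the TFLOPs ratio.

    Assumes latency is inversely proportional to peak TFLOPs, which ignores
    memory bandwidth, batching, and other system-level effects.
    """
    if bench_tflops <= 0 or target_tflops <= 0:
        raise ValueError("TFLOPs values must be positive")
    return measured_s * bench_tflops / target_tflops


def cost_per_1k_prompts(latency_s: float, hourly_price_usd: float) -> float:
    """Estimated $ per 1k prompts, assuming prompts run sequentially on one instance."""
    return latency_s * 1000 / 3600 * hourly_price_usd


# Hypothetical inputs: 4.0 s measured on a 40-TFLOPs GPU, target instance rated at 80 TFLOPs.
est_latency = scale_latency(4.0, bench_tflops=40.0, target_tflops=80.0)
est_cost = cost_per_1k_prompts(est_latency, hourly_price_usd=3.6)
print(est_latency, est_cost)
```

Because the estimate depends on a single hardware number, it is best treated as a way to shortlist instances; the shortlisted plan still needs to be measured on the target hardware.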
Results
ContractNLI F1 (RAG)
ContractNLI latency (avg)
Mental-health MAE vs estimated cost
Model MAE / Specificity (selected local models)
Who Should Care
What To Try In 7 Days
Install CEBench and run one config comparing a small local model and a small online model on your task.
Benchmark both few-shot and a simple RAG pipeline to measure token lengths, latency, and cost per prompt.
Use the plan recommender to generate a Pareto front and validate the top candidate on target hardware.
Optimization Features
Token Efficiency
- RAG reduces input token length vs few-shot
Infra Optimization
- map benchmarked GPU TFLOPs to target instance TFLOPs
Model Optimization
- quantization (scalar, product, none)
System Optimization
- estimate latency via TFLOPs ratio for instance selection
Inference Optimization
- choose smaller models for latency/cost trade-offs
- RAG to reduce input tokens
Reproducibility
Data Urls
- DAIC-WOZ (public dataset)
- ContractNLI (public dataset)
Code Available
Data Available
Open Source Status
- yes
Risks & Boundaries
Limitations
- Latency and cost estimates are coarse because they are based on GPU TFLOPs ratios, which miss system-level differences.
- Benchmarks reflect chosen datasets and configurations (ContractNLI, DAIC-WOZ); results may not transfer to other tasks.
- Large models can hit out-of-memory errors during prompt loading; real deployments may need memory tuning or model sharding.
When Not To Use
- When you require highly accurate latency prediction for specific hardware (CEBench uses TFLOPs-based estimates).
- When the evaluation focuses only on raw model research (CEBench emphasizes cost-effectiveness over pure model-score analysis).
- If your pipeline cannot use vector DBs or RAG workflows (CEBench centers RAG support).
Failure Modes
- Cost/latency recommendations can be off when the TFLOPs-based latency estimate does not match the target hardware.
- Out-of-memory during model loading for very large models (example: mixtral:8x22b OOM).
- RAG settings (chunk size, top-k) may harm some models' effectiveness if tuned without task validation.
Core Entities
Models
- llama3:8b
- llama3:70b
- mixtral:8x7b
- mixtral:8x22b
- llama2:7b
- llama2:13b
- llama2:70b
- Haiku
- Sonnet
- Opus
- GPT-4
Metrics
- MAE
- F1
- Specificity
- End-to-End Latency (s)
- Cost per 1k prompts ($/1k)
- TFLOPs
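For reference, the three effectiveness metrics above have the following standard definitions, shown here as a plain-Python sketch. The binary form of F1 and specificity is used for simplicity; how CEBench averages F1 over multiple classes is not stated in this summary, and the sample labels are invented.

```python
def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true: list[int], y_pred: list[int], positive: int = 1):
    """Return (tp, fp, tn, fn) for a binary task."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def specificity(y_true: list[int], y_pred: list[int]) -> float:
    """Specificity = TN / (TN + FP); returns 0.0 when there are no negatives."""
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp) if (tn + fp) else 0.0


# Invented labels for illustration only.
labels_true = [1, 0, 1, 0, 0]
labels_pred = [1, 0, 0, 1, 0]
print(f1_score(labels_true, labels_pred), specificity(labels_true, labels_pred))
print(mae([3, 5, 10], [4, 5, 7]))
```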
Datasets
- DAIC-WOZ
- ContractNLI
Benchmarks
- ContractNLI
- Mental-health questionnaire scoring

