TigerBot: an openly released 7B–180B multilingual LLM family with emphasis on Chinese, low training cost, long context and practical tools

December 14, 2023 · 8 min

Overview

Decision Snapshot: Needs Validation

Scores reflect solid engineering, empirical benchmark gains, clear deployment tooling, and quantization wins. Novelty is moderate because the methods combine existing ingredients. Evidence comes from internal benchmarks and engineering reports, with only partial public reproducibility.

Citations: 2

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Partial assets available

Open source: Partial

License: Apache-2.0 (note: model continued from Llama-2/BLOOM; check upstream licenses)

At A Glance

Cost impact: 70%

Production readiness: 80%

Novelty: 50%

Authors

Ye Chen, Wei Cai, Liangmin Wu, Xiaowei Li, Zhanxuan Xin, Cong Fu

Why It Matters For Business

TigerBot reports stronger Chinese and competitive English benchmark performance relative to Llama-2, together with practical tooling (APIs, plugins, long context, function calling) and a low claimed training cost. That combination makes it a candidate for production chat, document QA, and device embedding.

Summary TLDR

TigerBot is an open-source family of decoder-only LLMs (7B, 13B, 70B, 180B) built mainly from Llama-2 and BLOOM. The team focused on high-quality multilingual data (zh:en ≈ 5:5), efficient training/inference, instruction alignment (SFT + RLHF/DPO), long-context extrapolation to 32k tokens, and a tool stack (plugins, search, function calling). On their benchmarks TigerBot outperforms comparable open models (roughly +4.3 points English chat average; +13.0 points Chinese base average). They released models and tooling under Apache-2.0 but note license/continuation caveats from upstream models.

Problem Statement

Open LLMs often lag in non-English coverage, cost-effective training, deployment tooling, long-context handling, and practical safety mechanisms. The goal is to deliver competitive open multilingual models while keeping training cost and infrastructure affordable and providing practical application tools.

Main Contribution

A released family of open LLMs at 7B/13B/70B/180B with base and chat variants and plugin/API support.

A curated multilingual pretraining mix (~500B tokens; zh:en ≈ 5:5) and 5M SFT + 15k RLHF comparison data for alignment.

Key Findings

TigerBot improves over Llama-2 on evaluated benchmarks.

Numbers: English chat avg 69.87 vs 65.62 (+4.25 points); Chinese base avg 65.26 vs 52.27 (+12.99 points)

Practical Use: Expect better out-of-the-box English and notably better Chinese performance than Llama-2 on standard benchmarks; test on your tasks before deployment.

Evidence Ref: Tables 3 & 4

Quantized TigerBot models give major resource wins with little accuracy loss.

Numbers: Up to …× speedup and …× memory reduction using 4-bit ExLlamaV2 quantization

Practical Use: Use 4-bit static quantization for faster, cheaper serving; validate accuracy on your use cases.

Evidence Ref: Quantization section
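
A rough, weights-only estimate shows why 4-bit quantization can cut serving memory. The sketch below is back-of-envelope arithmetic, not the report's measurement: it assumes nominal parameter counts (7B, 13B, 70B, 180B) and ignores quantization scales, activations, and the KV cache, so real ExLlamaV2 footprints will differ.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal model sizes in billions of parameters (an assumption; actual counts vary).
for size_b in (7, 13, 70, 180):
    n = size_b * 1e9
    fp16 = weight_memory_gb(n, 16)  # 16-bit baseline
    int4 = weight_memory_gb(n, 4)   # 4-bit static quantization
    print(f"{size_b}B: fp16 ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB, ratio {fp16 / int4:.0f}x")
```

Under these assumptions a 70B model drops from roughly 140 GB to 35 GB of weights, which is the difference between multi-GPU and single-node serving on many setups.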

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
English chat average (selected benchmarks) | 69.87 | Llama-2 chat average 65.62 | +4.25 | Aggregated chat benchmarks in Table 4 | Table 4, TigerBot 70B-chat vs Llama-2 70B-chat | Table 4
Chinese base average (selected benchmarks) | 65.26 | Llama-2 base average 52.27 | +12.99 | Aggregated base benchmarks in Table 3 | Table 3, TigerBot 70B-base vs Llama-2 70B-base | Table 3
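
The deltas in these results can be re-derived from the reported averages. The short check below uses only the numbers quoted here; the relative-gain helper is an added convenience for comparing the two improvements on a common scale and is not a metric from the report.

```python
# Reported averages from the results above: (TigerBot, Llama-2 baseline).
reported = {
    "English chat average": (69.87, 65.62),
    "Chinese base average": (65.26, 52.27),
}


def delta_points(model: float, baseline: float) -> float:
    """Absolute difference in benchmark points, rounded to 2 decimals."""
    return round(model - baseline, 2)


def relative_gain_pct(model: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent (1 decimal)."""
    return round(100.0 * (model - baseline) / baseline, 1)


for name, (model, baseline) in reported.items():
    print(name, delta_points(model, baseline), relative_gain_pct(model, baseline))
```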

What To Try In 7 Days

Run TigerBot-13B chat on your Chinese FAQs and compare its answers against your current model.

Quantize a TigerBot model with ExLlamaV2 and measure latency/memory gains on target infra.

Test 32k long-context QA on a representative document to see whether it can replace a two-stage retrieval pipeline.
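
For the latency comparison, a small timing harness keeps the before/after measurements consistent. This is a minimal sketch: `measure_latency` and `fake_generate` are hypothetical names, and `fake_generate` is only a stand-in that should be replaced with a call into whatever serving stack is being tested.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(generate: Callable[[str], str], prompts: List[str],
                    warmup: int = 1, repeats: int = 3) -> Dict[str, float]:
    """Time generate() over the prompts and return latency stats in seconds."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed
    samples = []
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter()
            generate(prompt)
            samples.append(time.perf_counter() - start)
    return {
        "n": float(len(samples)),
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in; replace with a real model or API call."""
    time.sleep(0.001)
    return prompt.upper()


stats = measure_latency(fake_generate, ["你好", "What is TigerBot?"])
print(stats)
```

Running the same prompts through the unquantized and quantized models, on the same hardware, gives a like-for-like comparison; memory should be read from the serving process or GPU tooling separately.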

Agent Features

Memory
long-context up to 32k tokens
Tool Use
search plugin; document plugin; function calling; image in/out
Frameworks
Megatron-DeepSpeed; HuggingFace Transformers; TGI / vLLM
Architectures
decoder-only transformer; RoPE; ALiBi; grouped-query attention (GQA)
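
Long context and grouped-query attention interact through the KV cache, which grows linearly with sequence length and with the number of key/value heads. The estimate below is a sketch under stated assumptions: the layer and head counts are modeled on the public Llama-2-70B configuration, not taken from the TigerBot report, and values are stored in 16 bits.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes to cache keys and values for one sequence (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shapes, modeled on the public Llama-2-70B config (an assumption).
N_LAYERS, N_HEADS, N_KV_HEADS, HEAD_DIM = 80, 64, 8, 128
SEQ_LEN = 32 * 1024  # 32k-token context

gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)
mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, SEQ_LEN)
GIB = 1024 ** 3
print(f"GQA: {gqa / GIB:.1f} GiB, full multi-head: {mha / GIB:.1f} GiB, ratio {mha // gqa}x")
# prints "GQA: 10.0 GiB, full multi-head: 80.0 GiB, ratio 8x"
```

With these shapes, sharing each key/value head across eight query heads is what keeps a single 32k-token sequence within the memory of one accelerator.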

Optimization Features

Token Efficiency
Holistic pretraining: 2–5% instruction-like data mixed in; SFT
Infra Optimization
Cluster: 512× A100-40G GPUs with NVLink and RoCE; per-node 1024GB RAM to avoid CPU offload
Model Optimization
FlashAttention; grouped-query attention (GQA); SwiGLU activations
System Optimization
private Megatron-DeepSpeed fork with weight conversion scripts; tuned TP/PP/DP configs (e.g., TP=2, PP=8, DP=2 for 13B)
Training Optimization
3D parallelism (TP/PP/DP/ZeRO); pipeline partition algorithm; gradient accumulation and checkpointing; mixed bfloat16 training
Inference Optimization
Static 4-bit quantization (ExLlamaV2); dynamic W8A16 quantization; optimized inference engines (TGI, vLLM) with KV cache
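
The quoted 13B configuration (TP=2, PP=8, DP=2) can be unpacked with simple arithmetic: the three degrees multiply to give the GPUs one job occupies, and TP × PP gives the number of shards each model replica is split into. The sketch below assumes a nominal 13e9 parameters split evenly across shards and counts bf16 weights only; optimizer state, gradients, and activations (and any ZeRO sharding of them) are left out.

```python
def world_size(tp: int, pp: int, dp: int) -> int:
    """GPUs occupied by one job: tensor x pipeline x data parallel degrees."""
    return tp * pp * dp


def params_per_model_shard(n_params: float, tp: int, pp: int) -> float:
    """Rough parameters held per GPU, assuming an even split across TP and PP."""
    return n_params / (tp * pp)


tp, pp, dp = 2, 8, 2  # example config quoted for the 13B model
gpus = world_size(tp, pp, dp)
shard = params_per_model_shard(13e9, tp, pp)  # 13e9 is a nominal count (assumption)
print(f"GPUs used: {gpus}, ~{shard / 1e9:.2f}B params per GPU, "
      f"{shard * 2 / 1e9:.2f} GB of bf16 weights per GPU")
```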

Reproducibility

Code Available: Yes
Data Available: No
Open Source Status: Partial
License: Apache-2.0 (note: model continued from Llama-2/BLOOM; check upstream licenses)

Risks & Boundaries

Limitations

Some training data and preprocessing are proprietary, limiting full replication.

The 180B model is initialized from BLOOM and the others from Llama-2; check upstream license constraints.

When Not To Use

Mission-critical systems requiring certified guarantees or formal verification

Regulated use-cases where provenance and full dataset transparency are required

Failure Modes

Hallucinations on unsupported facts or when retrieval/filtering fails

Performance depends on data quality; a few bad examples can degrade outputs

Core Entities

Models

TigerBot-7B, TigerBot-13B, TigerBot-70B, TigerBot-180B, Llama-2 (continual pretraining source), BLOOM (pretraining source for 180B)

Metrics

Accuracy, HumanEval pass-rate (reported as %/points), inference speedup (×), memory reduction (×)

Datasets

RefinedWeb, C4, OpenWebText, BookCorpus, WuDao, WanJuan, GitHub, Stack Overflow, arXiv, English/Wikipedia, Chinese Baike

Benchmarks

HumanEval, MMLU, GSM8K, PIQA, HellaSwag, BoolQ, CMRC, C-EVAL, OCNLI