NinjaLLM: cost-focused RAG on AWS Trainium with near‑GPT‑4 accuracy

Overview

Decision Snapshot: Needs Validation

The system shows solid engineering and competitive QA accuracy with clear operational methods; evidence is limited to two QA benchmarks and implementation notes rather than broad evaluations.

Citations: 1

Evidence Strength: 0.60

Confidence: 0.60

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 6/6

Findings with evidence refs: 6/6

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 50%

Authors

Tengfei Xue, Xuefeng Li, Roman Smirnov, Tahir Azim, Arash Sadrieh, Babak Pahlavan

Why It Matters For Business

You can deploy competitive RAG QA at lower cloud cost by using Trainium/Inferentia2 plus serving optimizations, enabling cheaper fine-tuning and elastic scaling for production assistants.

Who Should Care

CTO, ML Engineer, Product Manager, Founder

Summary TLDR

This paper describes a practical RAG (retrieval-augmented generation) system built around a fine-tuned Llama3-Instruct 70B model hosted on AWS Trainium/Inferentia2 via SageMaker. It combines a lightweight fine-tune (≈20M tokens), vLLM hosting optimizations (paged attention, multi-bucketing, continuous batching), and ColBERTv2 retrieval to reach 62.22% on Natural Questions and 58.84% on HotPotQA. The authors emphasize lower hosting/fine-tuning cost and elastic scaling on AWS, with engineering changes aimed at reducing Time-to-First-Token and throughput waste.
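The retrieval component named above, ColBERTv2, ranks passages by late interaction: each query token embedding is matched to its most similar passage token embedding, and those maxima are summed. The following is a minimal pure-Python sketch of that scoring rule only. The toy 2-dimensional vectors and the names `maxsim_score` and `rank` are illustrative assumptions; a real ColBERTv2 setup uses a trained encoder and a compressed index.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two token embeddings."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late-interaction score: best-matching passage token per query token, summed."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs: List[Vector], docs: Dict[str, List[Vector]]) -> List[str]:
    """Return passage names sorted from highest to lowest score."""
    return sorted(docs, key=lambda name: maxsim_score(query_vecs, docs[name]), reverse=True)


# Toy embeddings (hypothetical values, one vector per token).
query = [(1.0, 0.0), (0.0, 1.0)]
passages = {
    "p1": [(1.0, 0.0), (0.0, 1.0)],  # has a match for both query tokens
    "p2": [(1.0, 0.0), (1.0, 0.0)],  # matches only the first query token
}
ranking = rank(query, passages)
print(ranking)
```

Because the score is computed per token, a passage that covers every query term outranks one that matches a single term strongly.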

Problem Statement

RAG systems often cost too much to train and serve on GPU infrastructure, struggle with citation and hallucination control, and need more efficient inference to be practical in production. The paper aims to build a cheaper, scalable RAG stack on AWS Trainium/Inferentia2 that keeps accuracy competitive while adding citations and hallucination checks.

Main Contribution

A deployed RAG stack using Llama3-Instruct 70B fine-tuned on Trainium and served via SageMaker.

Engineering techniques for serving large models: vLLM memory management, multi-bucketing, and token-level continuous batching.
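The vLLM memory management mentioned above rests on PagedAttention, which keeps the KV cache in fixed-size blocks and maps each sequence to its blocks through a block table, instead of reserving one contiguous max-length region per request. The toy allocator below illustrates that idea under stated assumptions: the class `PagedKVAllocator` and its methods are hypothetical and are not vLLM's API.

```python
class PagedKVAllocator:
    """Toy block allocator: sequences receive KV blocks on demand."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):          # 17 tokens need two 16-token blocks
    alloc.append_token("a")
for _ in range(5):           # 5 tokens need one block
    alloc.append_token("b")
print(alloc.block_tables, alloc.free_blocks)
```

In this sketch a 5-token sequence holds one block rather than a slab sized for the longest possible request, which is the waste the paged layout is meant to avoid.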

Key Findings

NinjaLLM matches or exceeds several open LLM baselines on two QA benchmarks.

Numbers: NQ 62.22%, HotPotQA 58.84% (Table 1)

Practical Use: You can run a Llama3-based RAG on Trainium with competitive QA accuracy versus Mixtral/DBRX while avoiding paid API costs.

Evidence Ref: Table 1

Performance is close to but below GPT‑4 Turbo on evaluated QA tasks.

Numbers: GPT‑4 Turbo: NQ 63.90%, HotPotQA 62.90% (Table 1)

Practical Use: If absolute top accuracy matters, GPT‑4 class models still lead; NinjaLLM is a cost-performance tradeoff.

Evidence Ref: Table 1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	62.22%	GPT-4 Turbo 63.90%	-1.68pp vs GPT-4 Turbo	Natural Questions (NQ Open)	Table 1 reports Ninja LLM 62.22% vs GPT‑4 Turbo 63.90%	Table 1
Accuracy	58.84%	GPT-4 Turbo 62.90%	-4.06pp vs GPT-4 Turbo	HotPotQA	Table 1 reports Ninja LLM 58.84% vs GPT‑4 Turbo 62.90%	Table 1
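The Delta column is a plain percentage-point difference between the two accuracies in each row. As a quick check, the snippet below recomputes it from the Table 1 figures quoted above; the dictionary layout and the helper name `delta_pp` are illustrative only.

```python
# Accuracies (%) as quoted from Table 1 in the rows above.
table_1 = {
    "Natural Questions": {"NinjaLLM": 62.22, "GPT-4 Turbo": 63.90},
    "HotPotQA": {"NinjaLLM": 58.84, "GPT-4 Turbo": 62.90},
}


def delta_pp(scores: dict, system: str = "NinjaLLM", baseline: str = "GPT-4 Turbo") -> float:
    """Percentage-point gap between system and baseline, rounded to two decimals."""
    return round(scores[system] - scores[baseline], 2)


deltas = {name: delta_pp(scores) for name, scores in table_1.items()}
for name, value in deltas.items():
    print(f"{name}: {value:+.2f}pp vs GPT-4 Turbo")
```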

What To Try In 7 Days

Run a small LIMA-style fine-tune (≈20M tokens) on Trainium to test model style and citation output.

Deploy vLLM with multi-bucketing and continuous batching on SageMaker to measure TTFT and throughput gains.

Swap retrieval to ColBERTv2 and compare end-to-end QA accuracy on a sample of your queries.
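For the retrieval comparison in the last step, a small harness is enough to score two pipelines on the same sample of queries. The sketch below is one way to do it, with loud assumptions: the scoring rule (a normalized gold answer must appear inside the prediction) may differ from the paper's metric, and `baseline_pipeline` and `candidate_pipeline` are hypothetical stand-ins for your own end-to-end systems.

```python
import re
import string
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_correct(prediction: str, gold_answers: List[str]) -> bool:
    """Assumed rule: any non-empty normalized gold answer is contained in the prediction."""
    pred = normalize(prediction)
    return any(normalize(g) and normalize(g) in pred for g in gold_answers)


def accuracy(answer_fn: Callable[[str], str], samples: List[Dict]) -> float:
    """Percentage of samples the pipeline answers correctly."""
    hits = sum(is_correct(answer_fn(s["question"]), s["answers"]) for s in samples)
    return 100.0 * hits / len(samples)


# Tiny illustrative sample; replace with your own queries and gold answers.
samples = [
    {"question": "Who wrote Hamlet?", "answers": ["William Shakespeare", "Shakespeare"]},
    {"question": "What is the capital of France?", "answers": ["Paris"]},
]


def baseline_pipeline(question: str) -> str:  # hypothetical: current retrieval stack
    return "I am not sure."


def candidate_pipeline(question: str) -> str:  # hypothetical: stack with swapped retriever
    canned = {
        "Who wrote Hamlet?": "Hamlet was written by Shakespeare.",
        "What is the capital of France?": "The capital is Paris.",
    }
    return canned[question]


print(accuracy(baseline_pipeline, samples), accuracy(candidate_pipeline, samples))
```

Keeping the sample and scoring rule fixed across both runs means any accuracy difference can be attributed to the retriever swap rather than to the evaluation.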

Agent Features

Tool Use

External retrieval and citation

Runtime response checking
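This summary does not describe how the runtime response check works. One simple form such a check could take, assuming answers cite retrieved passages with bracketed indices such as [1], is to require that every sentence carries a citation and that every cited index points at a passage that was actually retrieved. The citation format and the function name `check_citations` are assumptions for illustration.

```python
import re
from typing import List

CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, num_passages: int) -> List[str]:
    """Return a list of problems; an empty list means the answer passes."""
    problems = []
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = [int(n) for n in CITATION.findall(sentence)]
        if not cited:
            problems.append(f"uncited sentence: {sentence!r}")
        for n in cited:
            if not 1 <= n <= num_passages:
                problems.append(f"citation [{n}] has no matching passage")
    return problems


good = "Paris is the capital of France [1]. The Louvre is in Paris [2]."
bad = "Paris is the capital of France [1]. It is very old. The Louvre is there [3]."
print(check_citations(good, num_passages=2))
print(check_citations(bad, num_passages=2))
```

A check like this only verifies that citations exist and resolve; it does not confirm that the cited passage supports the claim, which is consistent with residual hallucinations remaining a listed failure mode.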

Frameworks

SageMaker, vLLM

Optimization Features

Token Efficiency

Bucket selection to reduce prefill tokens
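Bucket selection can be sketched as picking the smallest precompiled sequence-length bucket that fits the prompt, so prefill is padded to that bucket rather than to the maximum length. The bucket sizes and function names below are hypothetical; the summary does not state the values the authors used.

```python
from bisect import bisect_left
from typing import List

# Hypothetical sorted bucket sizes (tokens); real values depend on the compiled model.
BUCKETS = [128, 512, 1024, 2048, 4096, 8192]


def select_bucket(prompt_len: int, buckets: List[int] = BUCKETS) -> int:
    """Return the smallest bucket that can hold the prompt."""
    i = bisect_left(buckets, prompt_len)
    if i == len(buckets):
        raise ValueError(f"prompt of {prompt_len} tokens exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def padding_saved(prompt_len: int, buckets: List[int] = BUCKETS) -> int:
    """Prefill positions avoided compared with always padding to the largest bucket."""
    return buckets[-1] - select_bucket(prompt_len, buckets)


for n in (300, 512, 513):
    print(n, select_bucket(n), padding_saved(n))
```

With these assumed buckets, a 300-token prompt is prefilled at 512 positions instead of 8192.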

Infra Optimization

Deploy on AWS Trainium / Inferentia2 for cost-effective compute

Model Optimization

Fine-tune Llama3-Instruct 70B on Trainium

System Optimization

Use SageMaker for autoscaling and A/B testing

Training Optimization

LIMA-style small-sample fine-tuning (≈20M tokens)

Elastic TRN1 clusters to handle bursty trials

Inference Optimization

vLLM: PagedAttention and block-level memory

Multi-bucketing to avoid max-length prefill

Token-level continuous batching to improve throughput
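The throughput benefit of token-level continuous batching can be seen in a toy simulation: with static batching a batch runs until its longest request finishes, whereas continuous batching admits a waiting request as soon as any slot frees up. The sketch below counts decode steps only and ignores prefill cost; the function names and request lengths are illustrative assumptions, not measurements from the paper.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Each fixed batch occupies the device until its longest request is done."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Waiting requests fill a slot as soon as a running request finishes."""
    waiting = deque(output_lens)  # tokens still to generate per request (each >= 1)
    running: List[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step: every running request emits one token
        running = [r - 1 for r in running if r > 1]
    return steps


lens = [2, 8, 2, 8]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
print(static_steps, continuous_steps)
```

In this toy workload the short requests no longer hold a slot idle while a long neighbour finishes, so the same four requests complete in fewer decode steps.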

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Risks & Boundaries

Limitations

Benchmarks limited to NQ and HotPotQA; no user studies or wider domain tests.

No public code or reproducible deployment scripts provided.

When Not To Use

When you need the absolute top benchmark leader (GPT‑4 Turbo still leads).

If you cannot use AWS Trainium/Inferentia2 or SageMaker.

Failure Modes

Residual hallucinations despite response checks.

Fine-tune instability due to small sample distributions; requires multiple trials.

Core Entities

Models

Llama3-Instruct-70B, GPT-4 Turbo, DBRX, Mixtral Instruct, GPT-3.5 Turbo, Llama2-70B

Metrics

Accuracy, Time-to-First-Token, throughput, latency

Datasets

Natural Questions, HotPotQA, Wikipedia corpus

Benchmarks

Natural Questions, HotPotQA
