NinjaLLM: cost-focused RAG on AWS Trainium with near‑GPT‑4 accuracy

July 11, 2024 · 7 min

Overview

Production Readiness

0.7

Novelty Score

0.5

Cost Impact Score

0.8

Citation Count

1

Authors

Tengfei Xue, Xuefeng Li, Roman Smirnov, Tahir Azim, Arash Sadrieh, Babak Pahlavan

Why It Matters For Business

You can deploy competitive RAG QA at lower cloud cost by using Trainium/Inferentia2 plus serving optimizations, enabling cheaper fine-tuning and elastic scaling for production assistants.

Summary TLDR

This paper describes a practical RAG (retrieval-augmented generation) system built around a fine-tuned Llama3-Instruct 70B model hosted on AWS Trainium/Inferentia2 via SageMaker. It combines a lightweight fine-tune (≈20M tokens), vLLM hosting optimizations (paged attention, multi-bucketing, continuous batching), and ColBERTv2 retrieval to reach 62.22% accuracy on Natural Questions and 58.84% on HotPotQA. The authors emphasize lower hosting and fine-tuning cost and elastic scaling on AWS, with engineering changes aimed at reducing Time-to-First-Token (TTFT) and wasted compute.

Problem Statement

RAG systems often cost too much to train and serve on GPU infrastructure, struggle with citation and hallucination control, and need more efficient inference to be practical in production. The paper aims to build a cheaper, scalable RAG stack on AWS Trainium/Inferentia2 that keeps accuracy competitive while adding citations and hallucination checks.

Main Contribution

A deployed RAG stack using Llama3-Instruct 70B fine-tuned on Trainium and served via SageMaker.

Engineering techniques for serving large models: vLLM memory management, multi-bucketing, and token-level continuous batching.

A switch to ColBERTv2 for passage ranking, with the top 10 passages passed into the RAG pipeline.

Cost-focused fine-tuning workflow (≈20M tokens, LIMA-style) reported as inexpensive and fast on TRN1 instances.

Safety-focused additions: format enforcement, prompt engineering, and model response checking to reduce hallucinations and unsafe answers.

Key Findings

NinjaLLM matches or exceeds several open LLM baselines on two QA benchmarks.

Numbers: NQ 62.22%, HotPotQA 58.84% (Table 1)

Performance is close to but below GPT‑4 Turbo on evaluated QA tasks.

Numbers: GPT‑4 Turbo: NQ 63.90%, HotPotQA 62.90% (Table 1)
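
To make the gap concrete, the Table 1 figures quoted above can be compared directly. The sketch below is plain arithmetic on those four numbers; the function and variable names are illustrative and not from the paper.

```python
# Accuracy figures quoted from Table 1 (percent).
ninja = {"NQ": 62.22, "HotPotQA": 58.84}
gpt4_turbo = {"NQ": 63.90, "HotPotQA": 62.90}

def gap_report(model, baseline):
    """Per benchmark: (absolute gap in points, model accuracy as % of baseline)."""
    report = {}
    for name in model:
        gap = round(baseline[name] - model[name], 2)
        ratio = round(100 * model[name] / baseline[name], 1)
        report[name] = (gap, ratio)
    return report

report = gap_report(ninja, gpt4_turbo)
for name, (gap, ratio) in report.items():
    print(f"{name}: {gap} points behind, {ratio}% of baseline accuracy")
```

On these figures the gap is 1.68 points on NQ and 4.06 points on HotPotQA, so the multi-hop benchmark is where the distance to GPT‑4 Turbo is larger.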

Fine-tuning can be fast and relatively cheap when using small, targeted datasets.

Numbers: ≈20M tokens fine-tuned in <3 hours on 32 TRN1 instances; sample-tune cost < $1,000; full iterative effort < $30,000
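
A rough way to sanity-check these figures against your own budget is to convert them to instance-hours. The sketch below assumes a per-instance hourly price that is purely hypothetical (the summary gives no price breakdown); only the 32 instances, the 3-hour bound, and the $30,000 total come from the text above.

```python
def instance_hours(num_instances, wall_clock_hours):
    """Total instance-hours consumed by one fine-tuning run."""
    return num_instances * wall_clock_hours

def run_cost(num_instances, wall_clock_hours, price_per_instance_hour):
    """Cost in USD of one run at a given hourly price."""
    return instance_hours(num_instances, wall_clock_hours) * price_per_instance_hour

def runs_within_budget(budget, cost_per_run):
    """How many full runs fit inside a total budget."""
    return int(budget // cost_per_run)

# From the summary: 32 TRN1 instances, under 3 hours per run.
max_hours = instance_hours(32, 3)  # upper bound on instance-hours per run

# HYPOTHETICAL price for illustration only; check current AWS pricing.
assumed_price = 10.0
cost = run_cost(32, 3, assumed_price)
print(max_hours, cost, runs_within_budget(30_000, cost))
```

Whatever the real price, one run is bounded by 96 instance-hours, which is the number to plug into your own pricing.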

Serving optimizations reduce wasted compute and TTFT for long-context Llama3 models.

Numbers: PagedAttention, multi-bucketing (avoids padding every prefill to the 8192-token maximum), continuous batching (Section 3)
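
The idea behind multi-bucketing can be sketched in a few lines: pick the smallest precompiled sequence length that fits the prompt instead of always padding to the maximum. The bucket sizes below are hypothetical; only the 8192 maximum is taken from the summary, and this is not the authors' implementation.

```python
import bisect

# HYPOTHETICAL bucket sizes; only the 8192 maximum comes from the summary.
BUCKETS = [128, 512, 1024, 2048, 4096, 8192]

def pick_bucket(prompt_len, buckets=BUCKETS):
    """Return the smallest bucket that fits the prompt."""
    i = bisect.bisect_left(buckets, prompt_len)
    if i == len(buckets):
        # Refuse rather than silently truncate an over-long prompt.
        raise ValueError(f"prompt of {prompt_len} tokens exceeds max bucket {buckets[-1]}")
    return buckets[i]

def padding_saved(prompt_len, buckets=BUCKETS):
    """Padded positions avoided versus always prefilling to the largest bucket."""
    return buckets[-1] - pick_bucket(prompt_len, buckets)

print(pick_bucket(700), padding_saved(700))
```

With these assumed buckets, a 700-token prompt is prefilled at length 1024 rather than 8192. Raising an error on over-long prompts is a design choice in this sketch, so that a bucket mismatch never truncates context unnoticed.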

Retrieval ranking choice affects pipeline accuracy.

Numbers: ColBERTv2 used for filtering/ranking vs bge-large used in prior work (Section 5)
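
ColBERT-family rankers score a passage by late interaction: each query token is matched to its most similar passage token, and those maxima are summed (MaxSim). The sketch below shows only that scoring rule on made-up 2-d vectors; a real ColBERTv2 setup uses a learned encoder and a compressed index, and the helper names here are illustrative.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token take the best-matching
    document token (max dot product), then sum over query tokens."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def top_k(query_vecs, passages, k=10):
    """Rank passages (id -> token vectors) by MaxSim and keep the top k ids."""
    ranked = sorted(
        passages,
        key=lambda pid: maxsim_score(query_vecs, passages[pid]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-d token embeddings, made up for illustration.
query = [[1.0, 0.0], [0.0, 1.0]]
passages = {
    "p1": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "p2": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first
    "p3": [[0.4, 0.4]],              # weak match to both
}
print(top_k(query, passages, k=2))
```

The default `k=10` mirrors the top-10 passage retrieval mentioned in the summary.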

Safety and hallucination controls are handled via fine-tuning and runtime checks.

Numbers: Added prompt engineering and model response checking mechanisms (Section 2)
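
The summary does not say how response checking is implemented. As one hypothetical example of a runtime check, the sketch below verifies that an answer cites at least one retrieved passage and that every citation points at a passage that was actually supplied. The `[n]` citation format and the function name are assumptions.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")  # ASSUMED citation format: [1], [2], ...

def check_response(answer, num_passages):
    """Return a list of problems found; an empty list means the answer passes."""
    problems = []
    cited = [int(n) for n in CITATION.findall(answer)]
    if not cited:
        problems.append("no citations")
    # Citations must refer to one of the passages given to the model.
    bad = sorted({n for n in cited if not 1 <= n <= num_passages})
    if bad:
        problems.append(f"cites unknown passages: {bad}")
    return problems

print(check_response("Paris is the capital of France [1].", 10))
print(check_response("See [12] for details.", 10))
```

A check like this catches format violations only; it does not confirm that the cited passage supports the claim.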

Results

Accuracy (Natural Questions)

Value: 62.22%

Baseline: GPT-4 Turbo 63.90%

Accuracy (HotPotQA)

Value: 58.84%

Baseline: GPT-4 Turbo 62.90%

Fine-tune time

Value: <3 hours

Fine-tune sample cost

Value: < $1,000

Who Should Care

What To Try In 7 Days

Run a small LIMA-style fine-tune (≈20M tokens) on Trainium to test model style and citation output.

Deploy vLLM with multi-bucketing and continuous batching on SageMaker to measure TTFT and throughput gains.

Swap retrieval to ColBERTv2 and compare end-to-end QA accuracy on a sample of your queries.
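
For the retrieval comparison, a small scoring harness is enough to get started. The sketch below assumes a lenient metric (an answer counts as correct if any gold answer string appears in it after normalization); the summary does not specify the paper's exact accuracy metric, and the sample data is made up.

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation, collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def is_correct(prediction, gold_answers):
    """ASSUMED metric: any normalized gold answer appears in the prediction."""
    pred = normalize(prediction)
    return any(normalize(g) in pred for g in gold_answers)

def accuracy(predictions, gold):
    """Percent of questions answered correctly."""
    hits = sum(is_correct(p, g) for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)

# Made-up outputs from the same model behind two different retrievers.
gold = [["Paris"], ["1969"]]
with_retriever_a = ["The capital is Paris.", "It happened in 1968."]
with_retriever_b = ["Paris", "In 1969."]
print(accuracy(with_retriever_a, gold), accuracy(with_retriever_b, gold))
```

Keep the generator, prompt, and question set fixed between runs so that any accuracy difference can be attributed to the retriever.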

Agent Features

Tool Use

  • External retrieval and citation
  • Runtime response checking

Frameworks

  • SageMaker
  • vLLM

Optimization Features

Token Efficiency

  • Bucket selection to reduce prefill tokens

Infra Optimization

  • Deploy on AWS Trainium / Inferentia2 for cost-effective compute

Model Optimization

  • Fine-tune Llama3-Instruct 70B on Trainium

System Optimization

  • Use SageMaker for autoscaling and A/B testing

Training Optimization

  • LIMA-style small-sample fine-tuning (≈20M tokens)
  • Elastic TRN1 clusters to handle bursty trials

Inference Optimization

  • vLLM: PagedAttention and block-level memory
  • Multi-bucketing to avoid max-length prefill
  • Token-level continuous batching to improve throughput
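
Why token-level continuous batching improves throughput can be seen in a toy simulation that counts decode steps. This is a simplified model (decode only, no prefill, fixed batch capacity) and not vLLM's actual scheduler; the request lengths are made up.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(
        max(lengths[i:i + batch_size])
        for i in range(0, len(lengths), batch_size)
    )

def continuous_batching_steps(lengths, batch_size):
    """After every decode step, finished requests leave and waiting ones
    immediately fill the free slots."""
    waiting = deque(lengths)
    running = []  # remaining tokens per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode step produces one token per active request
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [2, 8, 2, 8, 2, 2]  # tokens to generate per request (toy numbers)
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

In this toy case static batching needs 18 decode steps and continuous batching needs 12, because short requests no longer hold a slot idle while a long neighbour finishes.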

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Benchmarks limited to NQ and HotPotQA; no user studies or wider domain tests.
  • No public code or reproducible deployment scripts provided.
  • Cost claims lack detailed breakdowns for serving and inference.
  • Safety checks are engineering-level and not measured on dedicated safety benchmarks.

When Not To Use

  • When you need the highest benchmark accuracy (GPT‑4 Turbo still leads on both NQ and HotPotQA).
  • If you cannot use AWS Trainium/Inferentia2 or SageMaker.
  • When you require audited, large-scale safety evaluations before production.

Failure Modes

  • Residual hallucinations despite response checks.
  • Fine-tune instability due to small sample distributions; requires multiple trials.
  • Bucket selection mistakes could truncate or mis-handle long contexts.
  • Unexpected costs at large scale, since serving and inference costs are not reported.

Core Entities

Models

  • Llama3-Instruct-70B
  • GPT-4 Turbo
  • DBRX
  • Mixtral Instruct
  • GPT-3.5 Turbo
  • Llama2-70B

Metrics

  • Accuracy
  • Time-to-First-Token
  • throughput
  • latency

Datasets

  • Natural Questions
  • HotPotQA
  • Wikipedia corpus

Benchmarks

  • Natural Questions
  • HotPotQA