LLM Architectures Papers — Parsed & Scored for Practitioners

A 50B-parameter LLM trained on ~700B tokens, specialized for financial NLP

0.60

0.45

0.80

299

A mid-size LLM trained on a large curated finance corpus delivers substantial gains on finance tasks while staying useful on general tasks, so firms can get domain accuracy without running huge models.

Key finding

Mixed training (curated finance + public data) yields strong finance performance without losing general abilities

Numbers: Training corpus: 363B financial + 345B public ≈ 708B tokens; trained on 569B tokens
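As a quick check of the corpus arithmetic, the sketch below recomputes the mixture shares from the token counts quoted above. The variable names are illustrative and nothing here comes from the paper's code.

```python
# Token counts in billions, copied from the Numbers line above.
financial_b = 363
public_b = 345
trained_b = 569

total_b = financial_b + public_b             # size of the mixed corpus
financial_share = financial_b / total_b      # fraction of the corpus that is finance data
trained_vs_corpus = trained_b / total_b      # training tokens relative to corpus size

print(f"total corpus: {total_b}B tokens")
print(f"finance share of corpus: {financial_share:.1%}")
print(f"training tokens / corpus tokens: {trained_vs_corpus:.1%}")
```

The split is close to half finance and half public data, which is the "mixed training" the key finding refers to.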

MEDITRON: open-source 7B and 70B medical LLMs trained on a 48B-token curated medical corpus

0.30

0.60

0.50

117

MEDITRON offers a strong, open-source medical LLM that rivals much larger closed models on standard benchmarks, enabling in-house finetuning, auditing, and deployment experiments while avoiding vendor lock-in, though it is not yet production-ready for clinical use.

Key finding

MEDITRON obtains consistent accuracy gains on medical benchmarks over open baselines.

Numbers: Avg accuracy +6% vs best public baseline in class; +3% vs finetuned Llama-2 (reported)

MLA + DeepSeekMoE: a 236B MoE LLM with 21B active params, 128K context, 42.5% training savings

0.70

0.80

97

DeepSeek-V2 shows you can run a very large-capacity model but only activate ~21B params per token, cutting training GPU-hours and inference memory. That lowers operational cost and lets you serve longer contexts or larger batches on the same hardware.

Key finding

DeepSeek-V2 activates far fewer parameters per token while keeping strong accuracy.

Numbers: 236B total / 21B activated params
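To make "activated parameters" concrete, here is a minimal sketch of generic top-k expert routing, plus the active-parameter fraction implied by the numbers above. It is an illustration under stated assumptions: the function `top_k_route`, the toy gate logits, and k=2 are hypothetical, and DeepSeekMoE's actual router is more involved than this.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are softmax probabilities renormalized over the selected experts.
    This is a generic top-k gate, not DeepSeekMoE's exact routing rule.
    """
    peak = max(gate_logits)
    exps = [math.exp(x - peak) for x in gate_logits]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

# Fraction of parameters used per token, from the Numbers line above.
total_params_b, active_params_b = 236, 21
active_fraction = active_params_b / total_params_b

# Toy token: four experts, only the best two are run.
routes = top_k_route([0.1, 2.0, -1.0, 1.5], k=2)
print(routes)
print(f"active fraction: {active_fraction:.1%}")
```

Only the selected experts run for a given token, which is why per-token compute tracks the activated count rather than the total.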

A concise roadmap to multimodal LLMs: architectures, training recipes, evaluation, hallucination, and extensions

0.60

0.40

0.60

85

MLLMs let products combine vision and language: build image-aware assistants, document parsers, or multimodal agents. Focus on data quality, connector design, and safe alignment to reduce hallucinations before shipping.

Key finding

MLLMs are typically built from three modules: a pre-trained modality encoder, a pre-trained LLM, and a connector between them.
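The three-module layout can be sketched in a few lines of plain Python. This is a toy illustration, assuming a single linear layer as the connector (surveyed models use a range of connector designs); the function name, shapes, and values are all hypothetical.

```python
def linear_connector(visual_features, weight_rows, bias):
    """Map encoder outputs into the LLM's embedding space.

    visual_features: list of vectors from the pre-trained modality encoder.
    weight_rows: one weight vector per output dimension (out_dim x in_dim).
    bias: one bias value per output dimension.
    """
    return [
        [sum(f * w for f, w in zip(feat, row)) + b for row, b in zip(weight_rows, bias)]
        for feat in visual_features
    ]

# Toy stand-ins: 2 image patches with 2-dim encoder features, 3-dim LLM embeddings.
patch_features = [[1.0, 2.0], [0.0, 1.0]]
weight_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
text_embeddings = [[0.5, 0.5, 0.5]]  # embedding of one text token

visual_tokens = linear_connector(patch_features, weight_rows, bias)
llm_input = visual_tokens + text_embeddings  # projected visual tokens, then text
print(llm_input)
```

In this sketch the LLM simply receives the projected visual tokens in the same sequence as its text embeddings.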

A practical, up-to-date survey of LLMs focused on generating code from natural language

0.70

0.60

0.80

54

Code LLMs can speed development, automate routine coding, and augment junior engineers; open-source instruct-tuned models now match many closed APIs on standard tasks, making in-house deployments feasible while highlighting the need to evaluate on real repo-scale work and safety constraints.

Key finding

Models improved dramatically on small-function benchmarks over recent years.

Numbers: HumanEval pass@1 rose from 3.6% (PaLM 8B) to 95.1% (LDB) as reported in the survey
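For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c that pass the unit tests, and estimate the chance that at least one of k drawn samples passes. A minimal sketch follows; the sample counts in the usage lines are made up for illustration, and the survey's figures may have been produced under different sampling settings.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being evaluated.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts: 10 samples, 3 passed.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
print(p1, p5)
```

The benchmark score is the mean of this value over all problems.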

Large multilingual evaluation shows ChatGPT is strong at grammar but weak at multilingual semantic tasks

0.30

0.40

0.60

51

ChatGPT zero-shot is good for quick grammar-level tasks (like POS tagging) but not reliable for production semantic tasks across many languages; invest in task- and language-specific models for higher accuracy and lower operational risk.

Key finding

ChatGPT generally underperforms supervised task-specific models on semantic multilingual tasks.

Numbers: XNLI avg acc: ChatGPT (en) 57.0% vs mT5-XXL 87.1%

PIXIU: open financial LLM + 136K instruction examples and FLARE benchmark

0.60

0.50

43

Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.

Key finding

They built FIT with 136,609 instruction‑tuning examples across 5 tasks and 9 datasets.

Numbers: 136,609 samples; 5 tasks; 9 datasets

ChemLLM: a 7B chemistry-tuned LLM with ChemData (7M Q&A) and ChemBench (4.1k MCQs), matching GPT-4 on core chemical tasks

0.60

0.70

40

A domain-tuned 7B model can match or beat much larger closed models on key chemistry tasks, enabling lower-cost deployment of chemistry assistants and search tools for R&D teams.

Key finding

ChemData provides 7M instruction Q&A pairs for chemistry tuning.

Numbers: 7M instruction Q&A (authors' dataset summary)

Jamba: hybrid Transformer + Mamba + MoE that fits long contexts in one 80GB GPU

0.30

0.80

0.70

40

Jamba lets teams process much longer documents on standard GPUs while cutting memory needs and speeding up inference; that reduces infrastructure cost for long-document products and enables new features like huge-context summarization.

Key finding

Hybrid Jamba reduces KV cache for 256K tokens to 4GB.

Numbers: KV cache (256K, 16bit): Jamba 4GB vs Mixtral 32GB vs Llama‑2 128GB
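The KV cache figures can be reproduced with the standard size formula: 2 (keys and values) × attention layers × KV heads × head dimension × sequence length × bytes per value. The sketch below is a back-of-the-envelope check; the per-model layer and head counts are assumptions chosen to match commonly cited configurations (a 7B-class Llama-2, Mixtral with grouped-query attention, and Jamba with attention in only a few of its layers), not values stated in this summary.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence (default 16-bit values)."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value

SEQ_LEN = 256 * 1024   # 256K tokens
GIB = 1024 ** 3

# (attention layers, KV heads, head dim): assumed configurations, see lead-in.
configs = {
    "Llama-2 (7B-class)": (32, 32, 128),
    "Mixtral": (32, 8, 128),
    "Jamba": (4, 8, 128),
}

sizes_gib = {
    name: kv_cache_bytes(layers, heads, dim, SEQ_LEN) / GIB
    for name, (layers, heads, dim) in configs.items()
}
for name, size in sizes_gib.items():
    print(f"{name}: {size:.0f} GiB")
```

Under these assumptions, most of Jamba's saving comes from having few attention layers, since the Mamba layers keep no per-token KV cache.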

Clinical Camel: an open medical LLM fine-tuned with dialogue synthesis and single‑GPU QLoRA

0.20

0.60

0.40

35

An open, high-performing medical LLM reduces vendor lock-in, enables internal validation, and can be reproduced with modest compute, letting institutions experiment safely before any clinical adoption.

Key finding

Clinical Camel-70B beats GPT-3.5 on several medical QA benchmarks in five-shot tests.

Numbers: USMLE 64.3% vs GPT-3.5 58.5%; PubMedQA 77.9% vs 60.2%
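The single-GPU claim rests on how few weights a LoRA-style adapter trains. A minimal sketch of the parameter count for one adapted weight matrix follows; the matrix size and rank are hypothetical and are not taken from the Clinical Camel paper. QLoRA additionally stores the frozen base weights in 4-bit precision, which this sketch does not model.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter.

    The update is B @ A with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return rank * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters in the frozen weight matrix being adapted."""
    return d_in * d_out

# Hypothetical square projection matrix and rank, for illustration only.
d_in = d_out = 8192
rank = 64

adapter = lora_param_count(d_in, d_out, rank)
frozen = full_param_count(d_in, d_out)
print(f"adapter params: {adapter:,} ({adapter / frozen:.2%} of the frozen matrix)")
```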

A practical map of how knowledge graphs and multimodal AI fit together today and where to push next

0.60

0.50

0.60

28

Adding structured knowledge to multimodal systems improves accuracy, interpretability, and long-tail reasoning. That helps applications like search, recommendation, product QA, and compliance where factual grounding and rare facts matter.

Key finding

The survey covers more than 300 related papers.

Numbers: ‘over 300 articles’ (abstract)

HaELM: an LLM-based, low-cost evaluator to detect and analyze hallucinations in vision-language models

0.50

0.60

0.70

26

Hallucinations make multimodal systems unreliable and risky. HaELM offers a cheaper, local way to measure hallucination and run repeated checks without sending data to external APIs.

Key finding

Object-query tests trigger affirmation bias: models answer "yes" more than 80% of the time for absent objects, yet actual caption hallucination is below 10%.

Numbers: AY >80%; CH <10% (Figure 2, Appendix Tables 9-11)

InvestLM: a domain-tuned LLaMA-65B for finance that boosts financial NLP and matches many commercial LLMs in expert judgment

0.60

0.40

0.60

24

A small, high-quality instruction set can turn an open foundation model into a capable finance assistant, offering a lower-cost, open alternative to closed commercial finance LLMs while enabling on-premise control and inspection.

Key finding

Instruction-tuning LLaMA-65B with ~1,300 curated finance instructions improves most finance tasks.

Numbers: 8 of 9 tasks: InvestLM > LLaMA-65B (Table 3); FinSent 0.71→0.79

Instruction tuning unlocks Mixture-of-Experts: similar or better accuracy at ~1/3 the compute

0.70

0.50

0.80

20

Combine instruction tuning with MoE to cut runtime compute and cost: MoE models can match or beat dense baselines while using far fewer per-token FLOPs, which reduces inference cost without sacrificing accuracy on many English tasks.

Key finding

Instruction tuning increases MoE gains vs dense models.

Numbers: 7.1% absolute gain on MMLU-Direct (avg) for FLAN‑MOE over dense at similar FLOPs

Train a vision-language model to read and reason across many images in one prompt

0.60

0.70

0.50

18

If your product must reason over multiple images together (multi-photo chat, visual QA over albums, video snapshots), MMICL-style models reduce hallucinations and improve multi-image reasoning by adding explicit image tokens and multi-image instruction tuning.

Key finding

MMICL improves matching of captions to images on compositional image/text puzzles (Winoground).

Numbers: Text 45 / Image 45 / Group 43 (MMICL FLAN-T5-XXL, Table 2)
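Winoground's three scores follow a simple rule per example (two captions, two images). A minimal sketch, assuming you already have a match score for every caption-image pair; how a generative model such as MMICL produces those scores is not covered in this summary, and the numbers in the usage lines are invented for illustration.

```python
def winoground_scores(s):
    """Return (text_ok, image_ok, group_ok) for one Winoground example.

    s maps (caption, image) pairs to a match score, with captions "c0"/"c1"
    and images "i0"/"i1"; c0 belongs with i0 and c1 with i1.
    """
    # Text score: for each image, the correct caption must outscore the other.
    text_ok = s[("c0", "i0")] > s[("c1", "i0")] and s[("c1", "i1")] > s[("c0", "i1")]
    # Image score: for each caption, the correct image must outscore the other.
    image_ok = s[("c0", "i0")] > s[("c0", "i1")] and s[("c1", "i1")] > s[("c1", "i0")]
    # Group score: both conditions must hold.
    return text_ok, image_ok, text_ok and image_ok

# Invented scores for one example.
example = {
    ("c0", "i0"): 0.9, ("c1", "i0"): 0.4,
    ("c0", "i1"): 0.6, ("c1", "i1"): 0.5,
}
result = winoground_scores(example)
print(result)
```

Each reported number is the percentage of examples for which the corresponding flag is true, so the group score can never exceed either of the other two.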

Systematic evaluation of GPT-4V and LLaVA on 1000+ vision+text engineering design tasks

0.30

0.60

0.50

17

VLMs like GPT-4V can speed up low-value, repetitive visual tasks (sketch similarity, captioning with handwriting) and help populate searchable design catalogs, but they currently cannot replace engineering checks that need precise spatial, numeric, or manufacturability guarantees.

Key finding

GPT-4V matches or exceeds human raters on sketch-similarity triplet tests.

Numbers: Self-consistency 94%; transitive violations = 5 (best human = 5).

WebAgent: combine an HTML-specialist LLM and a code LLM to plan, summarize long pages, and act by generating Python for real websites

0.60

0.70

0.60

16

WebAgent shows a practical path to robust web automation: use a small specialist model to understand long HTML and a capable code-generating LLM to act. That reduces brittle failures on real sites and drastically raises task success in human-supervised runs.

Key finding

Modular WebAgent dramatically improves real-site success rates.

Numbers: Success: real-estate 65% vs 10%; social-media 70% vs 20%; map 80% vs 10%

MiniGPT-5: fuse an LLM with Stable Diffusion using 'generative vokens' for interleaved image+text outputs

0.60

0.70

0.60

16

MiniGPT-5 lets a single system produce coherent text and images together, cutting the need for separate caption→image pipelines and reducing integration overhead, while training only a small set of parameters.

Key finding

Humans prefer MiniGPT-5 outputs over a two-stage baseline on multimodal story tasks.

Numbers: Language continuity 55.22% vs 34.89%; image quality 52.43% vs 37.79%; multimodal coherence 56.9% vs 28.88%

Survey of 126 multimodal LLMs: architectures, training recipes, benchmarks, and next steps

0.70

0.45

0.65

15

You can add vision, audio, or other modalities to existing LLMs cheaply by training small projectors or PEFT adapters, unlocking richer user interactions without retraining huge models.

Key finding

Most MM-LLMs add small adapters while keeping the core LLM frozen.

Numbers: Trainable params typically ≈2% (projectors only); PEFT can be <0.1%

Survey of financial LLMs: techniques, benchmarks, and practical gaps

0.50

0.40

0.60

14

FinLLMs help automate common finance language tasks but are uneven: use task-finetuned PLMs for classification/NER to cut cost; reserve large LLMs for complex QA or exploratory uses with human checks.

Key finding

For sentiment analysis, mixed-domain PLMs achieved top scores, while instruction-finetuned LLMs matched but cost more.

Numbers: FLANG-ELECTRA F1=92%; FinMA-30B/GPT-4 F1≈87% (5-shot)

Monarch Mixer: replace attention and MLPs with sub-quadratic GEMM-friendly layers to speed long-context models

0.50

0.70

14

If you run models with long contexts or want lower parameter cost, M2 can cut compute or model size and improve throughput on many GPUs while keeping accuracy; expect implementation and kernel work before production parity on all hardware.

Key finding

M2-BERT matches BERT-base GLUE while cutting parameters.

Numbers: GLUE 79.9 vs 79.6; −27% params (M2 80M vs BERT 110M)

A 7B cancer-specialized LLM that matches or beats larger models on phenotype extraction and diagnosis generation

0.60

0.45

0.75

11

CancerLLM shows that a domain-tuned 7B model can reach or exceed larger models on cancer tasks while using far less GPU memory, lowering operational cost for hospitals and clinics.

Key finding

CancerLLM achieves state-of-the-art average F1 on diagnosis generation among evaluated models.

Numbers: Diagnosis average F1 = 86.81% (Table 1)

STC connector + audio branch: stronger video and audio understanding for Video-LLMs

0.70

0.55

0.45

10

VideoLLaMA 2 improves video and audio understanding while keeping changes to the encoders and the LLM minimal; this lowers the data and compute needed to reach strong open-source performance and speeds integration into product pipelines.

Key finding

Adding STC connector (RegStage + 3D conv) yields the best average video QA performance in the architecture sweep.

Numbers: Avg. acc. 45.1 (Table 1, green line)

SeaLLMs: language models tuned and tokenized for Southeast Asian languages

0.70

0.80

7

SeaLLMs let companies offer cheaper, smaller models that serve Southeast Asian languages better than general English-centric models, improving UX and reducing API costs for these markets.

Key finding

Vocabulary expansion sharply reduced token cost for non‑Latin SEA scripts.

Numbers: Thai token ratio improved from 9.09→1.87 (SeaLLMs, Table 1)
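The practical effect of the ratio change is easy to quantify. A minimal sketch using only the two ratios quoted above, and assuming that token count (and therefore per-token cost) scales linearly with the ratio for the same text:

```python
# Thai token ratios before and after vocabulary expansion, from the Numbers line.
ratio_before = 9.09
ratio_after = 1.87

reduction_factor = ratio_before / ratio_after   # how many times fewer tokens
token_savings = 1 - ratio_after / ratio_before  # fraction of tokens eliminated

print(f"about {reduction_factor:.1f}x fewer tokens")
print(f"about {token_savings:.0%} fewer tokens for the same Thai text")
```

Under that assumption, the same context window also holds proportionally more Thai text.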