Prompt Engineering Papers — Parsed & Scored for Practitioners

Reprogram frozen LLMs to forecast time series using text prototypes and Prompt-as-Prefix

0.70

0.60

0.70

127

You can add time series forecasting to an existing LLM deployment with little extra training and often better accuracy in low-data and cross-domain cases.

Key finding

TIME-LLM achieves lower average long-term forecasting MSE than a fine-tuned LLM baseline (GPT4TS).

Numbers: ≈12% average MSE reduction vs GPT4TS on evaluated long-term benchmarks

Two-stage multimodal chain-of-thought lets sub‑1B models reason with images and text

0.60

0.45

96

You can get near-state-of-the-art multimodal reasoning with lightweight models by fine-tuning in two stages and fusing image features—this reduces hallucination and lowers compute cost versus running large multimodal LLMs.

Key finding

Predicting a rationale before the answer hurts small-model accuracy in one-stage text-only training.

Numbers: No-CoT 81.63% vs Reasoning 69.32% (↓12.31pp)

Let multiple copies of an LLM debate to improve reasoning and reduce hallucinations

0.50

0.60

0.40

85

If accuracy matters more than latency, running several LLM copies that debate can materially reduce wrong answers and hallucinations, producing higher-quality outputs for QA, math, and plan generation.

Key finding

Multiagent debate raises arithmetic accuracy from 67.0% to 81.8% on the paper's test set.

Numbers: Arithmetic: 67.0% → 81.8% (Table 1)
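
For readers who want to picture the mechanics, here is a minimal sketch of a debate loop. It assumes each "agent" is a callable that wraps an LLM call; the `Agent` type, the `debate` function, and the majority-vote aggregation are illustrative choices, not the paper's exact protocol.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical interface: an agent takes the question plus its peers'
# latest answers and returns its own (possibly revised) answer.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], rounds: int = 2) -> str:
    """Run a simple debate and return the most common final answer."""
    # Round 0: every agent answers independently, with no peer answers.
    answers = [agent(question, []) for agent in agents]
    for _ in range(rounds):
        # Each agent sees the other agents' latest answers and may revise.
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Aggregate the final round by majority vote (an assumption here).
    return Counter(answers).most_common(1)[0][0]
```

Each extra round multiplies the number of model calls by the number of agents, which is why the approach fits cases where accuracy matters more than latency.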

Instruction finetuning small open LLMs (Alpaca, FLAN-T5) boosts mental-health prediction to match or beat much larger models

0.25

0.60

0.55

59

Finetuning small open LLMs on a variety of labeled mental-health texts can yield classifiers that match or beat much larger models, reducing inference cost and vendor dependence while preserving multi-task flexibility.

Key finding

Instruction finetuning markedly improves performance over prompting.

Numbers: Alpaca finetuned: +23.4% balanced accuracy vs Alpaca zero-shot

Appending short emotional phrases to prompts measurably improves LLM outputs

0.60

0.50

0.40

57

A very low-cost prompt change (add one short emotional sentence) can raise automated and human-perceived output quality, reduce hallucinations, and improve responsibility—useful for chat assistants, content generation, and QA systems where marginal gains matter.

Key finding

EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.

Numbers: 8.00% relative improvement on Instruction Induction (Table 1)
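
The change itself is small enough to sketch in a few lines. The helper name `add_emotional_stimulus` and the example sentence below are illustrative assumptions, not taken from the summary above; substitute whichever phrases you want to test.

```python
def add_emotional_stimulus(prompt: str, stimulus: str) -> str:
    """Append one short emotional sentence to the end of a prompt."""
    return f"{prompt.rstrip()} {stimulus.strip()}"


# Illustrative phrase only; treat it as a placeholder to experiment with.
EXAMPLE_STIMULUS = "This is very important to my career."

base_prompt = "Determine whether the movie review is positive or negative."
print(add_emotional_stimulus(base_prompt, EXAMPLE_STIMULUS))
```

Because the edit is a pure string operation, it is easy to A/B test against the unmodified prompt on your own evaluation set.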

LLMs excel at simple sentiment tasks but struggle with fine-grained, structured sentiment extraction

0.60

0.40

0.60

55

Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.

Key finding

LLMs match fine-tuned small models on simple sentiment classification in zero-shot.

Numbers: ChatGPT ≈97% of T5 performance on SC tasks (paper text).

PIXIU: open financial LLM + 136K instruction examples and FLARE benchmark

0.60

0.50

43

Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.

Key finding

They built FIT with 136,609 instruction‑tuning examples across 5 tasks and 9 datasets.

Numbers: 136,609 samples; 5 tasks; 9 datasets

Small prompt formatting changes can swing LLM accuracy by tens of points

0.60

40

Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.

Key finding

Formatting can change accuracy by very large amounts.

Numbers: Max spread 76 accuracy points (LLaMA-2-13B)
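
One way to act on this is to score the same task under several equivalent formats and report the spread. The sketch below is an assumption-laden illustration: `build_formats`, `accuracy_spread`, and the specific separators and casings are hypothetical, and `evaluate` stands in for whatever function runs your model on a template and returns accuracy.

```python
from itertools import product
from typing import Callable, List, Tuple


def build_formats() -> List[str]:
    """Enumerate templates that differ only in surface formatting."""
    field_names = [("Question", "Answer"), ("QUESTION", "ANSWER")]
    separators = [": ", ":: ", " - "]
    joiners = ["\n", " "]
    return [
        f"{q}{sep}{{question}}{join}{a}{sep}"
        for (q, a), sep, join in product(field_names, separators, joiners)
    ]


def accuracy_spread(
    formats: List[str], evaluate: Callable[[str], float]
) -> Tuple[float, float, float]:
    """Return (min, max, max - min) of accuracy across formats."""
    scores = [evaluate(fmt) for fmt in formats]
    return min(scores), max(scores), max(scores) - min(scores)
```

Reporting the minimum and maximum alongside the mean makes it harder for a single lucky format to drive a model-selection decision.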

Shift a few attention-head activations at inference to make LLMs answer more truthfully

0.60

0.70

39

ITI is a low-cost way to reduce factual errors without heavy finetuning; it can be added to deployed models that expose activations to improve trustworthiness quickly.

Key finding

ITI greatly increases truthfulness on TruthfulQA for instruction-tuned models.

Numbers: Alpaca true*informative 32.5% → 65.1%
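
The core operation, shifting selected head activations along a fixed direction, can be sketched without any framework. Everything named below (`shift_heads`, the `(layer, head)` keys, the `alpha` strength) is a hypothetical illustration; how the directions are found and scaled is left out and would follow the paper.

```python
from typing import Dict, List, Tuple

Vector = List[float]
HeadKey = Tuple[int, int]  # (layer index, head index); an assumed keying scheme


def shift_heads(
    activations: Dict[HeadKey, Vector],
    directions: Dict[HeadKey, Vector],
    alpha: float,
) -> Dict[HeadKey, Vector]:
    """Add alpha * direction to the selected heads; copy the rest unchanged."""
    shifted: Dict[HeadKey, Vector] = {}
    for key, vec in activations.items():
        direction = directions.get(key)
        if direction is None:
            # Head not selected for intervention: leave its output as is.
            shifted[key] = list(vec)
        else:
            shifted[key] = [v + alpha * d for v, d in zip(vec, direction)]
    return shifted
```

In a real deployment this would run inside a forward hook on the model, which is why the approach needs access to activations.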

OptiGuide: use LLMs to translate plain-English what‑if questions into solver code and human explanations without sending private data

0.70

0.50

0.70

32

OptiGuide speeds what‑if and root‑cause analysis for planners, reduces engineer on‑call cycles, and keeps sensitive data in‑house while surfacing solver decisions in plain English.

Key finding

GPT‑4 achieves high accuracy answering quantitative supply‑chain questions when given examples in the prompt.

Numbers: ≈93% average accuracy (GPT‑4, in‑distribution)

Fine-tune LLMs with map-context prompts to predict population and economic indicators

0.70

0.55

0.65

28

GeoLLM provides a low-cost geospatial signal from pretrained LLMs that can match or beat satellite nightlight baselines and work with hundreds to thousands of labels, making it useful where imagery is costly or missing.

Key finding

GeoLLM yields large gains over prompt-based and classic baselines on real geospatial tasks.

Numbers: 70% improvement in Pearson's r^2 vs nearest-neighbor/XGBoost baselines (paper claim)

Train a language model to follow feedback by conditioning on ranked model outputs and natural-language feedback

0.60

0.50

27

CoH gives a cheaper path to human-aligned outputs: you can use existing human preference labels and a simple finetuning loop instead of training a reward model and running RL, speeding iteration and reducing engineering risk.

Key finding

CoH wins the majority of pairwise human comparisons on summarization versus the pretrained base model.

Numbers: CoH chosen 57.5% vs Base 19.9% (Δ +37.6 pp) on summarization human eval

ChatGPT/GPT‑4 can directly rank search passages with simple prompts; distilled small models inherit that power

0.65

0.55

0.60

23

LLMs can directly re-rank search results zero‑shot and produce supervisory labels to train small, cheaper re‑rankers; this can cut inference cost and maintenance versus training large supervised re‑rankers on noisy labels.

Key finding

GPT‑4 outperforms strong supervised re‑rankers on standard benchmarks when using permutation prompts.

Numbers: nDCG@10: GPT‑4 53.68 vs monoT5 (3B) 51.36 on BEIR (avg) (Δ +2.32)
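
A permutation prompt asks the model to return an ordering of passage identifiers, which then has to be parsed defensively. The sketch below assumes a reply format like `[2] > [3] > [1]` with 1-based identifiers; that format and the function name `parse_permutation` are assumptions for illustration.

```python
import re
from typing import List


def parse_permutation(response: str, num_passages: int) -> List[int]:
    """Turn a reply like '[2] > [3] > [1]' into 0-based passage indices.

    Out-of-range or repeated identifiers are dropped, and passages the
    model omitted are appended in their original order.
    """
    ranking: List[int] = []
    for token in re.findall(r"\[(\d+)\]", response):
        idx = int(token) - 1
        if 0 <= idx < num_passages and idx not in ranking:
            ranking.append(idx)
    missing = [i for i in range(num_passages) if i not in ranking]
    return ranking + missing


print(parse_permutation("[2] > [3] > [1]", 3))
```

Falling back to the original order for missing identifiers keeps the output a valid permutation even when the model's reply is incomplete.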

DAIL-SQL: prompt+example selection that sets a new Spider Text-to-SQL high (86.6% EX)

0.70

0.60

23

DAIL-SQL gives a practical recipe to improve Text-to-SQL accuracy while cutting token cost; that reduces API spend and speeds up production query interfaces.

Key finding

DAIL-SQL sets a new top score on the Spider leaderboard with GPT-4 and self-consistency.

Numbers: 86.6% execution accuracy (leaderboard, with self-consistency)
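
Self-consistency for Text-to-SQL is commonly implemented as a vote over execution results: sample several queries, run them, and keep one whose result is most common. The sketch below shows that general idea with SQLite; the function name `pick_by_execution_vote` and the order-insensitive comparison are assumptions, not details confirmed by the paper.

```python
import sqlite3
from collections import Counter
from typing import List, Optional


def pick_by_execution_vote(
    conn: sqlite3.Connection, candidates: List[str]
) -> Optional[str]:
    """Return the first candidate whose result set is the most common one."""
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # Skip candidates that fail to execute.
        # Simplification: sort rows so ordering differences do not split votes.
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    winner = Counter(result for _, result in executed).most_common(1)[0][0]
    return next(sql for sql, result in executed if result == winner)
```

Run the candidates against a read-only connection or a scratch copy of the database, since sampled queries are untrusted.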

Use a pre-trained LLM (GPT-3.5) as a zero-shot search operator and distill it into a white-box linear operator for MOEA/D

0.40

0.60

0.50

21

You can prototype new evolutionary operators with natural-language prompts and then distill them into cheap, explainable operators — reducing expert design time and cutting API cost after distillation.

Key finding

MOEA/D-LLM (GPT-3.5) produces competitive hypervolume (HV) on five real engineering RE instances.

Numbers: RE21 HV: 0.7936 vs MOEA/D 0.781 (Table I)

Turn decoder-only LLMs into strong text encoders with three cheap steps

0.70

0.60

0.70

20

You can convert existing decoder-only LLMs into high-quality embedding models cheaply and quickly (hours on one GPU) without labeled data, unlocking better retrieval and tagging with fewer resources than full retraining.

Key finding

LLM2Vec applied to Mistral-7B yields the top unsupervised MTEB score reported in the paper.

Numbers: 56.80 (MTEB average over 56 datasets, unsupervised, Mistral-7B)

LLMs can generate SystemVerilog security assertions from natural-language comments

0.60

0.45

20

LLMs can speed drafting of hardware security checks by turning comments into assertions, cutting expert time, but outputs need automated validation and human review before deployment.

Key finding

LLMs can produce correct security assertions but quality varies widely.

Numbers: Average correctness 26.54% across 2,268 prompt types; best prompt 93.55%

Survey: where multimodal LLMs stand on reasoning, benchmarks, training recipes, and gaps

0.40

0.30

0.40

19

If your product needs reliable multimodal reasoning (e.g., visual QA, robotics planning), current models vary widely; instruction tuning and careful training stages materially improve results, but proprietary models still lead.

Key finding

Proprietary multimodal models outperform open-source models on reasoning-focused benchmarks.

Numbers: InfiMM-Eval overall: GPT-4V 74.44 vs InfiMM-LLaMA-13B 40.7

Train a vision-language model to read and reason across many images in one prompt

0.60

0.70

0.50

18

If your product must reason over multiple images together (multi-photo chat, visual QA over albums, video snapshots), MMICL-style models reduce hallucinations and improve multi-image reasoning by adding explicit image tokens and multi-image instruction tuning.

Key finding

MMICL improves matching of captions to images on compositional image/text puzzles (Winoground).

Numbers: Text 45 / Image 45 / Group 43 (MMICL FLAN-T5-XXL, Table 2)

Make LLMs think in program structures to improve code generation

0.70

0.60

0.50

17

SCoT is a low-cost change to prompts that raises real-world code suggestion quality and developer satisfaction, so product teams can get better auto-generated code without model retraining.

Key finding

SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.

Numbers: Pass@1 +13.79% relative (CoT 53.29 → SCoT 60.64, i.e. +7.35 points absolute)
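
The percentage here is a relative gain, not a difference in percentage points. A short check of the arithmetic, using only the two Pass@1 values quoted above (the helper name `relative_gain` is just for illustration):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return (improved - baseline) / baseline * 100.0


cot, scot = 53.29, 60.64
print(f"absolute: {scot - cot:.2f} points")
print(f"relative: {relative_gain(cot, scot):.2f}%")
```

The same distinction is worth checking whenever a digest entry quotes a "+X%" figure without saying which kind it is.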

Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

0.60

0.50

16

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Key finding

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP)

AlpaCare: fine-tuning LLaMA with a 52k machine-generated medical instruction dataset to improve medical and general instruction following

0.60

15

A small, diverse machine-generated medical instruction dataset can improve both medical answer quality and general instruction-following, offering a cost-effective way to build better clinical assistants while keeping development and data costs lower than large human-annotation efforts.

Key finding

AlpaCare gives large absolute gains on free-form medical instruction evaluation compared to prior baselines.

Numbers: up to 38.1% absolute gain (paper claim)

An open, continuously updated leaderboard that measures LLM multi-step reasoning using chain-of-thought prompts

0.60

0.40

0.60

15

Reasoning capability separates good conversational models from ones that can solve multi-step tasks; measuring it helps pick models for products that need math, code, or multi-step decisions.

Key finding

Reasoning performance scales with model size.

Numbers: GSM8k: GPT-4 92.0 vs LLaMA-65B 50.9

Teach an LLM to read graph structure with two-stage instruction tuning and a tiny alignment projector

0.50

0.60

15

GraphGPT enables LLMs to use graph structure with low-cost tuning, improving cross-dataset predictions and saving compute by using compact graph tokens instead of long text prompts.

Key finding

GraphGPT improves zero-shot transfer accuracy compared to base LLMs and GNNs on evaluated benchmarks.

Numbers: Arxiv-PubMed zero-shot: GraphGPT-7B-v1.5-std Acc=0.7011 vs vicuna-7B-v1.5 Acc=0.6351 (Δ=+0.066)