Instruction Design Papers — Parsed & Scored for Practitioners

Instruction finetuning small open LLMs (Alpaca, FLAN-T5) boosts mental-health prediction to match or beat much larger models

0.25

0.60

0.55

59

Finetuning small open LLMs on a variety of labeled mental-health texts can yield classifiers that match or beat much larger models, reducing inference cost and vendor dependence while preserving multi-task flexibility.

Key finding

Instruction finetuning markedly improves performance over prompting.

Numbers: Alpaca finetuned: +23.4% balanced accuracy vs Alpaca zero-shot
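
For readers unfamiliar with the metric, balanced accuracy is the mean of per-class recall, so a classifier cannot score well by always predicting the majority class. The sketch below is a minimal pure-Python illustration; the labels are made up and are not data from the paper.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; every class counts equally regardless of size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Toy labels (illustrative only): 8 negatives and 2 positives.
y_true = [0] * 8 + [1] * 2
always_negative = [0] * 10

# Plain accuracy would be 0.8 here, but balanced accuracy is only 0.5.
print(balanced_accuracy(y_true, always_negative))
```

This is why the metric suits imbalanced tasks, where the positive class is rare and plain accuracy can hide a model that never detects it.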

Appending short emotional phrases to prompts measurably improves LLM outputs

0.60

0.50

0.40

57

A very low-cost prompt change (add one short emotional sentence) can raise automated and human-perceived output quality, reduce hallucinations, and improve responsibility—useful for chat assistants, content generation, and QA systems where marginal gains matter.

Key finding

EmotionPrompt raised average deterministic benchmark scores on Instruction Induction.

Numbers: 8.00% relative improvement on Instruction Induction (Table 1)
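
The prompt change itself is mechanically trivial, as the sketch below suggests. The function name, the task text, and the stimulus sentence are illustrative assumptions; consult the paper for the exact phrases that were evaluated.

```python
def add_emotional_stimulus(prompt: str, stimulus: str) -> str:
    """Append one short emotional sentence to the end of a prompt."""
    return f"{prompt.rstrip()} {stimulus.strip()}"


# Placeholder task and stimulus, chosen for illustration only.
base_prompt = "Determine whether the movie review is positive or negative."
stimulus = "This is very important to my career."

print(add_emotional_stimulus(base_prompt, stimulus))
```

Keeping the stimulus as a separate argument makes it easy to A/B test the original prompt against the augmented one on your own task.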

LLMs excel at simple sentiment tasks but struggle with fine-grained, structured sentiment extraction

0.60

0.40

0.60

55

Off-the-shelf LLMs can replace expensive labeling for basic sentiment tasks and speed up pilot projects, but structured extraction and safety-sensitive detection still need specialist models or human review.

Key finding

LLMs nearly match fine-tuned small models on simple sentiment classification in the zero-shot setting.

Numbers: ChatGPT ≈97% of T5 performance on SC tasks (paper text).

PIXIU: open financial LLM + 136K instruction examples and FLARE benchmark

0.60

0.50

43

Open domain-tuned models and labeled instruction data lower the bar to build finance-specific AI: cheaper customization, reproducible evaluation, and better performance on common text tasks; numeric QA and trading signals still need extra work.

Key finding

The authors built FIT, an instruction‑tuning dataset of 136,609 examples across 5 tasks and 9 datasets.

Numbers: 136,609 samples; 5 tasks; 9 datasets

Shift a few attention-head activations at inference to make LLMs answer more truthfully

0.60

0.70

39

ITI is a low-cost way to reduce factual errors without heavy finetuning; it can be added to deployed models that expose activations to improve trustworthiness quickly.

Key finding

ITI greatly increases truthfulness on TruthfulQA for instruction-tuned models.

Numbers: Alpaca true*informative 32.5% -> 65.1%
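
The core operation, shifting the activations of a few selected attention heads along a fixed direction at inference time, can be sketched as follows. This is a toy illustration with plain lists, not the authors' implementation: the data layout, the `alpha` strength parameter, and the function name are assumptions, and how the heads and directions are chosen is outside the sketch.

```python
def shift_heads(head_outputs, directions, alpha=1.0):
    """Add alpha * direction to the output of each selected head.

    head_outputs: dict mapping (layer, head) -> list of floats.
    directions:   dict mapping (layer, head) -> list of floats of the same
                  length; only heads listed here are shifted.
    """
    shifted = {}
    for key, activation in head_outputs.items():
        direction = directions.get(key)
        if direction is None:
            shifted[key] = list(activation)  # untouched head, copied as-is
        else:
            shifted[key] = [a + alpha * d for a, d in zip(activation, direction)]
    return shifted


# Toy activations for two heads in layer 0 (illustrative values only).
outputs = {(0, 0): [1.0, 2.0], (0, 1): [3.0, 4.0]}
directions = {(0, 1): [1.0, -1.0]}  # intervene on head (0, 1) only

print(shift_heads(outputs, directions, alpha=2.0))
```

In a real model the same addition would be applied inside the forward pass, which is why the technique requires access to intermediate activations.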

OptiGuide: use LLMs to translate plain-English what‑if questions into solver code and human explanations without sending private data

0.70

0.50

0.70

32

OptiGuide speeds what‑if and root‑cause analysis for planners, reduces engineer on‑call cycles, and keeps sensitive data in‑house while surfacing solver decisions in plain English.

Key finding

GPT‑4 achieves high accuracy answering quantitative supply‑chain questions when given examples in the prompt.

Numbers: ≈93% average accuracy (GPT‑4, in‑distribution)

Train a language model to follow feedback by conditioning on ranked model outputs and natural-language feedback

0.60

0.50

27

CoH gives a cheaper path to human-aligned outputs: you can use existing human preference labels and a simple finetuning loop instead of training a reward model and running RL, speeding iteration and reducing engineering risk.

Key finding

CoH wins the majority of pairwise human comparisons on summarization versus the pretrained base model.

Numbers: CoH chosen 57.5% vs Base 19.9% (∆ +37.6 pp) on summarization human eval

ChatGPT/GPT‑4 can directly rank search passages with simple prompts; distilled small models inherit that power.

0.65

0.55

0.60

23

LLMs can directly re-rank search results zero‑shot and produce supervisory labels to train small, cheaper re‑rankers; this can cut inference cost and maintenance versus training large supervised re‑rankers on noisy labels.

Key finding

GPT‑4 outperforms strong supervised re‑rankers on standard benchmarks when using permutation prompts.

Numbers: nDCG@10: GPT‑4 53.68 vs monoT5 (3B) 51.36 on BEIR (avg), delta +2.32
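
With a permutation prompt, the model is shown numbered passages and asked to return an ordering of their identifiers. The sketch below shows one way to turn such a response into a re-ranked list; the `[2] > [3] > [1]` response format and the fallback behavior are assumptions made for illustration.

```python
import re


def apply_permutation(passages, response):
    """Reorder passages according to a response such as '[2] > [3] > [1]'.

    Identifiers are 1-based. Out-of-range or repeated identifiers are ignored,
    and passages the response did not mention are appended in their original
    relative order.
    """
    order = []
    for match in re.findall(r"\[(\d+)\]", response):
        index = int(match) - 1
        if 0 <= index < len(passages) and index not in order:
            order.append(index)
    order += [i for i in range(len(passages)) if i not in order]
    return [passages[i] for i in order]


passages = ["about cats", "about dogs", "about birds"]
print(apply_permutation(passages, "[2] > [3] > [1]"))
# ['about dogs', 'about birds', 'about cats']
```

Tolerating malformed output matters in practice, because a model may skip or repeat identifiers when the candidate list is long.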

DAIL-SQL: prompt+example selection that sets a new Spider Text-to-SQL high (86.6% EX)

0.70

0.60

23

DAIL-SQL gives a practical recipe to improve Text-to-SQL accuracy while cutting token cost; that reduces API spend and speeds up production query interfaces.

Key finding

DAIL-SQL, using GPT-4 with self-consistency, takes the top spot on the Spider leaderboard.

Numbers: 86.6% execution accuracy (leaderboard, with self-consistency)
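
Self-consistency for Text-to-SQL is commonly implemented by sampling several candidate queries, executing them, and keeping a query whose result agrees with the majority. The sketch below shows that variant with Python's built-in `sqlite3`; it is an assumption about the general technique, not a description of DAIL-SQL's exact procedure, and the table and queries are invented.

```python
import sqlite3
from collections import Counter


def pick_by_execution_vote(conn, candidates):
    """Return the first candidate whose result set is the most common one."""
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # drop candidates that fail to execute
        # Sort rows so two queries with the same rows in a different order agree.
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    winner, _ = Counter(result for _, result in executed).most_common(1)[0]
    return next(sql for sql, result in executed if result == winner)


# Toy database and sampled candidates (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 41), ("Cy", 25)]
)
candidates = [
    "SELECT name FROM singer WHERE age > 28",
    "SELECT name FROM singer WHERE age >= 30",
    "SELECT name FROM singer WHERE age > 40",
    "SELECT nam FROM singer",  # invalid column, fails to execute
]
print(pick_by_execution_vote(conn, candidates))
```

Voting on execution results rather than on query text lets syntactically different but equivalent queries reinforce each other.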

Turn decoder-only LLMs into strong text encoders with three cheap steps

0.70

0.60

0.70

20

You can convert existing decoder-only LLMs into high-quality embedder models cheaply and fast (hours on one GPU) without labeled data, unlocking better retrieval and tagging with fewer resources than full retraining.

Key finding

LLM2Vec applied to Mistral-7B yields the top unsupervised MTEB score reported in the paper.

Numbers: 56.80 (MTEB avg-56, unsupervised, Mistral-7B)

Train a vision-language model to read and reason across many images in one prompt

0.60

0.70

0.50

18

If your product must reason over multiple images together (multi-photo chat, visual QA over albums, video snapshots), MMICL-style models reduce hallucinations and improve multi-image reasoning by adding explicit image tokens and multi-image instruction tuning.

Key finding

MMICL improves matching of captions to images on compositional image/text puzzles (Winoground).

Numbers: Text 45 / Image 45 / Group 43 (MMICL FLAN-T5-XXL, Table 2)

AlpaCare: fine-tuning LLaMA with a 52k machine-generated medical instruction dataset to improve medical and general instruction following

0.60

15

A small, diverse machine-generated medical instruction dataset can improve both medical answer quality and general instruction-following, offering a cost-effective way to build better clinical assistants while keeping development and data costs lower than large human-annotation efforts.

Key finding

AlpaCare gives large absolute gains on free-form medical instruction evaluation compared to prior baselines.

Numbers: up to 38.1% absolute gain (paper claim)

Teach an LLM to read graph structure with two-stage instruction tuning and a tiny alignment projector

0.50

0.60

15

GraphGPT enables LLMs to use graph structure with low-cost tuning, improving cross-dataset predictions and saving compute by using compact graph tokens instead of long text prompts.

Key finding

GraphGPT improves zero-shot transfer accuracy compared to base LLMs and GNNs on evaluated benchmarks.

Numbers: Arxiv-PubMed zero-shot: GraphGPT-7B-v1.5-std Acc=0.7011 vs vicuna-7B-v1.5 Acc=0.6351 (Δ=+0.066)

BinSum: a 557K-function benchmark showing when LLMs can (and cannot) summarize binary code

0.60

0.70

0.65

12

Automated binary summaries can speed reverse engineering and threat triage, but quality hinges on decompilation and symbol availability; investing in decompilers and symbol recovery yields bigger gains than swapping models.

Key finding

Stripping debugging symbols sharply reduces the semantic content recoverable from decompiled code.

Numbers: 55.0% drop in semantic similarity (0.449 -> 0.202)
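
The reported drop is a relative change, not a difference in points. A quick check of the arithmetic, using only the two values quoted above (the helper name is an illustrative choice):

```python
def relative_change(before, after):
    """Change from `before` to `after`, expressed as a fraction of `before`."""
    return (after - before) / before


print(f"{relative_change(0.449, 0.202):.1%}")  # -55.0%
```

The absolute difference is 0.247 similarity points, which is the figure to compare against when weighing other interventions on the same scale.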

Use an LLM (GPT-3.5) to warmstart, model, and sample for Bayesian optimization; improves early-stage hyperparameter tuning

0.60

0.70

0.40

11

LLAMBO can reduce the number of expensive evaluations in hyperparameter tuning by using an LLM for initial guesses, early surrogates, and targeted sampling; the trade-off is higher per-iteration compute and API cost in exchange for fewer total experiments.

Key finding

Zero-shot LLM warmstarting beats random initializations for HPO tasks.

Numbers: evaluated over 25 trials with 5 init points; improvement visible for trials < 5

Use natural-language instructions + LLM priors to steer multi‑agent RL toward human-friendly equilibria

0.60

0.30

10

You can steer multi-agent systems to human-friendly conventions without costly human behavior datasets; showing the agent's instruction to users sharply improves team performance and trust.

Key finding

In the Say-Select toy game, instructQ reliably converged to the intended human-like equilibrium.

Numbers: 10/10 random seeds converged to the instructed policy

Which prompt styles work best for zero-shot clinical NLP across GPT‑3.5, BARD, and LLAMA2

0.50

0.40

0.60

10

Prompt choice can cut labeling costs: a well-crafted zero-shot prompt often gets near supervised accuracy, reducing the need for costly annotations.

Key finding

Heuristic prompts gave the top zero-shot accuracy for clinical sense disambiguation.

Numbers: GPT‑3.5 heuristic accuracy = 0.96

Have LLMs 'think about their thinking' to boost understanding on NLU tasks

0.60

0.70

0.30

9

Metacognitive Prompting is a low‑cost way to improve model understanding on domain text (law, medicine) without retraining; expect modest average gains and larger wins on specialized datasets.

Key finding

MP gives a consistent aggregate performance uplift over CoT in zero‑shot settings.

Numbers: Relative boost 4.8%–6.4% vs CoT (zero‑shot, averaged across models)

Use LLMs to auto-generate hardware test inputs and recover coverage that random testing misses

0.55

0.65

0.45

9

LLM-driven stimulus generation can cut manual effort in hardware verification and replace inefficient random testing for many components, but it needs prompt tuning and careful model selection.

Key finding

For several modules, LLMs reached full (100%) coverage on the evaluated coverage plans.

Numbers: 100% coverage on Asynchronous FIFO & AMPLE Weight Bank (Table III)

Fine-tune LLaMA2 with context and video descriptions to improve emotion recognition in conversations

0.60

0.70

9

Fine-tuning an open 7B LLM with emotion and context data gives SOTA emotion detection while staying cheap to train, enabling faster builds of emotion-aware agents and analytics.

Key finding

DialogueLLM achieved state-of-the-art accuracy and F1 on three ERC benchmarks after emotion/context fine-tuning.

Numbers: MELD Acc 71.96%, F1 71.90; IEMOCAP Acc 70.62%, F1 69.93; EmoryNLP Acc 41.88%, F1 40.05

RoleLLM: a dataset and recipe to teach LLMs character-level role-playing

0.60

0.50

8

RoleBench and RoCIT let teams fine-tune open LLMs to mimic character voices and embed role facts, reducing dependence on costly closed-source APIs and long prompts.

Key finding

Context-Instruct substantially boosts role-specific knowledge (SPE metric).

Numbers: SPE: 21.4 -> 38.1

QDAIF: use LMs to both generate and judge to evolve diverse, high‑quality text

0.60

0.50

7

QDAIF automates generation plus subjective evaluation so teams can produce many distinct, human‑preferred text options without custom heuristics or expensive human labeling; useful for creative briefs, A/B content pools, and synthetic data generation.

Key finding

Text sets produced by QDAIF scored higher in human-assessed quality‑diversity than most baselines.

Numbers: Human QD score: QDAIF 0.772 vs Random-Search 0.606 (Table 1)

Fine-tuned Chinese LLM that answers mental-health Q&A using a CBT (therapeutic) response structure

0.30

0.50

0.20

6

Fine-tuning LLMs with therapy‑structured prompts creates more structured, CBT‑aligned replies for Chinese mental‑health Q&A; useful for building triage assistants and clinician support tools but not a replacement for professionals.

Key finding

Created a CBT QA dataset with 22,327 entries.

Numbers: 22,327 entries (Table 1)

Make LLMs more creative by running multi‑round role‑played discussions instead of single prompts

0.60

0.65

0.35

6

A structured multi‑agent, role‑played discussion can produce noticeably more original and detailed ideas than single prompts, useful for ideation, product concepts, and creative marketing at modest engineering cost.

Key finding

LLM Discussion increases originality on AUT compared to a single‑agent baseline.

Numbers: Originality mean 4.44 vs 3.47 (LLM eval, AUT, Table 2)
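
A multi-round, role-played discussion can be orchestrated with a simple loop in which each agent sees the transcript so far. The sketch below is a hypothetical skeleton, not the paper's framework: the prompt wording, the role names, and the `generate` callable (standing in for any LLM call) are all assumptions.

```python
def run_discussion(task, roles, rounds, generate):
    """Run `rounds` rounds in which every role speaks once per round.

    generate: callable taking a prompt string and returning a reply string.
    Returns the transcript as a list of (role, reply) pairs.
    """
    transcript = []
    for round_number in range(1, rounds + 1):
        for role in roles:
            history = "\n".join(f"{speaker}: {text}" for speaker, text in transcript)
            prompt = (
                f"You are {role}.\n"
                f"Task: {task}\n"
                f"Discussion so far:\n{history}\n"
                f"Round {round_number}: build on the ideas above and add new ones."
            )
            transcript.append((role, generate(prompt)))
    return transcript


def fake_generate(prompt):
    """Stub in place of a real model call, so the sketch runs offline."""
    return f"idea based on {len(prompt.splitlines())} lines of context"


log = run_discussion(
    "List unusual uses for a paperclip.",
    ["a sculptor", "a field engineer"],
    rounds=2,
    generate=fake_generate,
)
print(len(log))  # 4 turns: 2 roles x 2 rounds
```

Passing the model call in as a parameter keeps the loop testable and makes it easy to swap providers.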