Prompt Templates Papers — Parsed & Scored for Practitioners

Reprogram frozen LLMs to forecast time series using text prototypes and Prompt-as-Prefix

0.70

0.60

0.70

127

You can add time series forecasting to an existing LLM deployment with little extra training and often better accuracy in low-data and cross-domain cases.

Key finding

TIME-LLM achieves lower average long-term forecasting MSE than a fine-tuned LLM baseline (GPT4TS).

Numbers: ≈12% average MSE reduction vs GPT4TS on evaluated long-term benchmarks

Small prompt formatting changes can swing LLM accuracy by tens of points

0.60

40

Small, innocuous prompt formatting choices can produce large and unpredictable swings in LLM performance, which can mislead model selection, harm user experience, or produce fragile products unless you test multiple formats.

Key finding

Prompt formatting alone can shift accuracy by as much as 76 points on the same model.

Numbers: Max spread 76 accuracy points (LLaMA-2-13B)
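
One way to act on this finding is to score the same evaluation set under several prompt formats and report the spread instead of a single accuracy. The sketch below is a minimal illustration, not the paper's method: the templates in `FORMATS`, the `toy_model` stub, and the function names are all hypothetical, and in practice `model` would wrap a real LLM call.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical format variants: same content, different separators and casing.
FORMATS: Dict[str, str] = {
    "colon": "Question: {q}\nAnswer:",
    "dash": "QUESTION - {q}\nANSWER -",
    "bare": "{q}\n",
}


def accuracy_by_format(
    model: Callable[[str], str],
    examples: List[Tuple[str, str]],
    formats: Dict[str, str],
) -> Dict[str, float]:
    """Return accuracy (0-100) of `model` under each prompt format."""
    scores = {}
    for name, template in formats.items():
        correct = sum(
            model(template.format(q=q)).strip() == gold for q, gold in examples
        )
        scores[name] = 100.0 * correct / len(examples)
    return scores


def spread(scores: Dict[str, float]) -> float:
    """Max minus min accuracy across formats, in accuracy points."""
    return max(scores.values()) - min(scores.values())


def toy_model(prompt: str) -> str:
    # Stand-in for an LLM call: it only answers colon-style prompts correctly.
    return "4" if "Answer:" in prompt else "?"


examples = [("2 + 2 = ?", "4"), ("1 + 3 = ?", "4")]
scores = accuracy_by_format(toy_model, examples, FORMATS)
format_spread = spread(scores)
```

Reporting `format_spread` next to the headline accuracy makes it harder to pick a model on the strength of one lucky format.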

Fine-tune LLMs with map-context prompts to predict population and economic indicators

0.70

0.55

0.65

28

GeoLLM provides a low-cost geospatial signal from pretrained LLMs that can match or beat satellite nightlight baselines and work with hundreds to thousands of labels, making it useful where imagery is costly or missing.

Key finding

GeoLLM yields large gains over prompt-based and classic baselines on real geospatial tasks.

Numbers: 70% improvement in Pearson's r^2 vs nearest-neighbor/XGBoost baselines (paper claim)

Use a pre-trained LLM (GPT-3.5) as a zero-shot search operator and distill it into a white-box linear operator for MOEA/D

0.40

0.60

0.50

21

You can prototype new evolutionary operators with natural-language prompts and then distill them into cheap, explainable operators — reducing expert design time and cutting API cost after distillation.

Key finding

MOEA/D-LLM (GPT-3.5) produces competitive hypervolume (HV) on five real engineering RE instances.

Numbers: RE21 HV: 0.7936 vs MOEA/D 0.781 (Table I)

LLMs can generate SystemVerilog security assertions from natural-language comments

0.60

0.45

20

LLMs can speed drafting of hardware security checks by turning comments into assertions, cutting expert time, but outputs need automated validation and human review before deployment.

Key finding

LLMs can produce correct security assertions but quality varies widely.

Numbers: Average correctness 26.54% across 2,268 prompt types; best prompt 93.55%

Make LLMs think in program structures to improve code generation

0.70

0.60

0.50

17

SCoT is a low-cost change to prompts that raises real-world code suggestion quality and developer satisfaction, so product teams can get better auto-generated code without model retraining.

Key finding

SCoT increases strict correctness (Pass@1) on HumanEval vs CoT prompting.

Numbers: Pass@1 +13.79% relative (CoT 53.29 → SCoT 60.64, i.e. +7.35 points absolute)
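
The percentage and the two Pass@1 scores fit together only if the gain is read as relative to the CoT baseline. A short check, using only the figures quoted in this entry (the function name is illustrative):

```python
def relative_gain_pct(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return 100.0 * (improved - baseline) / baseline


cot_pass1 = 53.29   # CoT Pass@1 from the entry above
scot_pass1 = 60.64  # SCoT Pass@1 from the entry above

absolute_gain = scot_pass1 - cot_pass1                    # in Pass@1 points
relative_gain = relative_gain_pct(cot_pass1, scot_pass1)  # in percent of baseline
```

The same distinction applies to the other entries in this list that quote a percentage gain without saying whether it is relative or absolute.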

Use retrieved similar programs and generated test cases in prompts to boost code-generation accuracy

0.60

0.50

16

AceCoder raises the chance that a single generated program is correct without fine-tuning, so teams can get better automated code suggestions cheaply if they can supply or index similar existing code.

Key finding

AceCoder markedly increases strict execution accuracy (Pass@1) over few-shot prompting on public benchmarks.

Numbers: Pass@1 +56.4% (MBPP); +70.7% (MBJP); +88.4% (MBJSP)

AdaPlanner: LLM planner that adaptively refines code-style plans from environment feedback

0.60

0.70

13

AdaPlanner cuts dependence on large labeled datasets and repeated LLM calls by adaptively revising code-style plans, saving annotation and API cost while improving performance on long-horizon text tasks.

Key finding

AdaPlanner achieves 91.79% overall success on 134 ALFWorld tasks.

Numbers: Success rate 91.79% (ALFWorld Table 2).

RGPT: recurrent boosting of LLMs lifts text-classification by ~1% per benchmark

0.50

0.60

0.40

11

If you use LLMs for classification tasks, boosting+recurrent ensembling can usually add ~1% absolute accuracy—useful for high-stakes labeling or automation where small gains pay off, but expect higher compute and training cost.

Key finding

RGPT improves accuracy over strong baselines on four standard datasets.

Numbers: SST-2 +0.88%; AG News +1.21%; Ohsumed +1.47%; MR +1.88%

Which prompt styles work best for zero-shot clinical NLP across GPT‑3.5, BARD, and LLAMA2

0.50

0.40

0.60

10

Prompt choice can cut labeling costs: a well-crafted zero-shot prompt often gets near supervised accuracy, reducing the need for costly annotations.

Key finding

Heuristic prompts gave the top zero-shot accuracy for clinical sense disambiguation.

Numbers: GPT‑3.5 heuristic accuracy = 0.96

Fine-tune quantized LLMs on tokenized EHR histories to beat Med‑BERT and other baselines on diagnosis and readmission prediction

0.50

0.40

9

CPLLM shows you can repurpose public LLMs for EHR forecasting with no domain pretraining, achieving modest but consistent gains; this can speed deployment and lower data‑preparation costs.

Key finding

CPLLM-Llama2 outperforms baselines on adult respiratory failure prediction by PR-AUC.

Numbers: PR-AUC 35.962% vs 35.050% (LogReg), +0.912 percentage points absolute

XLT: a short, language-independent prompt template that boosts non‑English LLM performance

0.75

0.45

0.85

8

XLT is a low-cost way to lift non-English performance and narrow cross-language gaps without retraining models, making multilingual features cheaper and faster to deploy.

Key finding

XLT substantially improves arithmetic reasoning (MGSM) in zero-shot.

Numbers: text-davinci-003 MGSM zero-shot: 12.5 → 23.9 (+11.4)

Forecast future facts on temporal knowledge graphs using LLM in‑context learning with no fine‑tuning.

0.50

0.60

8

You can forecast structured future events from past facts using off‑the‑shelf LLMs without costly retraining, which speeds deployment and reduces model maintenance.

Key finding

Pretrained LLMs (ICL) reach near‑SOTA forecasting performance without fine‑tuning.

Numbers: LLM Hits@1 gap vs median supervised: -3.6% to +1.5%

Use LLMs (GPT-4 and local models) to match entities with far less labeled data and better robustness

0.70

0.40

0.60

8

LLMs can cut labeling needs and handle unseen entities better than fine-tuned PLMs, but costs, latency, and privacy tradeoffs matter; fine-tuning cheaper LLMs locally is a cost-effective alternative.

Key finding

GPT-4 achieves very strong zero-shot matching and often matches or beats fine-tuned PLMs

Numbers: GPT-4 average F1 86.80; >=89% F1 on 5 of 6 datasets (zero-shot)

RAIN: align frozen LLMs at inference by self-evaluation and token rewinding

0.60

0.70

0.60

7

RAIN lets you reduce harmful or untruthful outputs from deployed LLMs without costly retraining or human labels; trade latency for safety and consider using RAIN-generated data to finetune if latency is critical.

Key finding

RAIN raised harmlessness of LLaMA 30B from 82% to 97% on the HH dataset.

Numbers: 82% → 97%

LLM judges are prompt‑sensitive and internally noisy; here's an explainable toolkit to measure and de-noise them

0.60

0.50

0.40

6

Automated LLM judges can speed model comparisons but are prompt-sensitive and noisy; without testing and de-noising they can give misleading win rates.

Key finding

Prompt template choice strongly changes judge accuracy and bias.

Numbers: Acc_both varies by template; best 0.667, many below 0.2 in tests

Large LMs can act as dialog judges in few-shot settings — but training data and example choice change the result.

0.60

0.40

0.60

6

You can use large or instruction-tuned LMs as quick, scalable judges of dialog quality to reduce human labeling cost. But scores are sensitive to model type, training data, and prompt design, so blind deployment risks bad decisions.

Key finding

Instruction-tuned LLMs best match human dialog judgments in few-shot.

Numbers: InstructGPT (175B) dialog-level overall Spearman ≈ 0.69 on FED

SheetCopilot: turn natural language into step-by-step spreadsheet actions using LLMs

0.60

0.65

0.45

6

SheetCopilot lets non-technical users automate many spreadsheet tasks from plain-English requests, reducing manual work and mistakes. Critical data still needs human verification because full correctness is about 44% on evaluated tasks.

Key finding

High execution but moderate full correctness for GPT-3.5-Turbo with SheetCopilot.

Numbers: Exec@1 = 87.3%, Pass@1 = 44.3% (full 221 tasks)
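
To see what the gap between Exec@1 and Pass@1 means in practice, the rates can be converted back into approximate task counts. This is a rough sketch: the counts are reconstructed from the rounded percentages above, and it assumes every task that passes also executes, so the conditional rate is an estimate and not a figure from the paper.

```python
TOTAL_TASKS = 221   # size of the full task set quoted above
EXEC_AT_1 = 0.873   # share of tasks whose generated actions run without error
PASS_AT_1 = 0.443   # share of tasks whose result is fully correct

# Approximate counts recovered from the rounded rates.
executed = round(EXEC_AT_1 * TOTAL_TASKS)
passed = round(PASS_AT_1 * TOTAL_TASKS)

# Assumption: a passing task is always an executing task.
pass_given_exec = passed / executed
silent_failures = executed - passed  # ran cleanly but produced a wrong result
```

Under these assumptions roughly half of the runs that execute cleanly are still wrong, which is why a successful execution alone is a weak signal for critical data.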

PREFER: automatically grow and weight prompts by having an LLM reflect on its errors and refine prompts

0.60

0.65

0.70

6

PREFER automates prompt generation and ensemble weighting, yielding better few-shot accuracy and much lower API/time cost than heavy prompt-search methods.

Key finding

PREFER improves few-shot F1 across several NLI/classification tasks versus single-prompt baselines and prior prompt-ensemble/iterative methods.

Numbers: QNLI: 0.793 vs 0.720 (synonym ensemble); Liar: 0.744 vs 0.572 (synonym)
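
The idea of weighting prompts in an ensemble can be sketched with a generic weighted vote. The weighting rule below is the standard AdaBoost-style formula, chosen here as an assumption for illustration; it is not taken from the paper, and the error rates, labels, and function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List


def prompt_weight(error_rate: float) -> float:
    """AdaBoost-style weight: a lower dev-set error gives a larger weight."""
    eps = min(max(error_rate, 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
    return 0.5 * math.log((1 - eps) / eps)


def weighted_vote(predictions: List[str], weights: List[float]) -> str:
    """Return the label with the largest total weight across prompts."""
    totals: Dict[str, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical dev-set error rates for three prompts in the ensemble.
dev_errors = [0.10, 0.40, 0.45]
weights = [prompt_weight(e) for e in dev_errors]

# One accurate prompt outvotes two weak prompts that disagree with it.
label = weighted_vote(["entailment", "not_entailment", "not_entailment"], weights)
```

Compared with a plain majority vote, this lets a single reliable prompt override several near-random ones.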

T4D: a new test that asks LLMs to act on theory-of-mind; FaR prompting raises GPT‑4 from 50% to 71%.

0.40

0.60

0.40

6

If you build agents that must act on what people believe, test them on action-oriented ToM tasks; a structured 'foresee+reflect' prompt can improve decisions cheaply but needs guardrails against bad foresight.

Key finding

LLMs score near-perfect on ToMi (inference questions) but much worse on T4D (action choices).

Numbers: GPT-4: ToMi 93% vs T4D 50% (Table 1)

GPT‑4 can extract why patients switch contraceptives from clinical notes and reveal group-level disparities

0.60

0.40

0.50

5

Zero‑shot LLM extraction can unlock reasons for medication changes from notes quickly, lowering annotation costs and enabling equity and quality analyses across patient subgroups.

Key finding

GPT‑4 correctly extracted reasons for switching on manual review.

Numbers: Accuracy 91.4%; hallucination rate 2.2% (n=93)

CEBench: zero-code toolkit to benchmark LLM pipelines for cost vs. effectiveness trade-offs

0.60

0.40

0.70

5

CEBench helps teams choose LLM deployments that meet accuracy needs while controlling real costs by comparing local vs online models, RAG vs few-shot, and by recommending Pareto-efficient plans.

Key finding

Lightweight online model plus RAG yields very high accuracy at minimal cost.

Numbers: Haiku+RAG F1=0.9585; cost≈ $0.0003 per prompt
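
A Pareto-efficient plan is one that no other plan beats on both cost and effectiveness. The sketch below shows that filter in isolation; it is not CEBench's implementation. Only the `haiku_rag` figures come from this entry, and the other plans are made-up placeholders for illustration.

```python
from typing import List, NamedTuple


class Plan(NamedTuple):
    name: str
    cost: float  # dollars per prompt
    f1: float


def pareto_front(plans: List[Plan]) -> List[Plan]:
    """Keep plans that no other plan matches or beats on both cost and F1."""
    front = []
    for p in plans:
        dominated = any(
            q.cost <= p.cost
            and q.f1 >= p.f1
            and (q.cost < p.cost or q.f1 > p.f1)
            for q in plans
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda p: p.cost)


plans = [
    Plan("haiku_rag", 0.0003, 0.9585),  # figures quoted in this entry
    Plan("plan_a", 0.0100, 0.9400),     # hypothetical placeholder
    Plan("plan_b", 0.0001, 0.8000),     # hypothetical placeholder
    Plan("plan_c", 0.0200, 0.9700),     # hypothetical placeholder
]
front = pareto_front(plans)
front_names = [p.name for p in front]
```

Here `plan_a` drops out because `haiku_rag` is both cheaper and more accurate; the remaining plans are genuine trade-offs that a team would choose among based on its accuracy floor and budget.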

Small wording or punctuation changes in Japanese prompts can cut LLM accuracy by half; model and language data matter.

0.40

0.50

0.30

5

Small prompt wording or punctuation changes can drastically change accuracy on non-English tasks; companies must test prompt variants and consider light in-language adapters to avoid surprise drops in production.

Key finding

GPT-4 accuracy on the JNLI task varied from 25.44% to 49.21% across near-synonymous templates.

Numbers: range 25.44%–49.21% (SD 9.56)

Use ChatGPT to teach a smaller model to score answers and explain why

0.60

0.70

5

AERA turns expensive LLM reasoning into a deployable, smaller model that scores answers and explains decisions, lowering inference costs and improving explainability for education products.

Key finding

Distilled LongT5 (AERA) improves scoring agreement over ChatGPT on evaluated subsets.

Numbers: Overall QWK +11% vs ChatGPT (paper abstract; Table 1)