16 papers found

PromptCBLUE: convert CBLUE into a prompt-format, multi-task Chinese medical benchmark and report baselines

0.60
0.45
0.60
4

PromptCBLUE gives a practical, Chinese-language testbed for medical LLM products. It shows that inexpensive PEFT fine-tuning of open 13B models can beat few-shot API use, so companies can invest in targeted fine-tuning to improve medical features.

Key finding

Fine-tuned open-source 13B models outperform few-shot commercial APIs on PromptCBLUE.

Numbers: Baichuan-13B (LoRA fine-tuned) overall 0.71 vs GPT-4 few-shot 0.518

Tune lightweight prompts with counterfactual contrastive loss to reduce gender bias on downstream tasks

0.60
0.50
0.60
2

Co^2PT offers a low-cost way to reduce downstream gender bias: it freezes the main model, tunes small prompts, and avoids costly full-model retraining while lowering fairness gaps on real tasks.

Key finding

On Bias-STS-B, Co^2PT cuts the average absolute similarity-difference (Diff) from PT's 0.321 to 0.058.

Numbers: Diff: PT 0.321 -> Co^2PT 0.058 (Table 2)
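
The Diff figure is described above as an average absolute similarity difference. A minimal sketch of that kind of computation, assuming each test item yields one similarity score for each of two gender-swapped variants. The function name and toy scores are hypothetical and are not taken from the paper:

```python
def average_abs_diff(scores_a, scores_b):
    """Mean absolute gap between paired similarity scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if not scores_a:
        raise ValueError("need at least one pair of scores")
    return sum(abs(a - b) for a, b in zip(scores_a, scores_b)) / len(scores_a)


# Toy similarity scores, invented for illustration only.
variant_a = [3.0, 4.0, 2.5]
variant_b = [3.5, 4.0, 2.0]
diff = average_abs_diff(variant_a, variant_b)
```

A lower value means the model scores the two variants more alike, which is the direction the reported drop from 0.321 to 0.058 points in.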

Make transformer teachers teach CNN students better by aligning receptive fields and adding prompts

0.60
0.60
0.50
1

If you run face recognition on mobile or edge devices, distilling a high-performing Transformer into a CNN can boost verification accuracy substantially while keeping hardware-friendly inference.

Key finding

Cross-architecture KD with URFM+APT substantially improves large-scale verification.

Numbers: IJB-C TPR@FPR=1e-4: 94.4 (Ours) vs 89.13 (student baseline), +5.27

Cut off top layers: keep or improve classification accuracy while cutting model size by >80%

0.60
0.60
0.80
1

You can cut LLM layers to dramatically shrink model size and lower hosting and fine-tuning costs while keeping or improving classification accuracy on many few-shot tasks.

Key finding

GPT-2 XL cut from 48 to 2 layers improves average accuracy over the full model under prompt-based fine-tuning.

Numbers: 48-layer avg 77.04% → 2-layer avg 80.23% (Table II)
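
Cutting off top layers can be pictured as slicing a layer stack and running only the lower part. The sketch below is a toy illustration under that assumption: the layers are stand-in functions, not GPT-2 blocks, and the helper names are hypothetical. It does not reproduce the paper's fine-tuning setup.

```python
def truncate_layers(layers, keep):
    """Keep the bottom `keep` layers and drop everything above them."""
    if not 1 <= keep <= len(layers):
        raise ValueError("keep must be between 1 and the number of layers")
    return layers[:keep]


def forward(layers, x):
    """Apply each layer in order, bottom to top."""
    for layer in layers:
        x = layer(x)
    return x


# Stand-in "layers": layer i just adds i to its input.
full_stack = [lambda x, i=i: x + i for i in range(48)]
small_stack = truncate_layers(full_stack, keep=2)

kept_fraction = len(small_stack) / len(full_stack)
output = forward(small_stack, 0)
```

The fraction of layers kept is not the same as the fraction of parameters kept, since embeddings and any task head sit outside the layer stack.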

Prompt-tuning GatorTronGPT-20B gives efficient, higher-scoring clinical dialogue summaries

0.60
0.50
0.70
1

Prompt tuning lets teams deploy clinical summarization with much lower compute and faster turnaround than full fine-tuning while often improving quality if you have a large domain LLM.

Key finding

Prompt-tuned GatorTronGPT-20B outperformed fine-tuned T5-Large on automatic metrics.

Numbers: Rouge-1 0.3628 vs 0.3425; BERTScore 0.7309 vs 0.6765 (Table 4)

Soft prompts + frozen large LLMs are parameter‑efficient and better for cross‑site and few‑shot clinical extraction.

0.70
0.45
0.65
1

Tuning only soft prompts on a frozen billion‑scale clinical LLM cuts compute and deployment costs while keeping or improving cross‑site and few‑shot extraction accuracy.

Key finding

Soft prompting with an unfrozen GatorTron-3.9B gave the best concept extraction on drug‑ADE.

Numbers: strict F1 = 0.9118
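
For the frozen-model setup mentioned in the summary, the core idea of soft prompting is that a few trainable vectors are prepended to the input embeddings while the model's own weights stay fixed. A minimal sketch in plain Python, assuming list-of-lists embeddings. The sizes and helper names are hypothetical and unrelated to GatorTron's actual dimensions:

```python
import random


def make_soft_prompt(num_tokens, dim, seed=0):
    """Create small random prompt vectors; these are the only trainable values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_tokens)]


def prepend_prompt(soft_prompt, token_embeddings):
    """Place the soft prompt in front of the (frozen) token embeddings."""
    if soft_prompt and token_embeddings:
        if len(soft_prompt[0]) != len(token_embeddings[0]):
            raise ValueError("prompt and token embeddings must share a dimension")
    return soft_prompt + token_embeddings


# Toy input: 5 tokens with 4-dimensional embeddings from a frozen model.
frozen_embeddings = [[0.0] * 4 for _ in range(5)]
soft_prompt = make_soft_prompt(num_tokens=3, dim=4)

model_input = prepend_prompt(soft_prompt, frozen_embeddings)
trainable_values = sum(len(row) for row in soft_prompt)
```

Only `trainable_values` numbers would receive gradient updates in this picture, which is where the compute and deployment savings come from.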

RadioLLM: use LLMs for radio tasks via hybrid prompts and token reprogramming

0.60
0.60
0.50
0

RadioLLM lets you reuse LLM priors for multiple radio tasks, improving classification and denoising while cutting prompt overhead and latency in many benchmark scenarios.

Key finding

RadioLLM beats many baselines on modulation classification.

Numbers: OA: 58.10% (RML16A), 58.35% (RML16B), 68.19% (RML16C)

Steer a frozen language model toward more commonsense using a small auxiliary head and a reference-free scorer

0.60
0.60
0.70
0

BOOST improves generation commonsense without fine-tuning large models, so teams can upgrade deployed LMs cheaply by adding a small controller and a scorer.

Key finding

Reference-free O-Scorer correlates with human commonsense ratings and matches top reference-based metrics.

Numbers: O-Score (mean) T5: 0.284, GPT-3.5: 0.299, Gold: 0.365; BERTScore-all: 0.302

Use an LLM's own evaluation gradients to steer its outputs at inference, then compress those gradients into a fast prefix controller

0.60
0.70
0.60
0

You can steer deployed LLMs at inference without costly human labels or weight updates; train a small prefix once to get near-zero runtime cost and plug it into existing models to enforce safety or tone constraints.

Key finding

SELFCONTROL can fully eliminate email leakage on the evaluated privacy benchmark.

Numbers: Privacy dataset: '✓ Email' 58 → 0, '✓ Domain' 99 → 0 (Table 3)

Mix CLIP multimodal features with prompt tuning to detect fake news with few labels

0.60
0.60
0.65
0

SAMPLE helps detect multimodal fake news with far fewer labels and far fewer trainable parameters than full fine-tuning, lowering data and compute costs for real-world monitoring systems.

Key finding

Mixed prompting (M-SAMPLE) gives clear few-shot gains over standard fine-tuning.

Numbers: avg F1 +0.05 vs FT-RoBERTa (few-shot)

STIG: encode multi-stage introduction logic as stage tokens so an LLM writes an entire Introduction in one inference

0.60
0.60
0.60
0

Convert multi-call agent pipelines into a single finetuned model with stage tokens to reduce API calls, cut token costs, and produce more structurally coherent section drafts for academic-style content.

Key finding

STIG raises structural rationality (SR) substantially versus agentic baselines.

Numbers: SR 0.832 (STIG) vs 0.658 (AutoSurvey) on Qwen2.5-7B

Inject low-rank, input-dependent prompts into aggregated features to recover accuracy of low-bit quantized GNNs

0.70
0.60
0.70
0

LoRAP lets teams deploy low-bit GNNs with much of the original accuracy retained while keeping memory and speed gains; it is a small, trainable add-on compatible with existing QAT pipelines.

Key finding

GPF-LoRAP can recover severe INT4 accuracy losses on small benchmarks.

Numbers: REDDIT-BINARY, QAT-W4A4: +17.2% acc

Make one LLM-based user encoder serve many business scenarios by anchoring user profiles with queries and tiny soft prompts

0.80
0.45
0.80
0

One universal encoding plus small scenario prompts reduces model sprawl, lowers serving cost via KV-cache reuse, and improves live business metrics, so businesses can swap many specialized models for a single adaptable pipeline.

Key finding

Prompt-tuned Q-Anchor yields state-of-the-art discriminative and ranking performance on 10 real Alipay tasks.

Numbers: Avg AUC 0.8225; Avg KS 0.5267 (Table 2, C.1)
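
AUC and KS are both computed from labels and model scores. The sketch below assumes KS means the Kolmogorov-Smirnov separation (the largest gap between true-positive and false-positive rates over all thresholds), which is the usual reading in risk scoring but is not defined in the text above. The data is invented for illustration.

```python
def split_scores(labels, scores):
    """Separate scores of positive (1) and negative (0) examples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos, neg = split_scores(labels, scores)
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ks_statistic(labels, scores):
    """Largest gap between true-positive rate and false-positive rate over thresholds."""
    pos, neg = split_scores(labels, scores)
    best = 0.0
    for threshold in sorted(set(scores)):
        tpr = sum(s >= threshold for s in pos) / len(pos)
        fpr = sum(s >= threshold for s in neg) / len(neg)
        best = max(best, abs(tpr - fpr))
    return best


# Toy labels and scores, invented for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(labels, scores)
toy_ks = ks_statistic(labels, scores)
```

The pairwise AUC loop is quadratic in the number of examples, which is fine for a sketch but too slow for production-scale evaluation.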

Steer logits at decode time to get fine-tuning-like gains without extra training

0.70
0.60
0.70
0

SVDecode can raise model accuracy or truthfulness by a few percentage points with no extra inference memory and little engineering effort, speeding deployment and reducing compute for task adaptation.

Key finding

SVDecode improves multiple-choice accuracy when combined with PEFT.

Numbers: Example: Qwen2.5-7B Prompt Tuning: 45.49% → 50.29% (+4.8 pp)
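
Steering logits at decode time can be pictured as adding a fixed vector to the model's logits before the next token is picked. The sketch below assumes that simple additive form. How SVDecode actually derives or scales its vector is not stated in the text above, so `steering_vector` and `strength` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(logits, steering_vector, strength=1.0):
    """Shift each logit by a scaled steering value; no model weights change."""
    if len(logits) != len(steering_vector):
        raise ValueError("logits and steering vector must have the same length")
    return [l + strength * s for l, s in zip(logits, steering_vector)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy three-option example, invented for illustration only.
base_logits = [2.0, 1.5, 0.1]
steering_vector = [0.0, 1.0, 0.0]

steered_logits = steer_logits(base_logits, steering_vector, strength=1.0)
base_choice = argmax(softmax(base_logits))
steered_choice = argmax(softmax(steered_logits))
```

In this toy case the shift moves the top choice from option 0 to option 1 without any training step, which mirrors the "no extra training" framing of the summary.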

Use soft prompts and LLM label transfer to scale real-time in-game toxicity detection across games and languages

0.70
0.50
0.70
0

Switching to a single soft-prompted model cuts model count and maintenance while keeping similar accuracy; using LLM-assisted label transfer lowers non-English annotation costs and speeds roll-out across languages.

Key finding

A single soft-prompted model matches multi-step curriculum learning on game-level detection.

Numbers: Soft prompting overall macro F1 = 43.16% vs curriculum best 43.35% (Table 1)
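
Macro F1 averages the per-class F1 scores with equal weight, so a rare class such as toxic chat counts as much as the common one. A minimal sketch of the metric, with toy labels invented for illustration (the class names are hypothetical):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, invented for illustration only.
y_true = ["toxic", "clean", "clean", "toxic"]
y_pred = ["toxic", "clean", "toxic", "toxic"]
score = macro_f1(y_true, y_pred)
```

Because every class is weighted equally, a 0.19-point gap like the one reported above can come from a small change on a single minority class.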

Systematic review and 11-class taxonomy of 45 prompt optimization methods, datasets, and model gaps

0.40
0.35
0.45
0

Optimizing prompts often improves model outputs without costly retraining; however, inconsistent evaluations hide how well methods generalize, so businesses should validate prompt methods on their own balanced data before production.

Key finding

There are 45 distinct prompt optimization strategies covered by this review.

Numbers: 45 methods (reviewed)