GRAD-SUM: summarize model feedback to automatically produce generalizable prompts

July 12, 2024 · 6 min

Overview

Production Readiness

0.6

Novelty Score

0.5

Cost Impact Score

0.6

Citation Count

0

Authors

Derek Austin, Elliott Chartock

Why It Matters For Business

Automates prompt tuning for black-box LLMs, cutting manual time and raising held-out performance by ~14% on tested tasks.

Summary TLDR

GRAD-SUM is an automated prompt-optimization loop that collects natural-language "gradients" (LLM-written critiques), summarizes them into one general critique, and edits prompts to address the summary. It uses user-provided task descriptions and LLM-as-a-judge evaluation criteria so it works with tasks that lack exact answers. On common benchmarks the authors report an average improvement of ~14% from initial prompts and a ~5% benefit from the gradient-summarization step versus using raw gradients. The method is implemented with GPT-3.5 for evaluation/gradients and GPT-4o for prompt editing; datasets and example prompts are in the appendix.

Problem Statement

Prompt engineering for large language models is manual and slow. Existing automatic methods require task-specific setups, are expensive (Monte Carlo search), or produce prompts that don't generalize. The paper asks: can prompt search for black-box LLMs be automated with cheaper, generalizable feedback?

Main Contribution

GRAD-SUM: a feedback-driven prompt optimizer that summarizes multiple natural-language gradients into one general critique.

A 5-module loop: generation, evaluation (LLM-as-a-judge), gradient generation, gradient summarization, and prompt editing.

Support for user task descriptions and LLM-as-a-judge criteria so the method works with tasks that lack exact answers.

Empirical comparison showing consistent gains vs the DSPY optimizer and an ablation showing summarization adds value.
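The five modules listed above can be sketched as a single iteration in Python. This is a minimal illustration, not the authors' implementation: the function name `grad_sum_step`, the `LLM` callable type, and all prompt wording are assumptions made for this example, and each callable stands in for whatever client wraps the underlying model.

```python
from typing import Callable, List, Tuple

# Hypothetical type: any function that maps a text request to a text reply.
LLM = Callable[[str], str]


def grad_sum_step(
    prompt: str,
    task_description: str,
    criteria: str,
    train_inputs: List[str],
    generate: LLM,
    judge: LLM,
    critic: LLM,
    summarizer: LLM,
    editor: LLM,
) -> Tuple[str, float]:
    """Run one generate/evaluate/critique/summarize/edit pass.

    Returns the (possibly edited) prompt and the pass rate of the input prompt.
    """
    if not train_inputs:
        raise ValueError("train_inputs must not be empty")

    gradients: List[str] = []
    passed = 0
    for item in train_inputs:
        # 1. Generation: run the candidate prompt on one training input.
        output = generate(f"{prompt}\n\nInput: {item}")

        # 2. Evaluation: binary LLM-as-a-judge verdict against user criteria.
        verdict = judge(
            f"Task: {task_description}\nCriteria: {criteria}\n"
            f"Input: {item}\nOutput: {output}\nAnswer PASS or FAIL."
        )
        if verdict.strip().upper().startswith("PASS"):
            passed += 1
            continue

        # 3. Gradient generation: a natural-language critique of the failure.
        gradients.append(
            critic(
                f"Prompt: {prompt}\nInput: {item}\nOutput: {output}\n"
                "Explain how the prompt could be changed to fix this output."
            )
        )

    pass_rate = passed / len(train_inputs)
    if not gradients:
        return prompt, pass_rate  # nothing failed, so there is nothing to edit

    # 4. Gradient summarization: merge all critiques into one general critique.
    summary = summarizer(
        "Summarize these critiques into one general critique:\n"
        + "\n".join(f"- {g}" for g in gradients)
    )

    # 5. Prompt editing: rewrite the prompt to address the summary.
    new_prompt = editor(
        f"Current prompt:\n{prompt}\n\nCritique:\n{summary}\n\n"
        "Return an improved prompt."
    )
    return new_prompt, pass_rate
```

In this sketch the editor sees only the summary, never the individual critiques, which is the property the summary credits for better generalization. It also needs one summarizer call and one editor call per iteration regardless of how many examples failed.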

Key Findings

Average improvement over initial prompts across tested datasets

Numbers: avg +14% final validation rating

Gradient summarization improves generalization compared to using raw gradients

Numbers: avg +5% final validation rating (with summarization)

GRAD-SUM outperforms DSPY on the evaluated datasets

Numbers: per-dataset final ratings higher for GRAD-SUM (see Table 2)

Results

Final Validation Rating (GSM8K)

Value: GRAD-SUM 0.82

Baseline: DSPY 0.755

Final Validation Rating (Orca Math)

Value: GRAD-SUM 0.575

Baseline: DSPY 0.455

Final Validation Rating (Neural Bridge RAG)

Value: GRAD-SUM 0.915

Baseline: DSPY 0.885

Final Validation Rating (HellaSwag)

Value: GRAD-SUM 0.795

Baseline: DSPY 0.48

Final Validation Rating (MT & Vicuna Bench)

Value: GRAD-SUM 0.95

Baseline: DSPY 0.823
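To make the per-dataset comparison easier to read, the short script below computes the absolute gap between GRAD-SUM and DSPY from the ratings listed above. The gaps and their mean are derived here for illustration and are not figures reported by the authors; they also differ from the ~14% figure, which measures improvement over initial prompts rather than over DSPY.

```python
# Final validation ratings as (GRAD-SUM, DSPY), copied from the results above.
ratings = {
    "GSM8K": (0.82, 0.755),
    "Orca Math": (0.575, 0.455),
    "Neural Bridge RAG": (0.915, 0.885),
    "HellaSwag": (0.795, 0.48),
    "MT & Vicuna Bench": (0.95, 0.823),
}

# Absolute difference in rating per dataset, rounded to hide float noise.
gaps = {name: round(gs - dspy, 3) for name, (gs, dspy) in ratings.items()}

# Unweighted mean of the five gaps.
mean_gap = round(sum(gaps.values()) / len(gaps), 4)

for name, gap in gaps.items():
    print(f"{name}: +{gap:.3f}")
print(f"mean gap: +{mean_gap:.4f}")
```

The gap is largest on HellaSwag (0.315) and smallest on Neural Bridge RAG (0.03), so the unweighted mean of about 0.13 is driven heavily by a single dataset.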

What To Try In 7 Days

Define clear task descriptions and a binary LLM-as-a-judge criterion for one production use case.

Run a GRAD-SUM-style loop on 30 train / 200 val samples: generate, evaluate, collect failing outputs, produce gradients, summarize, then edit prompts.

Compare final prompts to your current prompts and to a DSPY baseline using the same evaluation metric and 200-sample validation.
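The comparison step above only works if every prompt is scored on the same held-out samples with the same judge. A minimal harness for that might look like the following. The helper names `split_samples` and `validation_rating` and the callable signatures are assumptions for this sketch; in practice `generate` and `judge_passes` would wrap real model calls.

```python
import random
from typing import Callable, List, Sequence, Tuple


def split_samples(
    samples: Sequence[str], n_train: int = 30, n_val: int = 200, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle once with a fixed seed, then take disjoint train and val sets."""
    if len(samples) < n_train + n_val:
        raise ValueError("not enough samples for the requested split")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_val]


def validation_rating(
    prompt: str,
    val_inputs: Sequence[str],
    generate: Callable[[str, str], str],       # (prompt, input) -> output
    judge_passes: Callable[[str, str], bool],  # (input, output) -> pass/fail
) -> float:
    """Fraction of validation inputs whose output the binary judge accepts."""
    if not val_inputs:
        raise ValueError("val_inputs must not be empty")
    hits = sum(1 for x in val_inputs if judge_passes(x, generate(prompt, x)))
    return hits / len(val_inputs)
```

Calling `validation_rating` once per candidate (current prompt, optimized prompt, DSPY baseline) with the same `val_inputs` and `judge_passes` keeps the three numbers directly comparable.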

Optimization Features

Model Optimization

  • discrete prompt search via iterative edits

System Optimization

  • beam management with Upper Confidence Bound selection
  • summarizing multiple critiques to reduce overfitting and API calls
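As an illustration of Upper Confidence Bound selection over a beam of candidate prompts, the sketch below uses the textbook UCB1-style score (mean reward plus an exploration bonus). The summary does not state which UCB variant or exploration constant the authors use, so the function name `ucb_select`, the `exploration` parameter, and the data layout are assumptions.

```python
import math
from typing import Dict, List


def ucb_select(stats: Dict[str, List[int]], exploration: float = 1.0) -> str:
    """Pick the prompt with the highest UCB1-style score.

    `stats` maps each candidate prompt to its binary judge outcomes so far
    (1 = pass, 0 = fail). Untried prompts are selected first.
    """
    if not stats:
        raise ValueError("stats must contain at least one candidate")

    total = sum(len(outcomes) for outcomes in stats.values())
    best_prompt = None
    best_score = -math.inf
    for prompt, outcomes in stats.items():
        if not outcomes:
            return prompt  # every candidate gets evaluated at least once
        mean = sum(outcomes) / len(outcomes)
        # Bonus shrinks as a candidate accumulates evaluations.
        bonus = exploration * math.sqrt(math.log(total) / len(outcomes))
        score = mean + bonus
        if score > best_score:
            best_prompt, best_score = prompt, score
    return best_prompt
```

With `exploration=0` this reduces to picking the prompt with the highest observed pass rate. Larger values spend more of the evaluation budget on prompts that have been tried only a few times.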

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Only supports LLM-as-a-judge metrics; other numeric or domain metrics are not directly supported.
  • Evaluation depends on the evaluator LLM and may reflect its biases.
  • Method requires the user to craft task descriptions and evaluation criteria, which affect results.

When Not To Use

  • When you need strict numeric or domain-specific metrics not expressible as LLM judgments.
  • If you cannot run iterative API calls or need zero extra inference cost.
  • When evaluator LLMs are known to be biased for your domain and you lack a trusted judge.

Failure Modes

  • Editing on single-example gradients creates prompts that overfit (authors observed this without summarization).
  • Biased evaluator judgments lead to optimizing prompts for the evaluator, not the true user goal.
  • Small training samples can produce noisy gradients and unstable edits.

Core Entities

Models

  • gpt-3.5-turbo-0125
  • gpt-4o-2024-05-13

Metrics

  • final validation rating (binary match judged by LLM-as-a-judge)
  • average validation improvement

Datasets

  • GSM8K
  • Orca Math
  • Neural Bridge RAG
  • HellaSwag
  • HotPotQA
  • MMLU
  • MT & Vicuna Bench

Benchmarks

  • GSM8K
  • HotPotQA
  • HellaSwag
  • MMLU
  • MT-Bench/Vicuna