LLM4Vuln + UniVul: separate an LLM's reasoning from retrieved knowledge, context, and prompts to measure real vulnerability-detection skill

January 29, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.7

Cost Impact Score

0.5

Citation Count

17

Authors

Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Yang Liu, Yingjiu Li

Why It Matters For Business

LLM4Vuln helps teams know whether an LLM truly reasons about vulnerabilities or just repeats retrieved knowledge; this prevents wasted engineering on useless retrievals and guides model+tool choices for auditing code and triage.

Summary TLDR

This paper introduces LLM4Vuln, a modular evaluation framework, and UniVul, a benchmark, which together isolate an LLM's core vulnerability reasoning from external aids: retrievable vulnerability knowledge, surrounding code context, and prompt schemes. The authors build a retrievable knowledge base and 294 test functions across Solidity, Java, and C/C++, run 6 LLMs in 3,528 controlled scenarios, and measure marginal gains from each enhancement. Key findings: knowledge retrieval helps foundation models on logic-heavy Solidity (F1 roughly doubles on evaluated cases), context helps inconsistently, chain-of-thought (CoT) reduces false positives, and deep-reasoning models often need less external knowledge.

Problem Statement

Automatic LLM-based vulnerability detection mixes three things: what the model intrinsically reasons about, what it knows from pretraining, and external aids like retrieved vulnerability reports, extra code context, and prompt tricks. That entanglement makes it hard to measure true reasoning skill. LLM4Vuln decouples those factors so you can quantify the marginal effect of each enhancement on vulnerability decisions.
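
One way to read "marginal effect" is as a paired comparison: hold every factor fixed except one, then average the change in score. The sketch below illustrates that idea only. The configuration keys, the function name `marginal_effect`, and all F1 values are hypothetical placeholders, not figures or code from the paper.

```python
from statistics import mean

# Hypothetical F1 scores keyed by (knowledge, context, prompt).
# These numbers are placeholders for illustration, NOT results from the paper.
results = {
    ("none", False, "raw"): 0.20,
    ("none", False, "cot"): 0.24,
    ("none", True, "raw"): 0.22,
    ("none", True, "cot"): 0.26,
    ("retrieved", False, "raw"): 0.40,
    ("retrieved", False, "cot"): 0.46,
    ("retrieved", True, "raw"): 0.42,
    ("retrieved", True, "cot"): 0.50,
}


def marginal_effect(results, axis, baseline, treatment):
    """Average score change when only position `axis` switches from baseline to treatment."""
    deltas = []
    for config, score in results.items():
        if config[axis] != baseline:
            continue
        # Build the twin configuration that differs in exactly one factor.
        twin = list(config)
        twin[axis] = treatment
        twin = tuple(twin)
        if twin in results:
            deltas.append(results[twin] - score)
    return mean(deltas)


knowledge_gain = marginal_effect(results, 0, "none", "retrieved")
context_gain = marginal_effect(results, 1, False, True)
prompt_gain = marginal_effect(results, 2, "raw", "cot")
```

Pairing configurations this way keeps the other two factors from leaking into the estimate, which is the point of decoupling them.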

Main Contribution

LLM4Vuln: a modular framework that separates vulnerability reasoning from retrieval, context, and prompts.

UniVul: a benchmark with retrievable knowledge and context-supplementable code in Solidity, Java, and C/C++.

Large-scale controlled study: 6 LLMs evaluated on 294 samples in 3,528 scenarios (12 knowledge/context/prompt settings per sample), measuring knowledge, context, and prompt effects.

Real-world validation: discovery of 14 confirmed zero-day Solidity bugs, $3,576 in bounties.
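
The scenario count follows from simple arithmetic: 3,528 scenarios over 294 samples means 12 settings per sample. The sketch below enumerates such a grid. The level names (three knowledge settings, two context settings, two prompt schemes) are an assumption chosen to produce 12 settings; this summary does not spell them out, nor does it say whether the 3,528 figure is per model or in total.

```python
from itertools import product

# Assumed axis levels (hypothetical names); only their product, 12, is implied by the summary.
KNOWLEDGE = ["none", "original_report", "summarized"]
CONTEXT = [False, True]
PROMPT = ["raw", "cot"]
NUM_SAMPLES = 294  # stated in the summary

# Every combination of knowledge, context, and prompt for one code sample.
settings = list(product(KNOWLEDGE, CONTEXT, PROMPT))

# One scenario per (sample, setting) pair.
scenarios = [(sample_id, *setting) for sample_id in range(NUM_SAMPLES) for setting in settings]

print(len(settings), len(scenarios))  # 12 3528
```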

Key Findings

Knowledge retrieval helps foundation models on logic-heavy Solidity but not uniformly elsewhere.

Numbers: F1 for traditional foundation models on Solidity nearly doubled on average with knowledge

External knowledge often harms or yields little benefit for deep reasoning models.

Numbers: Most deep-reasoning model combinations saw drops in TP/recall when knowledge was supplied (across languages)

Context supplementation yields small and inconsistent gains.

Numbers: Solidity: 22 of 36 model-prompt-knowledge combos increased TP with context; other languages show mixed patterns

Chain-of-thought (CoT) prompts reduce false positives and increase true negatives more reliably than they increase recall.

Numbers: CoT reduced FPs and increased TNs in ~12 of 18 model combinations on Solidity and Java

Framework found real bugs: 14 confirmed zero-day vulnerabilities in four Solidity projects.

Numbers: 29 issues submitted, 14 confirmed, total bounties $3,576

Automated annotation by GPT-4.1 is reliable for binary vulnerability labels and acceptable for type labels.

Numbers: Binary label accuracy 100%; vulnerability-type accuracy 81% (Solidity), 98% (Java), 97% (C/C++) on sampled checks

Knowledge retrieval quality varies by language and retrieval alignment.

Numbers: Solidity: 68/100 sampled positive cases had aligned knowledge; Java: 36/100 relevant retrievals

Results

Evaluation scenarios

Value: 3,528 controlled scenarios (294 code samples × knowledge/context/prompt combos)

Models tested

Value: 6 representative LLMs

Zero-day vulnerabilities found

Value: 14 confirmed bugs

Bug bounty reward

Value: $3,576 total

Binary label accuracy (GPT-4.1 auto-annotation)

Value: 100% on sampled checks

Vulnerability-type accuracy (GPT-4.1 auto-annotation)

Value: 81% (Solidity), 98% (Java), 97% (C/C++)

Who Should Care

What To Try In 7 Days

Run the UniVul Solidity baseline to see your model's out-of-box reasoning.

Add a small retrieval DB (top-3 FAISS) and compare precision/recall; expect gains for logic-heavy contracts.

Try chain-of-thought prompts when false positives are costly; measure FP/TN shifts not just accuracy.
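
A minimal sketch of the retrieval and prompt steps above, for experimentation only. To stay self-contained it swaps in a toy bag-of-words cosine similarity in place of FAISS and a learned embedding model; the function names (`embed`, `top_k`, `build_prompt`), the prompt wording, and the sample knowledge entries are all hypothetical and not taken from LLM4Vuln.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a real setup would use a learned embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query, knowledge_base, k=3):
    """Return the k knowledge entries most similar to the query (stand-in for a vector DB search)."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(code, knowledge, use_cot):
    """Assemble a detection prompt; knowledge and chain-of-thought are optional add-ons."""
    parts = []
    if knowledge:
        parts.append("Possibly relevant vulnerability knowledge:\n" + "\n".join(f"- {k}" for k in knowledge))
    parts.append("Code under review:\n" + code)
    if use_cot:
        parts.append("Reason step by step about what the code does before answering.")
    parts.append("Is this code vulnerable? Answer yes or no, with a reason.")
    return "\n\n".join(parts)
```

Keeping knowledge and CoT as independent switches makes it easy to run the same sample under each setting and compare FP and TN counts rather than accuracy alone.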

Agent Features

Tool Use

  • function calling APIs
  • vector DB retrieval

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Retrieval quality varies: Solidity retrieval aligned ~68% on sampled positives; Java only ~36% (§7.2).
  • Context can distract models; benefits are inconsistent across languages and models (§6.2).
  • Auto-annotation of vulnerability types is imperfect (81% accuracy for Solidity type labels).
  • Sanitization reduces but may not eliminate pretraining leakage for some models.

When Not To Use

  • For large monolithic C/C++/Java codebases where static analysis or fuzzing is already required to narrow targets.
  • When retrieval quality cannot be measured or you lack curated domain reports.

Failure Modes

  • Mismatched retrieval can mislead a model to focus on the wrong vulnerability and reduce recall.
  • Models may assume vulnerability presence from prompt content, increasing false positives.
  • Deep reasoning models can be harmed by noisy external knowledge, lowering TP/recall.

Core Entities

Models

  • GPT-4.1
  • Phi-3-mini-128k
  • Llama-3-8B
  • o4-mini
  • DeepSeek-R1
  • QwQ-32B

Metrics

  • Precision
  • Recall
  • F1
  • TP
  • TN
  • FP
  • FN
  • FP-type
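
For reference, the count-based metrics above relate as in the sketch below. The handling of FP-type is an assumption: here a detection that flags the wrong vulnerability type is counted against both precision and recall, which is one reasonable convention but is not confirmed by this summary. The function name `detection_metrics` is hypothetical.

```python
def detection_metrics(tp, fp, fn, fp_type=0):
    """Precision, recall, and F1 from confusion counts.

    fp_type: vulnerable cases flagged with the wrong vulnerability type.
    Assumption: these count as errors in both precision and recall.
    """
    flagged = tp + fp + fp_type          # everything the model called vulnerable
    truly_vulnerable = tp + fn + fp_type  # everything that really was vulnerable
    precision = tp / flagged if flagged else 0.0
    recall = tp / truly_vulnerable if truly_vulnerable else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only, not results from the paper.
plain = detection_metrics(tp=30, fp=10, fn=10)
with_type_errors = detection_metrics(tp=30, fp=10, fn=10, fp_type=10)
```

TN does not enter precision, recall, or F1, which is why FP and TN shifts are worth tracking separately when comparing prompt schemes.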

Datasets

  • UniVul (Knowledge sets + Testing sets)
  • Code4Rena (Solidity reports)
  • CWE (Java/C/C++)
  • CVE / BigVul (test samples)

Benchmarks

  • UniVul