Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.5
Citation Count
17
Why It Matters For Business
LLM4Vuln helps teams determine whether an LLM actually reasons about vulnerabilities or merely repeats retrieved knowledge. That avoids wasted engineering effort on retrieval that adds nothing, and it guides model and tool choices for code auditing and triage.
Summary TLDR
This paper introduces LLM4Vuln, a modular evaluation framework and UniVul benchmark that isolate an LLM's core vulnerability reasoning from external aids: retrievable vulnerability knowledge, surrounding code context, and prompt schemes. The authors build a retrievable knowledge base and 294 test functions across Solidity, Java, and C/C++, run 6 LLMs in 3,528 controlled scenarios, and measure marginal gains from each enhancement. Key findings: knowledge retrieval helps foundation models on logic-heavy Solidity (F1 roughly doubles on evaluated cases), context helps inconsistently, chain-of-thought (CoT) reduces false positives, and deep-reasoning models often need less external knowledge.
Problem Statement
Automatic LLM-based vulnerability detection mixes three things: what the model intrinsically reasons about, what it knows from pretraining, and external aids like retrieved vulnerability reports, extra code context, and prompt tricks. That entanglement makes it hard to measure true reasoning skill. LLM4Vuln decouples those factors so you can quantify the marginal effect of each enhancement on vulnerability decisions.
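One way to picture "marginal effect of each enhancement" is to score every on/off combination of the three aids and average the change when a single aid is switched on while the others stay fixed. The sketch below does that. The `scores` values are invented for illustration and are not results from the paper, and the `marginal_gain` helper is a hypothetical name, not part of LLM4Vuln.

```python
def marginal_gain(scores, factor_index):
    """Average score change when one factor flips off -> on, others held fixed."""
    deltas = []
    for config, score in scores.items():
        if config[factor_index]:
            continue  # only start from configs where the factor is off
        flipped = config[:factor_index] + (True,) + config[factor_index + 1:]
        if flipped in scores:
            deltas.append(scores[flipped] - score)
    return sum(deltas) / len(deltas) if deltas else 0.0

# Illustrative F1 per (knowledge, context, cot) configuration. Made-up numbers.
scores = {
    (False, False, False): 0.30,
    (True, False, False): 0.50,
    (False, True, False): 0.32,
    (True, True, False): 0.50,
    (False, False, True): 0.36,
    (True, False, True): 0.54,
    (False, True, True): 0.36,
    (True, True, True): 0.56,
}

for index, name in enumerate(("knowledge", "context", "cot")):
    print(name, round(marginal_gain(scores, index), 3))
```

Holding the other factors fixed is what keeps the comparison controlled: each delta compares two runs that differ in exactly one aid.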
Main Contribution
LLM4Vuln: a modular framework that separates vulnerability reasoning from retrieval, context, and prompts.
UniVul: a benchmark with retrievable knowledge and context-supplementable code in Solidity, Java, and C/C++.
Large-scale controlled study: 6 LLMs evaluated on 294 samples across 3,528 scenarios, measuring knowledge, context, and prompt effects.
Real-world validation: discovery of 14 confirmed zero-day Solidity bugs, $3,576 in bounties.
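A controlled study of this kind can be laid out as a full grid of samples and enhancement settings. The sketch below is an assumption about how such a grid might be enumerated: the three knowledge settings, two context settings, and two prompt schemes are hypothetical values, chosen only because 294 × 12 happens to equal the reported 3,528 scenarios. The summary itself does not state the exact factorization.

```python
from itertools import product

# Hypothetical enhancement settings (assumed, not confirmed by the summary).
KNOWLEDGE = ("none", "original", "summarized")
CONTEXT = (False, True)
PROMPT = ("raw", "cot")

def build_scenarios(sample_ids):
    """Enumerate every sample under every combination of settings."""
    return [
        {"sample": s, "knowledge": k, "context": c, "prompt": p}
        for s, k, c, p in product(sample_ids, KNOWLEDGE, CONTEXT, PROMPT)
    ]

scenarios = build_scenarios(range(294))
print(len(scenarios))
```

Each model would then be run over the same scenario list, so differences between models are not confounded by differences in inputs.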
Key Findings
Knowledge retrieval helps foundation models on logic-heavy Solidity but not uniformly elsewhere.
For deep-reasoning models, external knowledge often yields little benefit and can hurt performance.
Context supplementation yields small and inconsistent gains.
Chain-of-thought (CoT) prompts reduce false positives and increase true negatives more reliably than they increase recall.
Framework found real bugs: 14 confirmed zero-day vulnerabilities in four Solidity projects.
Automated annotation by GPT-4.1 is reliable for binary vulnerability labels and acceptable for type labels.
Knowledge retrieval quality varies by language and retrieval alignment.
Results
Evaluation scenarios
3,528
Models tested
6
Zero-day vulnerabilities found
14
Bug bounty reward
$3,576
Accuracy
Who Should Care
What To Try In 7 Days
Run the UniVul Solidity baseline to see your model's out-of-box reasoning.
Add a small retrieval DB (top-3 FAISS) and compare precision/recall; expect gains for logic-heavy contracts.
Try chain-of-thought prompts when false positives are costly; measure FP/TN shifts, not just accuracy.
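The top-3 retrieval step can be prototyped before committing to a vector database. The sketch below swaps FAISS for a plain cosine-similarity ranking in pure Python so it runs without extra dependencies; a FAISS index would replace `top_k` in a real setup. The knowledge entries and their 3-dimensional vectors are toy placeholders, not embeddings from any real model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query_vec, knowledge_base, k=3):
    """Return the k knowledge texts most similar to the query vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy knowledge base: (text, embedding). Vectors are made up for illustration.
knowledge_base = [
    ("reentrancy via external call before state update", [0.9, 0.1, 0.0]),
    ("missing access control on admin function", [0.1, 0.9, 0.0]),
    ("integer overflow in balance arithmetic", [0.0, 0.2, 0.9]),
    ("price oracle manipulation", [0.7, 0.0, 0.3]),
]

query = [1.0, 0.0, 0.1]  # stand-in for the embedded function under test
hits = top_k(query, knowledge_base, k=3)
print(hits)
```

Run the detector once with `hits` in the prompt and once without, then compare precision and recall on the same samples.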
Agent Features
Tool Use
- function calling APIs
- vector DB retrieval
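As a rough illustration of how function calling can supply code context on demand, the sketch below routes a JSON tool call from a model to a local lookup. Everything here is hypothetical: the tool name `get_function_body`, the `SOURCE_INDEX` contents, and the call format are assumptions for illustration and do not reflect the framework's actual interface or any specific vendor API.

```python
import json

# Hypothetical project index: function name -> source text (placeholders).
SOURCE_INDEX = {
    "withdraw": "function withdraw(uint amount) external { /* body */ }",
    "_transfer": "function _transfer(address to, uint amount) internal { /* body */ }",
}

def get_function_body(name):
    """Return the source of a named function, or an empty string if unknown."""
    return SOURCE_INDEX.get(name, "")

# Registry of tools the model is allowed to invoke.
TOOLS = {"get_function_body": get_function_body}

def dispatch(tool_call_json):
    """Execute a tool call encoded as {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    handler = TOOLS.get(call.get("name"))
    if handler is None:
        return json.dumps({"error": "unknown tool"})
    result = handler(**call.get("arguments", {}))
    return json.dumps({"result": result})

reply = dispatch('{"name": "get_function_body", "arguments": {"name": "withdraw"}}')
print(reply)
```

Letting the model request context by name, rather than pasting every callee up front, is one way to limit the distraction that extra context can cause.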
Reproducibility
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Retrieval quality varies: Solidity retrieval aligned ~68% on sampled positives; Java only ~36% (§7.2).
- Context can distract models; benefits are inconsistent across languages and models (§6.2).
- Auto-annotation of vulnerability types is imperfect (81% accuracy for Solidity type labels).
- Sanitization reduces but may not eliminate pretraining leakage for some models.
When Not To Use
- For large monolithic C/C++/Java codebases where static analysis or fuzzing is already required to narrow targets.
- When retrieval quality cannot be measured or you lack curated domain reports.
Failure Modes
- Mismatched retrieval can mislead a model to focus on the wrong vulnerability and reduce recall.
- Models may assume vulnerability presence from prompt content, increasing false positives.
- Deep reasoning models can be harmed by noisy external knowledge, lowering precision.
Core Entities
Models
- GPT-4.1
- Phi-3-mini-128k
- Llama-3-8B
- o4-mini
- DeepSeek-R1
- QwQ-32B
Metrics
- Precision
- Recall
- F1
- TP
- TN
- FP
- FN
- FP-type
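The listed metrics follow from the confusion counts. The sketch below computes precision, recall, and F1, and reports the per-cell shift between two runs, which is useful when comparing a raw prompt against a CoT prompt. Two assumptions are made: the summary does not define how FP-type enters the formulas, so this sketch simply adds it to the false positives; and the `raw` and `cot` counts are invented for illustration.

```python
def prf(tp, fp, fn, fp_type=0):
    """Precision, recall, F1. Assumption: FP-type counts as a false positive."""
    fp_total = fp + fp_type
    precision = tp / (tp + fp_total) if tp + fp_total else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def confusion_shift(before, after):
    """Per-cell change in confusion counts between two runs."""
    return {key: after[key] - before[key] for key in ("TP", "TN", "FP", "FN")}

# Made-up counts on the same 140 samples under two prompt schemes.
raw = {"TP": 40, "TN": 50, "FP": 30, "FN": 20}
cot = {"TP": 38, "TN": 62, "FP": 18, "FN": 22}

shift = confusion_shift(raw, cot)
print(shift)
print(prf(raw["TP"], raw["FP"], raw["FN"]))
print(prf(cot["TP"], cot["FP"], cot["FN"]))
```

Reporting the shift alongside F1 shows whether a gain came from fewer false alarms or from more detections, which a single accuracy number hides.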
Datasets
- UniVul (Knowledge sets + Testing sets)
- Code4Rena (Solidity reports)
- CWE (Java/C/C++)
- CVE / BigVul (test samples)
Benchmarks
- UniVul

