Overview
The dataset and large-scale comparisons provide strong evidence for benchmarking. Practical use requires symbol recovery and a careful choice of decompiler, and production costs can be high for API-based models.
Citations: 12
Evidence Strength: 0.85
Confidence: 0.88
Risk Signals: 8
Trust Signals
Findings with numeric evidence: 10/10
Findings with evidence refs: 10/10
Results with explicit delta: 6/12
Reproducibility
Status: Code + data available
Open source: Partial
At A Glance
Cost impact: 65%
Production readiness: 60%
Novelty: 70%
Why It Matters For Business
Automated binary summaries can speed reverse engineering and threat triage, but quality hinges on decompilation and symbol availability; investing in decompilers and symbol recovery yields bigger gains than swapping models.
Summary TLDR
The authors release BinSum, a benchmark of 557K binary functions paired with summaries, spanning 4 architectures, 4 optimization levels, and 4 binary representations. They build an automated pipeline that extracts developer comments as ground truth, develop an LLM-driven prompt synthesis and optimization flow, and propose a semantic-embedding similarity metric. The evaluation consumed about 4B inference tokens ($11.4K, 873 A100 GPU hours) and shows that decompiled code is the best input; stripping symbols causes a large semantic loss (−55%); ChatGPT scores highest when symbols are present (0.543 similarity), while Code Llama scores highest on stripped binaries (~0.284); and the Hex-Rays decompiler and function names matter most. Zero-shot prompts give the best cost/benefit tradeoff.
Problem Statement
Can generative large language models reliably summarize the semantics of binary functions? The task lacks large real-world datasets, binary representations vary (raw bytes, assembly, IR, decompiled code), and closed-source LLMs are black boxes, which makes prompt design and evaluation difficult.
Main Contribution
BinSum: an open dataset of 557,664 binary functions compiled across 4 architectures and 4 optimization levels, with 4 representations (raw bytes, assembly, IR, decompiled) and comment-based ground truth.
A four-step LLM-driven prompt synthesis and optimization pipeline to find high-performing prompts at scale.
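To make the scale of the dataset concrete, the compilation matrix can be enumerated in a few lines. This is only an illustrative sketch: the summary gives the counts (4 architectures, 4 optimization levels, 4 representations, 44 GNU projects) but not the labels, so the architecture, optimization-level, and representation names below are assumptions chosen for readability.

```python
from itertools import product

# Assumed labels: the summary states only the counts (4 of each), not the names.
ARCHITECTURES = ["x86", "x64", "ARM", "MIPS"]
OPT_LEVELS = ["O0", "O1", "O2", "O3"]
REPRESENTATIONS = ["raw_bytes", "assembly", "ir", "decompiled"]

NUM_PROJECTS = 44  # GNU projects, as reported in the Results table


def build_matrix():
    """Return every (architecture, optimization level) pair a project is compiled under."""
    return list(product(ARCHITECTURES, OPT_LEVELS))


def total_builds(num_projects=NUM_PROJECTS):
    """Number of project builds if every project compiles under every configuration."""
    return num_projects * len(build_matrix())


if __name__ == "__main__":
    print(len(build_matrix()), "configurations per project")
    print(total_builds(), "project builds in total")
    print(len(REPRESENTATIONS), "representations per function")
```

Each function extracted from a build can then be rendered in each of the four representations, which is how a modest number of source projects grows into hundreds of thousands of benchmark samples.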
Key Findings
Stripping debugging symbols dramatically reduces the semantic content of decompiled code (about 55% semantic loss).
Among binary representations, decompiled code yields the best LLM understanding.
Results
| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
|---|---|---|---|---|---|---|
| dataset_size | 557,664 binary functions | — | — | BinSum | Compiled 44 GNU projects into 4 archs and 4 opt levels (§1.3, §3) | §1.3, §3 |
| total_inference_scale | 4,058,297,977 tokens; $11,418; 873 A100 GPU hours | — | — | full evaluation | Reported cost and token/GPU usage for whole study (abstract, §4.1) | Abstract, §4.1 |
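For budgeting a pilot, the reported totals can be turned into a rough per-token rate. The sketch below uses only the figures from the table; the summary does not say how the dollar amount splits across models or whether it covers the GPU hours, so the result is a blended planning number (roughly $2.8 per million tokens), not a price for any particular model. The function names are hypothetical.

```python
TOTAL_TOKENS = 4_058_297_977  # inference tokens, from the Results table
TOTAL_COST_USD = 11_418       # reported cost, from the Results table
GPU_HOURS = 873               # A100 GPU hours, from the Results table


def cost_per_million_tokens(cost_usd=TOTAL_COST_USD, tokens=TOTAL_TOKENS):
    """Blended USD cost per one million inference tokens."""
    return cost_usd / (tokens / 1_000_000)


def estimate_cost(expected_tokens, rate_per_million=None):
    """Scale the blended rate to a planned token volume (a rough estimate only)."""
    if rate_per_million is None:
        rate_per_million = cost_per_million_tokens()
    return expected_tokens / 1_000_000 * rate_per_million


if __name__ == "__main__":
    print(f"Blended rate: ${cost_per_million_tokens():.2f} per 1M tokens")
    print(f"Pilot of 50M tokens: about ${estimate_cost(50_000_000):.0f}")
```

Actual costs depend heavily on which model is used, so substitute a provider's current price as `rate_per_million` before relying on the estimate.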
What To Try In 7 Days
Run a small pilot: feed Hex-Rays decompiled functions with symbols into ChatGPT to measure summary utility on your corpus.
If binaries are stripped, benchmark Code Llama and test a lightweight symbol-recovery step first.
Adopt the semantic-embedding scorer from this paper to evaluate summaries beyond BLEU/ROUGE at scale.
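The embedding-based scoring idea in the last step can be sketched as cosine similarity between the embedding of a generated summary and the embedding of its reference comment. The summary does not name the embedding model the authors use, so `toy_embed` below is a deliberately crude stand-in (hashed word counts) that only keeps the example self-contained; it does not capture semantics, and a real sentence-embedding model should be passed in through the `embed` parameter. All function names here are hypothetical.

```python
import math
import re
import zlib

DIM = 256  # size of the toy embedding vector (an arbitrary choice)


def toy_embed(text, dim=DIM):
    """Stand-in embedding: hashed word counts. Swap in a real embedding model."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_summary(generated, reference, embed=toy_embed):
    """Similarity between a generated summary and its reference comment."""
    return cosine_similarity(embed(generated), embed(reference))


if __name__ == "__main__":
    reference = "Copy at most n bytes from src into dst."
    print(score_summary("Copies up to n bytes from src to dst.", reference))
    print(score_summary("Parses the configuration file.", reference))
```

Keeping the embedding function as a parameter makes it easy to compare scorers on the same set of summaries before committing to one.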
Risks & Boundaries
Limitations
The dataset is built from 44 GNU projects, so it is not representative of IoT, malware, or obfuscated binaries.
Ground truth uses developer comments, which may be noisy or inconsistent.
When Not To Use
For stripped, obfuscated malware without a symbol-recovery stage: summaries will be partial or misleading.
When formal or verifiable program understanding is required (LLMs can hallucinate).
Failure Modes
Hallucination or noisy, verbose summaries (GPT-4 was observed to add superfluous detail).
Susceptibility to manipulated function names that steer summaries.
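One way to probe the function-name failure mode on your own corpus is to summarize each function twice, once as decompiled and once with its name replaced by a neutral placeholder, and then compare the two summaries. The sketch below is an assumption about how such a check could look, not a procedure taken from the paper; `summarize` and `score` are hypothetical callables you would supply (for example, an LLM call and a similarity scorer).

```python
import re


def neutralize_name(decompiled_code, function_name, placeholder="sub_0"):
    """Replace whole-word occurrences of function_name with a neutral placeholder."""
    pattern = r"\b" + re.escape(function_name) + r"\b"
    # A callable replacement avoids special handling of backslashes in placeholder.
    return re.sub(pattern, lambda _match: placeholder, decompiled_code)


def name_sensitivity(decompiled_code, function_name, summarize, score):
    """Score how similar the summaries are with and without the function name.

    summarize: callable(str) -> str, hypothetical model wrapper.
    score: callable(str, str) -> float, hypothetical similarity scorer.
    A low score suggests the summary leans on the name rather than the body.
    """
    original = summarize(decompiled_code)
    neutral = summarize(neutralize_name(decompiled_code, function_name))
    return score(original, neutral)


if __name__ == "__main__":
    code = "int parse_header(char *buf) { return parse_header_v2(buf); }"
    print(neutralize_name(code, "parse_header"))
```

Functions whose summaries change sharply once the name is hidden are the ones most exposed to misleading or adversarially chosen names.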

