Overview
Production Readiness
0.6
Novelty Score
0.7
Cost Impact Score
0.65
Citation Count
12
Why It Matters For Business
Automated binary summaries can speed reverse engineering and threat triage, but quality hinges on decompilation and symbol availability; investing in decompilers and symbol recovery yields bigger gains than swapping models.
Summary TLDR
The authors release BinSum, a benchmark of 557K binary functions paired with summaries, spanning 4 architectures, 4 optimization levels, and 4 binary representations. They build an automated pipeline that extracts developer comments as ground truth, develop an LLM-driven prompt synthesis and optimization flow, and propose a semantic-embedding similarity metric. An evaluation over 4B inference tokens ($11.4K, 873 A100 GPU hours) shows that decompiled code is the best input; stripping symbols causes a large semantic loss (−55%); ChatGPT scores highest with symbols (0.543 similarity), while Code Llama is best on stripped binaries (~0.284); and the Hex-Rays decompiler and function names matter most. Zero-shot prompts give the best cost/benefit tradeoff.
Problem Statement
Can generative large language models reliably summarize the semantics of binary functions? The task lacks large real-world datasets, binary representations vary (raw bytes, assembly, IR, decompiled code), and closed-source LLMs are black boxes, which makes prompt design and evaluation difficult.
Main Contribution
BinSum: an open dataset of 557,664 binary functions compiled across 4 architectures and 4 optimization levels, with 4 representations (raw bytes, assembly, IR, decompiled) and comment-based ground truth.
A four-step LLM-driven prompt synthesis and optimization pipeline to find high-performing prompts at scale.
A semantic-embedding-based evaluation metric for code summary similarity, together with a large empirical study (4B tokens, $11,418 cost, 873 A100 GPU hours) comparing GPT-4, ChatGPT, Llama 2, Code Llama, and BinT5.
Key Findings
Stripping debugging symbols sharply reduces the semantics recoverable from decompiled code (about a 55% loss).
Among binary representations, decompiled code yields the best LLM understanding.
Model ranking depends on symbol availability: ChatGPT best with symbols; Code Llama best without.
Fine-tuning effects vary: Code Llama (fine-tuned from Llama 2) improves performance; BinT5 (fine-tuned on decompiled code) performs poorly here.
Inference speed differs substantially across models and access modes.
Performance varies across CPU architectures; best-case gaps up to 16%.
Optimization level (O0–O3) has minimal effect on LLM performance.
Choice of decompiler strongly affects LLM results; Hex-Rays leads.
Function names contribute most to decompiled-code semantics.
Zero-shot prompting offers the best balance of performance and cost at scale.
Results
dataset_size
total_inference_scale
semantic_similarity_source_code
semantic_similarity_decompiled_debug
semantic_similarity_decompiled_stripped
best_model_with_symbols
best_model_stripped
BinT5_performance
inference_time_per_sample
decompiler_effect
symbol_type_impact
optimization_level_gap
Who Should Care
What To Try In 7 Days
Run a small pilot: feed Hex-Rays decompiled functions with symbols into ChatGPT to measure summary utility on your corpus.
If binaries are stripped, benchmark Code Llama and test a lightweight symbol-recovery step first.
Adopt the semantic-embedding scorer from this paper to evaluate summaries beyond BLEU/ROUGE at scale.
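A pilot along these lines could start from a small planning helper like the one below. This is a minimal sketch, not the paper's pipeline: the prompt wording, the `PilotSample` type, and the model labels `"chatgpt"` and `"code-llama"` are all hypothetical placeholders. The only thing taken from the summary is the reported ranking (ChatGPT with symbols, Code Llama on stripped binaries) and the preference for zero-shot prompts.

```python
from dataclasses import dataclass

# Hypothetical zero-shot instruction; the paper's optimized prompt is not reproduced here.
ZERO_SHOT_TEMPLATE = (
    "Summarize what the following decompiled function does in one sentence.\n\n"
    "{code}\n"
)


@dataclass
class PilotSample:
    name: str             # label for the function in your own corpus
    decompiled_code: str  # decompiler output (e.g. from Hex-Rays)
    has_symbols: bool     # False if the binary was stripped


def choose_model(sample: PilotSample) -> str:
    """Pick a starting model from the reported ranking (labels are placeholders)."""
    return "chatgpt" if sample.has_symbols else "code-llama"


def build_prompt(sample: PilotSample) -> str:
    """Fill the zero-shot template with the decompiled function body."""
    return ZERO_SHOT_TEMPLATE.format(code=sample.decompiled_code.strip())


def plan_pilot(samples):
    """Return (name, model label, prompt) triples ready to send to each model."""
    return [(s.name, choose_model(s), build_prompt(s)) for s in samples]
```

Sending the prompts and collecting responses is left out on purpose, since that depends on which API or local runtime you use.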
Optimization Features
Token Efficiency
- zero-shot prompts favored to limit token cost
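To see why zero-shot prompting limits token cost, a rough estimate can help. The sketch below assumes a simple linear cost model (every in-context example is resent with every function); the function name, parameters, and all numbers in the usage note are illustrative and do not come from the paper.

```python
def prompt_cost_usd(n_functions, tokens_per_function, shots,
                    tokens_per_shot, usd_per_million_tokens):
    """Estimate input-token cost when each prompt carries `shots` examples.

    Assumes a flat per-token price and ignores output tokens.
    """
    tokens_per_prompt = tokens_per_function + shots * tokens_per_shot
    total_tokens = n_functions * tokens_per_prompt
    return total_tokens * usd_per_million_tokens / 1_000_000
```

With made-up inputs of 10,000 functions, 400 tokens per function, 500 tokens per example, and $1.00 per million tokens, `prompt_cost_usd(10_000, 400, 0, 500, 1.0)` gives 4.0 while the three-shot variant `prompt_cost_usd(10_000, 400, 3, 500, 1.0)` gives 19.0, so the few-shot overhead dominates the bill.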
Infra Optimization
- used DeepSpeed and A100 GPUs for large-model runs
Inference Optimization
- mixed-precision (BF16) for local Llama/Code Llama runs
- batching and multithreading for API/models
Reproducibility
Code Urls
Data Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset built from 44 GNU projects — not representative of IoT/malware/obfuscated binaries.
- Ground truth uses developer comments which may be noisy or inconsistent.
- LLM landscape evolves quickly; closed models (GPT-4) are black boxes and results may change.
When Not To Use
- For stripped, obfuscated malware without a symbol-recovery stage: summaries will be partial or misleading.
- When formal or verifiable program understanding is required (LLMs can hallucinate).
Failure Modes
- Hallucinated or noisy, verbose summaries (GPT-4 was observed to add superfluous detail).
- Susceptibility to manipulated function names that steer summaries.
- For raw-byte, assembly, and IR inputs, summaries focus on low-level operations instead of high-level intent.
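One way to probe the function-name failure mode on your own data is to mask the name and compare summary scores before and after. The helpers below are a hypothetical sketch of such an ablation, not the procedure used in the paper; the placeholder `sub_0` and both function names are assumptions made for illustration.

```python
import re


def mask_function_name(decompiled_code: str, func_name: str,
                       placeholder: str = "sub_0") -> str:
    """Replace every whole-word occurrence of func_name with a neutral placeholder."""
    pattern = r"\b" + re.escape(func_name) + r"\b"
    return re.sub(pattern, placeholder, decompiled_code)


def name_sensitivity(score_with_name: float, score_masked: float) -> float:
    """Relative drop in similarity after masking the name (0.0 means no drop)."""
    if score_with_name == 0:
        return 0.0
    return (score_with_name - score_masked) / score_with_name
```

A large relative drop suggests the summary leans on the name more than on the function body, which is also the condition under which a misleading name could steer the output.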
Core Entities
Models
- GPT-4
- ChatGPT
- Llama 2 (7B)
- Llama 2 (13B)
- Code Llama (7B)
- Code Llama (13B)
- BinT5
Metrics
- semantic-embedding similarity (all-mpnet-base-v2)
- BLEU
- METEOR
- ROUGE-L
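The semantic-embedding metric can be pictured as embedding both summaries and comparing the vectors. The sketch below assumes cosine similarity as the comparison, which this summary does not confirm, and uses a toy bag-of-words embedder (`toy_embed`, a stand-in invented here) so that it runs without downloading a model. In practice one would pass a real encoder instead, for example the `encode` method of a `sentence-transformers` model loaded as `SentenceTransformer("all-mpnet-base-v2")`.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def toy_embed(texts):
    """Stand-in embedder: word-count vectors over the shared vocabulary of `texts`.

    Not a semantic model; it only makes the example self-contained.
    """
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vectors


def semantic_similarity(generated, reference, embed=toy_embed):
    """Embed both summaries with `embed` and return their cosine similarity."""
    gen_vec, ref_vec = embed([generated, reference])
    return cosine_similarity(gen_vec, ref_vec)
```

Unlike BLEU, METEOR, or ROUGE-L, a score of this kind does not require the generated summary to reuse the reference wording, provided the embedder captures meaning; the toy embedder above does not, so treat it only as a placeholder.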
Datasets
- BinSum (this work)
Benchmarks
- BinSum
Context Entities
Models
- CodeT5
- Llama 2 (base)
Datasets
- GNU projects (44 repos compiled)
- CommonCrawl (mentioned)
- BigCode (mentioned)

