BinSum: a 557K-function benchmark showing when LLMs can (and cannot) summarize binary code

December 15, 2023 · 8 min

Overview

Decision Snapshot: Ready For Pilot

The dataset and large-scale comparisons are strong evidence for benchmarking; practical use requires symbol recovery and careful decompiler choice, and production costs can be high for API models.

Citations: 12

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 10/10

Findings with evidence refs: 10/10

Results with explicit delta: 6/12

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 70%

Authors

Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated binary summaries can speed reverse engineering and threat triage, but quality hinges on decompilation and symbol availability; investing in decompilers and symbol recovery yields bigger gains than swapping models.

Summary TLDR

The authors release BinSum, a 557K-function binary-to-summary benchmark spanning 4 architectures, 4 optimization levels, and 4 binary representations. They build an automated pipeline that extracts developer comments as ground truth, develop an LLM-driven prompt synthesis and optimization flow, and propose a semantic-embedding similarity metric. Across roughly 4B inference tokens ($11.4K, 873 A100 GPU hours), they show that decompiled code is the best input; stripping symbols causes a large semantic loss (−55%); ChatGPT scores highest when symbols are present (0.543 similarity), while Code Llama is best on stripped binaries (~0.284); and the Hex-Rays decompiler and function names matter most. Zero-shot prompts give the best cost/benefit tradeoff.

Problem Statement

Can generative large language models reliably summarize the semantics of binary functions? The task lacks large real-world datasets, binary representations vary (raw bytes, assembly, IR, decompiled code), and closed-source LLMs are black boxes, which makes prompt design and evaluation difficult.

Main Contribution

BinSum: an open dataset of 557,664 binary functions compiled across 4 architectures and 4 optimization levels, with 4 representations (raw bytes, assembly, IR, decompiled) and comment-based ground truth.

A four-step LLM-driven prompt synthesis and optimization pipeline to find high-performing prompts at scale.

Key Findings

Stripping debugging symbols dramatically reduces decompiled-code semantics.

Numbers: 55.0% drop in semantic similarity (0.449 -> 0.202)

Practical Use: Keep or recover symbols (function names, variable names, types) when you need accurate LLM summaries; symbol-recovery tools matter.

Evidence Ref: §4.2, Figure 6
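The 55.0% figure is a relative drop, and it can be checked directly from the two reported similarity scores. A minimal sketch in Python; the helper name `relative_drop` is illustrative and does not come from the paper:

```python
def relative_drop(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before


# Reported similarity for decompiled code with and without debug symbols.
with_symbols = 0.449
stripped = 0.202

print(f"{relative_drop(with_symbols, stripped):.1%}")  # prints 55.0%
```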

Among binary representations, decompiled code yields best LLM understanding.

Numbers: decompiled (debug) similarity 0.449 vs assembly 0.188 vs raw bytes 0.118

Practical Use: Prefer decompiled output as LLM input for better summaries; the pipeline should produce decompiled code before querying models.

Evidence Ref: §4.2, Figure 6
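Sending decompiled code to a model in a zero-shot setting amounts to one instruction followed by the function body, with no worked examples. The sketch below illustrates that shape only. The instruction wording and the name `build_zero_shot_prompt` are hypothetical and are not the optimized prompt produced by the paper's pipeline:

```python
def build_zero_shot_prompt(decompiled_code: str) -> str:
    """Wrap one decompiled function in a single instruction, with no examples."""
    # Hypothetical wording; the paper's optimized prompt is not reproduced here.
    instruction = (
        "Summarize what the following decompiled function does "
        "in one or two sentences."
    )
    return f"{instruction}\n\n{decompiled_code.strip()}\n"


example = "int add(int a, int b) { return a + b; }"
print(build_zero_shot_prompt(example))
```

Keeping the prompt to a single instruction avoids the extra tokens that few-shot examples would add to every request.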

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
dataset_size | 557,664 binary functions | - | - | BinSum | Compiled 44 GNU projects into 4 archs and 4 opt levels (§1.3, §3) | §1.3, §3
total_inference_scale | 4,058,297,977 tokens; $11,418; 873 A100 GPU hours | - | - | full evaluation | Reported cost and token/GPU usage for whole study (abstract, §4.1) | Abstract, §4.1
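For pilot budgeting, the reported totals can be turned into a rough blended rate. This is a back-of-the-envelope sketch, not a figure from the paper: it assumes the dollar amount can be spread evenly over the full token count, and the summary does not say how cost splits between API and local GPU runs. Both function names are illustrative.

```python
def blended_usd_per_million_tokens(total_usd: float, total_tokens: int) -> float:
    """Average dollars per one million tokens across the whole study."""
    return total_usd / total_tokens * 1_000_000


def estimate_pilot_cost(pilot_tokens: int, usd_per_million: float) -> float:
    """Scale the blended rate to a pilot of `pilot_tokens` tokens."""
    return pilot_tokens / 1_000_000 * usd_per_million


# Totals as reported for the full evaluation.
rate = blended_usd_per_million_tokens(11_418, 4_058_297_977)
print(f"blended rate: ${rate:.2f} per million tokens")
print(f"10M-token pilot: ${estimate_pilot_cost(10_000_000, rate):.2f}")
```

Real per-model prices will differ from this average, so treat the result as an order-of-magnitude guide only.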

What To Try In 7 Days

Run a small pilot: feed Hex‑Rays decompiled functions with symbols into ChatGPT to measure summary utility on your corpus.

If binaries are stripped, benchmark Code Llama and test a lightweight symbol-recovery step first.

Adopt the semantic-embedding scorer from this paper to evaluate summaries beyond BLEU/ROUGE at scale.
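A semantic-embedding scorer of this kind can be sketched as cosine similarity between the embedding of a generated summary and the embedding of its reference comment. The summary does not spell out the exact formula, so cosine similarity is an assumption here. `toy_embed` is a stand-in that only counts letters so the sketch runs without extra packages; it carries no semantic meaning.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_summary(candidate: str, reference: str,
                  embed: Callable[[str], Vector]) -> float:
    """Embed both texts with `embed` and compare them."""
    return cosine_similarity(embed(candidate), embed(reference))


def toy_embed(text: str) -> List[float]:
    """Stand-in embedder (letter counts). NOT a semantic model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


print(score_summary("copies a buffer", "copy bytes into buffer", toy_embed))
```

To score with the model named under Metrics, one option is to pass `SentenceTransformer("all-mpnet-base-v2").encode` from the sentence-transformers package as `embed` in place of `toy_embed`.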

Optimization Features

Token Efficiency: zero-shot prompts favored to limit token cost
Infra Optimization: used DeepSpeed, A100 GPUs for large-model runs
Inference Optimization: mixed-precision (BF16) for local Llama/Code Llama runs; batching and multithreading for API/models
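Multithreaded querying of an API model can be sketched with a thread pool from the standard library. This is an illustration of the general technique, not the authors' harness. `summarize` is a hypothetical callable that wraps whichever model client is in use, and `fake_summarize` exists only so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def summarize_all(functions: Iterable[str],
                  summarize: Callable[[str], str],
                  max_workers: int = 8) -> List[str]:
    """Run `summarize` over all inputs concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(summarize, functions))


def fake_summarize(code: str) -> str:
    """Offline stand-in for a real model call."""
    return f"summary of {len(code)} characters of code"


print(summarize_all(["int f() { return 1; }", "void g() {}"], fake_summarize))
```

Threads suit this workload because each request spends most of its time waiting on the network. A real client would also need rate limiting and retries, which are omitted here.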

Reproducibility

Code Available: Yes
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

The dataset is built from 44 GNU projects, so it is not representative of IoT, malware, or obfuscated binaries.

Ground truth uses developer comments which may be noisy or inconsistent.

When Not To Use

For stripped, obfuscated malware without a symbol-recovery stage: summaries will be partial or misleading.

When formal or verifiable program understanding is required (LLMs can hallucinate).

Failure Modes

Hallucination or noisy verbose summaries (GPT-4 observed to add superfluous detail).

Susceptibility to manipulated function names that steer summaries.
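One way to probe how much a summary depends on a function name is to replace the name with a neutral placeholder and compare the two summaries. The sketch below covers only the renaming step. It is an assumed test procedure, not one described in the source, and `neutralize_function_name` and the `sub_0` placeholder are illustrative.

```python
import re


def neutralize_function_name(decompiled: str, name: str,
                             placeholder: str = "sub_0") -> str:
    """Replace whole-word occurrences of `name` with a neutral placeholder."""
    pattern = rf"\b{re.escape(name)}\b"
    return re.sub(pattern, placeholder, decompiled)


code = "int decrypt_payload(int a) { return decrypt_payload2(a); }"
print(neutralize_function_name(code, "decrypt_payload"))
# prints: int sub_0(int a) { return decrypt_payload2(a); }
```

If the summary changes sharply after renaming, the model is leaning on the name rather than the function body, which is the situation a manipulated name would exploit.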

Core Entities

Models

GPT-4, ChatGPT, Llama 2 (7B), Llama 2 (13B), Code Llama (7B), Code Llama (13B), BinT5

Metrics

semantic-embedding similarity (all-mpnet-base-v2), BLEU, METEOR, ROUGE-L

Datasets

BinSum (this work)

Benchmarks

BinSum

Context Entities

Models

CodeT5, Llama 2 (base)

Datasets

GNU projects (44 repos compiled), CommonCrawl (mentioned), BigCode (mentioned)