BinSum: a 557K-function benchmark showing when LLMs can (and cannot) summarize binary code

Overview

Decision Snapshot: Ready For Pilot

The dataset and large-scale comparisons are strong evidence for benchmarking; practical use requires symbol recovery and careful decompiler choice, and production costs can be high for API models.

Citations: 12

Evidence Strength: 0.85

Confidence: 0.88

Risk Signals: 8

Trust Signals

Findings with numeric evidence: 10/10

Findings with evidence refs: 10/10

Results with explicit delta: 6/12

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 60%

Novelty: 70%

Authors

Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin

Links

Abstract / PDF / Code / Data

Why It Matters For Business

Automated binary summaries can speed reverse engineering and threat triage, but quality hinges on decompilation and symbol availability; investing in decompilers and symbol recovery yields bigger gains than swapping models.

Who Should Care

Product Manager, ML Engineer, Engineering Lead, Data Scientist, CTO

Summary TLDR

The authors release BinSum, a 557K-function binary-to-summary benchmark spanning 4 architectures, 4 optimization levels, and 4 binary representations. They build an automated pipeline that extracts developer comments as ground truth, develop an LLM-driven prompt synthesis and optimization flow, and propose a semantic-embedding similarity metric. Across an evaluation of about 4B inference tokens ($11.4K, 873 A100 GPU hours), they show that decompiled code is the best input; that stripping symbols causes a large semantic loss (a 55% drop in similarity); that ChatGPT scores highest when symbols are present (0.543 similarity) while Code Llama is best on stripped binaries (about 0.284); and that the Hex-Rays decompiler and function names matter most. Zero-shot prompts give the best cost/benefit tradeoff.

Problem Statement

Can generative large language models reliably summarize the semantics of binary functions? The task lacks large real-world datasets, binary representations vary (raw bytes, assembly, IR, decompiled code), and closed-source LLMs are black boxes, which makes prompt design and evaluation difficult.

Main Contribution

BinSum: an open dataset of 557,664 binary functions compiled across 4 architectures and 4 optimization levels, with 4 representations (raw bytes, assembly, IR, decompiled) and comment-based ground truth.

A four-step LLM-driven prompt synthesis and optimization pipeline to find high-performing prompts at scale.

Key Findings

Stripping debugging symbols dramatically reduces decompiled-code semantics.

Numbers: 55.0% drop in semantic similarity (0.449 -> 0.202)

Practical Use: Keep or recover symbols (function names, variable names, types) when you need accurate LLM summaries; symbol-recovery tools matter.

Evidence Ref: §4.2, Figure 6
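The 55.0% figure is the relative drop between the two similarity scores quoted above. A quick check of the arithmetic, using only those two numbers (the variable names are illustrative):

```python
# Relative drop in semantic similarity when debug symbols are stripped.
with_symbols = 0.449  # decompiled code with debug symbols (quoted above)
stripped = 0.202      # decompiled code after stripping (quoted above)

relative_drop = (with_symbols - stripped) / with_symbols
print(f"{relative_drop:.1%}")  # prints 55.0%
```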

Among binary representations, decompiled code yields the best LLM understanding.

Numbers: decompiled (debug) similarity 0.449 vs assembly 0.188 vs raw bytes 0.118

Practical Use: Prefer decompiled output as LLM input for better summaries; the pipeline should produce decompiled code before querying models.

Evidence Ref: §4.2, Figure 6
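As a rough illustration of feeding decompiled code to a model in a zero-shot setting, the sketch below wraps one decompiled function in a single instruction. The function `build_zero_shot_prompt`, the instruction wording, and the `max_chars` cap are all assumptions made for this example; they are not the optimized prompt produced by the paper's pipeline.

```python
def build_zero_shot_prompt(decompiled_code: str, max_chars: int = 8000) -> str:
    """Wrap one decompiled function in a zero-shot summarization prompt.

    The instruction text and the max_chars cap are illustrative assumptions,
    not the prompt used in the paper.
    """
    code = decompiled_code.strip()
    if len(code) > max_chars:
        # Crude length guard so very large functions do not inflate token cost.
        code = code[:max_chars] + "\n/* truncated */"
    return (
        "Summarize what the following decompiled C function does "
        "in one or two sentences.\n\n"
        f"{code}\n\nSummary:"
    )


# Hypothetical decompiler-style output, used only to show the prompt shape.
example = """int sub_401000(int a1, int a2)
{
  return a1 > a2 ? a1 : a2;
}"""

prompt = build_zero_shot_prompt(example)
print(prompt)
```

The resulting string can be sent to whichever model is being piloted; no in-context examples are included, which keeps the per-function token count close to the size of the function itself.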

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
dataset_size	557,664 binary functions	—	—	BinSum	Compiled 44 GNU projects into 4 archs and 4 opt levels (§1.3, §3)	§1.3, §3
total_inference_scale	4,058,297,977 tokens; $11,418; 873 A100 GPU hours	—	—	full evaluation	Reported cost and token/GPU usage for whole study (abstract, §4.1)	Abstract, §4.1

What To Try In 7 Days

Run a small pilot: feed Hex‑Rays decompiled functions with symbols into ChatGPT to measure summary utility on your corpus.

If binaries are stripped, benchmark Code Llama and test a lightweight symbol-recovery step first.

Adopt the semantic-embedding scorer from this paper to evaluate summaries beyond BLEU/ROUGE at scale.
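A minimal sketch of an embedding-based scorer, assuming the score is the cosine similarity between sentence embeddings of the generated summary and the reference comment. The helper names (`cosine_similarity`, `semantic_similarity`, `load_mpnet_embedder`) are hypothetical; the loader relies on the `sentence-transformers` package and the `all-mpnet-base-v2` model listed under Metrics, and the authors' released code should be treated as the reference for their exact scoring procedure.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_similarity(
    candidate: str, reference: str, embed: Callable[[str], Vector]
) -> float:
    """Score a generated summary against a reference using any text embedder."""
    return cosine_similarity(embed(candidate), embed(reference))


def load_mpnet_embedder() -> Callable[[str], Vector]:
    """Return an embedder backed by all-mpnet-base-v2.

    Requires the sentence-transformers package; the model is downloaded on
    first use, so the import is kept inside this function.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-mpnet-base-v2")
    return lambda text: model.encode(text).tolist()


# Example usage (not run here because it downloads the model):
# embed = load_mpnet_embedder()
# score = semantic_similarity(
#     "Returns the larger of two integers.",
#     "Compute the maximum of a and b.",
#     embed,
# )
```

Passing the embedder in as a callable keeps the scoring logic testable without the model and makes it easy to swap in a different embedding model for comparison.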

Optimization Features

Token Efficiency

zero-shot prompts favored to limit token cost

Infra Optimization

used DeepSpeed, A100 GPUs for large-model runs

Inference Optimization

mixed-precision (BF16) for local Llama/Code Llama runs

batching and multithreading for API/models

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/xinjin95/BinSum

Data URLs

https://github.com/xinjin95/BinSum

Risks & Boundaries

Limitations

Dataset built from 44 GNU projects — not representative of IoT/malware/obfuscated binaries.

Ground truth uses developer comments which may be noisy or inconsistent.

When Not To Use

For stripped, obfuscated malware without a symbol-recovery stage: summaries will be partial or misleading.

When formal or verifiable program understanding is required (LLMs can hallucinate).

Failure Modes

Hallucination or noisy verbose summaries (GPT-4 observed to add superfluous detail).

Susceptibility to manipulated function names that steer summaries.
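One simple way to probe the function-name failure mode on your own corpus is to summarize each function twice, once with its original name and once with a neutral placeholder, and compare the two summaries. The helper below is a hypothetical sketch of the renaming step only; `neutralize_function_name` and the `sub_0` placeholder are assumptions made for this example, not part of the paper's tooling.

```python
import re


def neutralize_function_name(
    decompiled_code: str, name: str, placeholder: str = "sub_0"
) -> str:
    """Replace every whole-word occurrence of `name` with a neutral placeholder."""
    return re.sub(rf"\b{re.escape(name)}\b", placeholder, decompiled_code)


# Hypothetical decompiled function whose name could steer a summary.
original = "int decrypt_payload(char *buf, int n)\n{\n  return n + buf[0];\n}"
neutral = neutralize_function_name(original, "decrypt_payload")
print(neutral)
```

If the summary of the neutralized version differs sharply from the original, the model is likely leaning on the name rather than the function body.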

Core Entities

Models

GPT-4, ChatGPT, Llama 2 (7B), Llama 2 (13B), Code Llama (7B), Code Llama (13B), BinT5

Metrics

semantic-embedding similarity (all-mpnet-base-v2), BLEU, METEOR, ROUGE-L

Datasets

BinSum (this work)

Benchmarks

BinSum

Context Entities

Models

CodeT5, Llama 2 (base)

Datasets

GNU projects (44 repos compiled), CommonCrawl (mentioned), BigCode (mentioned)

