BinSum: a 557K-function benchmark showing when LLMs can (and cannot) summarize binary code

December 15, 2023 · 8 min read

Overview

  • Production Readiness: 0.6
  • Novelty Score: 0.7
  • Cost Impact Score: 0.65
  • Citation Count: 12

Authors

Xin Jin, Jonathan Larson, Weiwei Yang, Zhiqiang Lin

Why It Matters For Business

Automated binary summaries can speed reverse engineering and threat triage, but quality hinges on decompilation and symbol availability; investing in decompilers and symbol recovery yields bigger gains than swapping models.

Summary TLDR

The authors release BinSum, a 557K-function binary-to-summary benchmark spanning 4 architectures, 4 optimization levels, and 4 binary representations. They build an automated pipeline that extracts developer comments as ground truth, develop an LLM-driven prompt synthesis and optimization flow, and propose a semantic-embedding similarity metric. Across roughly 4B inference tokens ($11.4K, 873 A100 GPU hours) they find that decompiled code is the best input; stripping symbols cuts semantic similarity by 55%; ChatGPT scores highest when symbols are present (0.543 similarity), while Code Llama is best on stripped binaries (~0.284); and the Hex-Rays decompiler and function names matter most. Zero-shot prompts give the best cost/benefit tradeoff.

Problem Statement

Can generative large language models reliably summarize the semantics of binary functions? The task lacks large real-world datasets, binary representations vary (raw bytes, assembly, IR, decompiled code), and closed-source LLMs are black boxes, which makes prompt design and evaluation difficult.

Main Contribution

BinSum: an open dataset of 557,664 binary functions compiled across 4 architectures and 4 optimization levels, with 4 representations (raw bytes, assembly, IR, decompiled) and comment-based ground truth.
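To make the shape of the dataset concrete, the sketch below enumerates the configuration grid implied by the description above. The optimization levels and representations are taken from this summary; the architecture list is partly an assumption, because only x86, x64, and MIPS are named here and "arm" is filled in as a guess. Whether every cell of the grid is populated in the released data is not stated.

```python
from itertools import product

# Optimization levels and representations come from this summary.
# "arm" is an assumption: only x86, x64, and MIPS are named in the text.
ARCHITECTURES = ["x86", "x64", "arm", "mips"]
OPT_LEVELS = ["O0", "O1", "O2", "O3"]
REPRESENTATIONS = ["raw_bytes", "assembly", "ir", "decompiled"]


def build_configs():
    """Return every (architecture, optimization level, representation) combination."""
    return [
        {"arch": arch, "opt": opt, "repr": rep}
        for arch, opt, rep in product(ARCHITECTURES, OPT_LEVELS, REPRESENTATIONS)
    ]


configs = build_configs()
print(len(configs))  # 4 * 4 * 4 = 64
```

Keeping the grid explicit makes it easy to slice results by one axis (for example, all decompiled inputs) when reproducing the comparisons reported below.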

A four-step LLM-driven prompt synthesis and optimization pipeline to find high-performing prompts at scale.
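This summary does not spell out the four steps, so the following is only a generic illustration of one idea that any such pipeline needs: ranking candidate prompts by their average score on a small labeled sample. The `summarize` and `score` callables are hypothetical placeholders for an LLM call and a similarity metric; nothing here is the authors' implementation.

```python
from statistics import mean
from typing import Callable, Sequence


def rank_prompts(
    prompts: Sequence[str],
    samples: Sequence[tuple[str, str]],    # (function text, reference summary)
    summarize: Callable[[str, str], str],  # hypothetical: (prompt, function) -> summary
    score: Callable[[str, str], float],    # hypothetical: (summary, reference) -> similarity
) -> list[tuple[str, float]]:
    """Score each candidate prompt by mean similarity on the samples, best first."""
    results = []
    for prompt in prompts:
        scores = [score(summarize(prompt, fn), ref) for fn, ref in samples]
        results.append((prompt, mean(scores)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

In practice the sample would be a small held-out slice of the benchmark, so that prompt selection does not consume the token budget of a full run.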

A semantic-embedding based evaluation metric for code summary similarity and a large empirical study (4B tokens, $11,418 cost, 873 A100 GPU hours) comparing GPT-4, ChatGPT, Llama 2, Code Llama, and BinT5.

Key Findings

Stripping debugging symbols dramatically reduces decompiled-code semantics.

Numbers: 55.0% drop in semantic similarity (0.449 -> 0.202)
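The percentage here is a relative drop, not a difference in similarity points. A short check of the arithmetic, using only the two values quoted above (the helper name `relative_drop` is ours):

```python
def relative_drop(before: float, after: float) -> float:
    """Return the percentage decrease from `before` to `after`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (before - after) / before * 100.0


drop = relative_drop(0.449, 0.202)
print(f"{drop:.1f}%")  # 55.0%
```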

Among binary representations, decompiled code yields best LLM understanding.

Numbers: decompiled (debug) similarity 0.449 vs assembly 0.188 vs raw bytes 0.118

Model ranking depends on symbol availability: ChatGPT best with symbols; Code Llama best without.

Numbers: ChatGPT 0.543 (decompiled w/ symbols); Code Llama ~0.284 (stripped)

Fine-tuning effects vary: Code Llama (fine-tuned from Llama 2) improves performance; BinT5 (fine-tuned on decompiled code) performs poorly here.

Numbers: Code Llama up to 22.0% better than Llama 2; BinT5 similarity ~0.115

Inference speed differs substantially across models and access modes.

Numbers: ChatGPT 1.07s/sample vs GPT-4 3.1s/sample (2.9×); BinT5 fastest (0.22s)
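Per-sample latency adds up quickly at corpus scale. The sketch below turns the quoted latencies into a rough sequential wall-clock estimate. The batch size of 100,000 functions is an illustrative assumption, and the estimate ignores batching, parallel requests, and rate limits, all of which would change the real figure.

```python
# Seconds per sample as quoted in this summary.
SECONDS_PER_SAMPLE = {
    "ChatGPT": 1.07,
    "GPT-4": 3.10,
    "BinT5": 0.22,
}


def sequential_hours(n_samples: int, seconds_per_sample: float) -> float:
    """Hours needed to process n_samples one after another."""
    return n_samples * seconds_per_sample / 3600.0


# 100,000 functions is an arbitrary illustrative workload, not a figure from the paper.
for model, secs in SECONDS_PER_SAMPLE.items():
    print(f"{model}: {sequential_hours(100_000, secs):.1f} h per 100,000 functions")
```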

Performance varies across CPU architectures; best-case gaps up to 16%.

Numbers: x64 with symbols is best (0.503); MIPS is best among stripped binaries (up to 16.0% gap vs x86)

Optimization level (O0–O3) has minimal effect on LLM performance.

Numbers: only ~1.47% similarity difference between O0 and O3

Choice of decompiler strongly affects LLM results; Hex-Rays leads.

Numbers: Hex-Rays outperforms Ghidra by up to 60.7% on stripped binaries

Function names contribute most to decompiled-code semantics.

Numbers: removing function names drops similarity by 30.3%; variable names 30.0%; types 19.5%
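To run a similar ablation on your own decompiled output, one simple approach is to rewrite known function names to address-based placeholders, in the `sub_<address>` style that IDA uses for unnamed functions. This is a minimal sketch under that assumption, not the procedure used in the paper; the function name `strip_function_names` and the example addresses are hypothetical.

```python
import re


def strip_function_names(code: str, names: dict[str, int]) -> str:
    """Replace each known function name with a sub_<hex address> placeholder."""
    for name, addr in names.items():
        # \b keeps the match to whole identifiers, so parse_header_v2 is untouched.
        code = re.sub(rf"\b{re.escape(name)}\b", f"sub_{addr:X}", code)
    return code


src = "int parse_header(char *buf) { return read_u32(buf); }"
stripped = strip_function_names(src, {"parse_header": 0x401000, "read_u32": 0x401200})
print(stripped)  # int sub_401000(char *buf) { return sub_401200(buf); }
```

Summarizing both versions of the same function and comparing scores gives a local estimate of how much your model leans on names.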

Zero-shot prompting offers the best balance of performance and cost at scale.

Numbers: zero-shot similarity 0.282 vs few-shot 0.312, but few-shot uses ~7.2× more tokens
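The tradeoff is easier to judge as a ratio. The sketch below computes the relative similarity gain of few-shot over zero-shot and sets it against the extra tokens, using only the three figures quoted above; the helper name `tradeoff` and its output keys are ours.

```python
def tradeoff(zero_shot: float, few_shot: float, token_multiplier: float) -> dict:
    """Compare the relative similarity gain with the relative token overhead."""
    gain_pct = (few_shot - zero_shot) / zero_shot * 100.0
    extra_tokens_pct = (token_multiplier - 1.0) * 100.0
    return {
        "gain_pct": gain_pct,
        "extra_tokens_pct": extra_tokens_pct,
        "gain_per_extra_token_pct": gain_pct / extra_tokens_pct,
    }


result = tradeoff(0.282, 0.312, 7.2)
print(f"similarity gain: {result['gain_pct']:.1f}%")       # 10.6%
print(f"extra tokens:    {result['extra_tokens_pct']:.0f}%")  # 620%
```

A gain of about 10.6% for roughly 620% more tokens is why zero-shot comes out ahead on cost at benchmark scale.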

Results

  • dataset_size: 557,664 binary functions
  • total_inference_scale: 4,058,297,977 tokens; $11,418; 873 A100 GPU hours
  • semantic_similarity_source_code: 0.474
  • semantic_similarity_decompiled_debug: 0.449
  • semantic_similarity_decompiled_stripped: 0.202
  • best_model_with_symbols: ChatGPT, 0.543 average similarity (baseline: other tested LLMs)
  • best_model_stripped: Code Llama, ~0.284 (7B) and 0.283 (13B) (baseline: other tested LLMs)
  • BinT5_performance: ≈0.115 semantic similarity (baseline: other LLMs)
  • inference_time_per_sample: ChatGPT 1.07s; GPT-4 3.10s; Llama2-7B 0.83s; CodeLlama-13B 1.72s; BinT5 0.22s
  • decompiler_effect: Hex-Rays outperforms others by up to 60.7% (baseline: Ghidra, Angr)
  • symbol_type_impact: removing function names -> −30.3% similarity (baseline: original decompiled code with symbols)
  • optimization_level_gap: ≈1.47% difference (O0 vs O3)

Who Should Care

What To Try In 7 Days

Run a small pilot: feed Hex-Rays decompiled functions with symbols into ChatGPT to measure summary utility on your corpus.

If binaries are stripped, benchmark Code Llama and test a lightweight symbol-recovery step first.

Adopt the semantic-embedding scorer from this paper to evaluate summaries beyond BLEU/ROUGE at scale.
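For the scoring step, the core operation is cosine similarity between an embedding of the generated summary and an embedding of the reference. The sketch below shows that computation with a toy bag-of-words embedder so that it runs without any model download. The toy embedder is a stand-in and not what the paper uses (the paper's metric is built on all-mpnet-base-v2); `toy_embed` and `score_summary` are hypothetical names.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedder: lowercase bag-of-words counts (NOT a neural model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_summary(generated: str, reference: str, embed=toy_embed) -> float:
    """Score a generated summary against a reference summary."""
    return cosine_similarity(embed(generated), embed(reference))


print(score_summary("parse the packet header", "parse the header of a packet"))
```

Unlike this toy version, a sentence-embedding model can give high scores to paraphrases that share few words, which is the reason to prefer it over exact-overlap metrics such as BLEU or ROUGE.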

Optimization Features

Token Efficiency

  • zero-shot prompts favored to limit token cost

Infra Optimization

  • used DeepSpeed and A100 GPUs for large-model runs

Inference Optimization

  • mixed-precision (BF16) for local Llama/Code Llama runs
  • batching and multithreading for API/models
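One reason BF16 helps for local runs is memory: each parameter takes 2 bytes instead of the 4 bytes of FP32. The sketch below estimates weight memory for the nominal 7B and 13B model sizes listed in this summary. It covers weights only and ignores activations and the key-value cache, so real usage will be higher.

```python
# Bytes needed to store one parameter in each format.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts; actual counts differ slightly from the "7B"/"13B" labels.
for n_params in (7e9, 13e9):
    fp32 = weight_memory_gb(n_params, "fp32")
    bf16 = weight_memory_gb(n_params, "bf16")
    print(f"{n_params / 1e9:.0f}B params: fp32 {fp32:.0f} GB, bf16 {bf16:.0f} GB")
```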

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset is built from 44 GNU projects, so it is not representative of IoT, malware, or obfuscated binaries.
  • Ground truth uses developer comments which may be noisy or inconsistent.
  • LLM landscape evolves quickly; closed models (GPT-4) are black boxes and results may change.

When Not To Use

  • For stripped, obfuscated malware without a symbol-recovery stage: summaries will be partial or misleading.
  • When formal or verifiable program understanding is required (LLMs can hallucinate).

Failure Modes

  • Hallucinated or noisy, verbose summaries (GPT-4 was observed to add superfluous detail).
  • Susceptibility to manipulated function names that steer summaries.
  • Focus on low-level operations for raw/assembly/IR instead of high-level intent.

Core Entities

Models

  • GPT-4
  • ChatGPT
  • Llama 2 (7B)
  • Llama 2 (13B)
  • Code Llama (7B)
  • Code Llama (13B)
  • BinT5

Metrics

  • semantic-embedding similarity (all-mpnet-base-v2)
  • BLEU
  • METEOR
  • ROUGE-L

Datasets

  • BinSum (this work)

Benchmarks

  • BinSum

Context Entities

Models

  • CodeT5
  • Llama 2 (base)

Datasets

  • GNU projects (44 repos compiled)
  • CommonCrawl (mentioned)
  • BigCode (mentioned)