MARBLE: a unified benchmark for music audio representations across 18 tasks

June 18, 2023 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.6

Cost Impact Score

0.5

Citation Count

6

Authors

Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu

Links

Abstract / PDF

Why It Matters For Business

MARBLE gives a single, reproducible way to measure how well audio features transfer to many music tasks, helping teams pick pretrained models or prioritize fine-tuning where it's most needed.

Summary TLDR

MARBLE is a community benchmark and toolkit for evaluating music audio representations. It collects 18 downstream music tasks across 12 public datasets, defines a unified evaluation protocol with constrained, semi-constrained, and unconstrained tracks, and publishes baseline results for nine open-source model variants (7 model families). Results show that large pre-trained music models tend to do best, but important tasks (e.g., source separation, tagging) still lag previous SOTA and need more work. The repo and leaderboard are public for reproducible comparisons.

Problem Statement

Music understanding lacks a standard, broad benchmark. Existing evaluations are scattered, use varied setups, and omit sequence tasks. This makes fair comparison and progress tracking for music audio representations hard.

Main Contribution

A taxonomy of music tasks with four levels: acoustic, performance, score, and high-level description.

A unified benchmark (MARBLE) covering 18 discriminative MIR tasks on 12 public or commercial datasets.

A clear evaluation protocol with three tracks (constrained, semi-constrained, unconstrained), baseline results, and an open toolkit and leaderboard.

Key Findings

MARBLE unifies 18 tasks across 12 datasets to evaluate music representations.

Numbers: 18 tasks; 12 datasets (Table 1).

Larger pre-trained models and more pretraining data generally deliver better downstream performance on MARBLE.

Numbers: MAP-MERT-v1-330M reaches 94.4% on Nsynth pitch; larger models trend upward in Fig. 2.

Some tasks remain far from solved with universal frozen features, notably source separation and tagging.

Numbers: MUSDB18 SDR ~5.3–5.6 dB vs. previous SOTA 9.3–10.8 dB; tagging ROC/AP still below some SOTA results (Tables 3, 4).
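To give a feel for the scale of these SDR figures, here is a minimal sketch of the basic signal-to-distortion ratio: the energy of the reference stem divided by the energy of the estimation error, in decibels. This is the textbook form only. MUSDB18 evaluations commonly use a more involved BSS Eval-style computation, so treat this as illustrative and not as the benchmark's exact metric. The function name `sdr_db` and the toy signals are hypothetical.

```python
import math

def sdr_db(reference, estimate):
    """Basic SDR in dB: 10 * log10(sum(s^2) / sum((s - s_hat)^2))."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal_energy == 0.0:
        raise ValueError("reference is silent; SDR is undefined")
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)

# Toy reference stem: one second of a 5 Hz sine sampled at 100 Hz (made-up data).
reference = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
# Hypothetical estimate that recovers the stem at half amplitude.
estimate = [0.5 * s for s in reference]

print(round(sdr_db(reference, estimate), 2))
```

Because the scale is logarithmic, a gap of 3 dB corresponds to roughly a factor of two in error energy relative to the signal.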

Frame-level (sequence) outputs matter: some models cannot be evaluated on sequence tasks.

Numbers: MAP models provide high-rate embeddings (50–75 embeddings/sec) and succeed on sequence tasks; other baselines lack frame
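A small sketch of why the embedding rate matters: it shows the time step implied by 50 and 75 embeddings per second, and how mean-pooling a frame sequence into a single clip-level vector discards temporal order. The helper names and the toy frame values are hypothetical and are not part of the MARBLE toolkit.

```python
def frame_hop_ms(embeddings_per_second):
    """Time between consecutive embeddings, in milliseconds."""
    return 1000.0 / embeddings_per_second

def mean_pool(frames):
    """Average frame-level embeddings (equal-length vectors) into one clip-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]

# 50 and 75 embeddings/sec are the rates quoted above.
for rate in (50, 75):
    print(rate, "emb/s ->", round(frame_hop_ms(rate), 1), "ms per frame")

# Toy sequence of three 2-dimensional frames (made-up values).
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 4.0]]
clip_vector = mean_pool(frames)  # a single vector; frame order is no longer recoverable
print(clip_vector)
```

At 50 embeddings per second each frame spans 20 ms, which is on the same order as the 20 ms beat tolerance listed under Metrics. A model that only emits one pooled vector per clip has nothing to align against such a tolerance.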

MARBLE enforces a reproducible constrained track with compute/time limits.

Numbers: Constrained track: frozen backbone + MLP probe with a single 512-unit hidden layer (or a small LSTM/transformer); 1-week walltime on an RTX3090 (Sec.3
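The constrained setup can be pictured as a small probe sitting on top of precomputed, frozen embeddings. The sketch below is a dependency-free forward pass of such a probe with one 512-unit hidden layer. The ReLU activation, the initialization, the toy embedding size of 8, and all function names are assumptions for illustration; the actual toolkit code may differ.

```python
import random

def make_layer(rng, n_in, n_out):
    """One dense layer with small random weights and zero biases."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return {"w": weights, "b": [0.0] * n_out}

def make_probe(embed_dim, num_classes, hidden=512, seed=0):
    """Probe = embedding -> 512 hidden units -> class logits."""
    rng = random.Random(seed)
    return {"hidden": make_layer(rng, embed_dim, hidden),
            "out": make_layer(rng, hidden, num_classes)}

def linear(layer, x):
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(layer["w"], layer["b"])]

def probe_logits(probe, embedding):
    hidden = [max(0.0, v) for v in linear(probe["hidden"], embedding)]  # ReLU (assumed)
    return linear(probe["out"], hidden)

def num_params(probe):
    """Count the trainable parameters; the frozen backbone contributes none."""
    return sum(len(layer["w"]) * len(layer["w"][0]) + len(layer["b"])
               for layer in probe.values())

# The embedding stands in for a frozen backbone's output (toy size, made-up values).
frozen_embedding = [0.1 * i for i in range(8)]
probe = make_probe(embed_dim=8, num_classes=10)
logits = probe_logits(probe, frozen_embedding)
print(len(logits), num_params(probe))
```

Only the probe's weights would be updated during training; the embedding is treated as fixed input, which is what keeps the comparison between backbones compute-limited.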

Results

Nsynth pitch accuracy

Value: 94.4%

Baseline: MAP-MERT-v1-330M

MTT tagging ROC

Value: 91.4%

Baseline: Jukebox-5B (close: MAP-MERT-v1 ~91.1%)

MUSDB18 SDR (vocals/drums/bass/other)

Value: ≈5.3–5.6 dB (per-stem)

Baseline: MAP family / MAP-Music2Vec range

Lyrics transcription (MulJam WER)

Value: 77.0% (lower is better)

Baseline: MAP-MERT-v1-330M
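For readers less familiar with WER, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between the reference and the hypothesis, divided by the number of reference words. The function name and the toy lyric strings are hypothetical, and real evaluations usually normalize text (case, punctuation) first, which this sketch skips.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Toy example: one substituted word out of four.
print(word_error_rate("hold me close tonight", "hold me clothes tonight"))
```

Since insertions count as errors, WER is not capped at 100%, which is worth keeping in mind when reading high values such as the one reported above.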

Who Should Care

What To Try In 7 Days

Run the MARBLE constrained track on your model to get a fair, compute-limited comparison.

Compare a small vs large MAP-MERT checkpoint on your key task (e.g., tagging) to test scaling impact.

If you need source separation, fine-tune a dedicated separation head rather than relying on frozen embeddings.

Reproducibility

Data Urls

  • Dataset links are recorded in the MARBLE toolkit; the original datasets include MTT, MTG-Jamendo, GiantSteps, GTZAN, Nsynth, MedleyDB, VocalSet, GuitarSet, MUSDB18, and MulJam.

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Some datasets have commercial licenses or limited size (e.g., GTZAN), causing bias or restricted access.
  • MARBLE reports only one or two metrics per task, even for tasks that are typically evaluated with several.
  • Generative tasks and symbolic music are not included in the first release.

When Not To Use

  • When you need end-to-end generative evaluation (e.g., music synthesis); MARBLE covers discriminative tasks only.
  • When your target data is proprietary or out-of-distribution relative to the MARBLE datasets; leaderboard numbers may not translate.
  • When you need production-quality source separation; rely on task-specific pipelines rather than frozen MARBLE features alone.

Failure Modes

  • Frozen universal embeddings can perform poorly on tasks requiring fine temporal detail (e.g., source separation, beat and downbeat tracking) unless frame-level features exist.
  • Models trained with strong supervised pretraining can overfit to high-level labels and lose pitch/key sensitivity.
  • Leaderboard scores can hide dataset licensing caveats; partial submissions allowed for commercial datasets may complicate comparisons.

Core Entities

Models

  • MusiCNN
  • CLMR
  • Jukebox-5B
  • MULE
  • MAP-Music2Vec
  • MAP-MERT-v0
  • MAP-MERT-v1

Metrics

  • ROC-AUC
  • PR-AUC / AP
  • Accuracy
  • F1 (beat, 20ms tolerance)
  • R² (valence/arousal)
  • SDR (Source-to-Distortion Ratio)
  • CER (Character Error Rate)
  • WER (Word Error Rate)
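As a concrete reading of the first metric in this list, ROC-AUC for a single tag can be computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The sketch below implements that pairwise definition directly. The function name and toy data are hypothetical, and for multi-label tagging the score would typically be computed per tag and then averaged, which is an assumption here and not something this summary specifies.

```python
def roc_auc(labels, scores):
    """ROC-AUC for one binary tag via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy tag with two positive and two negative clips (made-up scores).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
print(roc_auc(labels, scores))
```

The pairwise loop is quadratic in the number of examples, so it suits small checks only; a sort-based implementation from an established library would be the practical choice for full datasets.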

Datasets

  • MagnaTagATune (MTT)
  • MTG-Jamendo (MTG)
  • GiantSteps
  • GiantSteps-MTG-keys
  • GTZAN
  • Emomusic
  • MTG-MoodTheme
  • Nsynth
  • MedleyDB (MelodyDB)
  • VocalSet
  • GuitarSet
  • MUSDB18
  • MulJam / MulJam2.0
  • Jamendo

Benchmarks

  • MARBLE