MARBLE: a unified benchmark for music audio representations across 18 tasks

Overview

Decision Snapshot: Ready For Pilot

MARBLE is a well-documented, public benchmark and toolkit with baseline results, but some datasets are commercial or limited and some tasks/metrics are still incomplete; use it for comparative evaluation, not final product validation.

Citations: 6

Evidence Strength: 0.80

Confidence: 0.88

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 2/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 50%

Production readiness: 60%

Novelty: 60%

Authors

Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, Jie Fu

Why It Matters For Business

MARBLE gives a single, reproducible way to measure how well audio features transfer to many music tasks, helping teams pick pretrained models or prioritize fine-tuning where it's most needed.

Who Should Care

ML Engineer, Data Scientist, Product Manager, CTO

Summary TLDR

MARBLE is a community benchmark and toolkit for evaluating music audio representations. It collects 18 downstream music tasks across 12 public datasets, defines a unified evaluation protocol with constrained, semi-constrained, and unconstrained tracks, and publishes baseline results for nine open-source model variants from 7 model families. Results show that large pre-trained music models tend to perform best, but important tasks (e.g., source separation, tagging) still lag the previous state of the art and need more work. The repo and leaderboard are public for reproducible comparisons.

Problem Statement

Music understanding lacks a standard, broad benchmark. Existing evaluations are scattered, use varied setups, and omit sequence tasks. This makes fair comparison and progress tracking for music audio representations hard.

Main Contribution

A taxonomy of music tasks with four levels: acoustic, performance, score, and high-level description.

A unified benchmark (MARBLE) covering 18 discriminative MIR tasks on 12 public or commercial datasets.

Key Findings

MARBLE unifies 18 tasks across 12 datasets to evaluate music representations.

Numbers: 18 tasks; 12 datasets (Table 1).

Practical Use: Use MARBLE to get a single, reproducible view of how a representation performs across many real MIR tasks.

Evidence Ref: Sec. 1, Sec. 2, Table 1

Larger pre-trained models and more pretraining data generally deliver better downstream performance on MARBLE.

Numbers: MAP-MERT-v1-330M: 94.4% Nsynth pitch; larger models trend up in Fig. 2.

Practical Use: Expect scaling model size and data to be an effective path to improve general music features; test scaled variants before inventing new pretraining losses.

Evidence Ref: Table 3, Fig. 2, Sec. 4

Results

Metric	Value	Model	Delta	Split / Dataset	Evidence	Evidence Ref
Accuracy	94.4%	MAP-MERT-v1-330M	n/a	Nsynth pitch, constrained track	MAP-MERT-v1-330M achieves 94.4% on Nsynth pitch classification under the constrained track.	Table 3
ROC-AUC	91.4%	Jukebox-5B (close: MAP-MERT-v1 ~91.1%)	n/a	MagnaTagATune (MTT) tagging	Jukebox-5B reports 91.4 ROC-AUC; MAP-MERT variants reach ~91.0–91.1.	Table 3
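
Neither row reports an explicit delta, but the MTT gap can be derived from the two numbers quoted in the table. The snippet below is a minimal arithmetic sketch, not part of the MARBLE toolkit; the dictionary name and helper function are illustrative, and the MAP-MERT-v1 value is the approximate "~91.1" figure from the table.

```python
# Values copied from the Results table; "~91.1" is approximate in the source.
reported_roc_auc = {"Jukebox-5B": 91.4, "MAP-MERT-v1": 91.1}


def delta_points(value, reference):
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(value - reference, 1)


gap = delta_points(reported_roc_auc["Jukebox-5B"], reported_roc_auc["MAP-MERT-v1"])
print(f"Jukebox-5B leads MAP-MERT-v1 on MTT by about {gap} points")
```

Because one input is approximate, the resulting gap should be read as roughly 0.3 points rather than an exact figure.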

What To Try In 7 Days

Run the MARBLE constrained track on your model to get a fair, compute-limited comparison.

Compare a small vs large MAP-MERT checkpoint on your key task (e.g., tagging) to test scaling impact.

If you need source separation, fine-tune a dedicated separation head rather than relying on frozen embeddings.
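
For the small-vs-large checkpoint comparison on tagging, the quantity to compare is typically a macro-averaged ROC-AUC. The following is a minimal, dependency-free sketch of that comparison, assuming you already have per-clip tag scores from each checkpoint. The toy labels and scores are hypothetical and are not MARBLE results; a real run would use the toolkit's own evaluation code.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(label_rows, score_rows):
    """Average per-tag ROC-AUC; rows are clips, columns are tags."""
    n_tags = len(label_rows[0])
    per_tag = [
        roc_auc([row[t] for row in label_rows], [row[t] for row in score_rows])
        for t in range(n_tags)
    ]
    return sum(per_tag) / n_tags


# Hypothetical toy data: 4 clips, 2 tags. Not MARBLE results.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
small_scores = [[0.9, 0.6], [0.8, 0.7], [0.4, 0.5], [0.3, 0.2]]
large_scores = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6], [0.3, 0.2]]

small_auc = macro_roc_auc(labels, small_scores)
large_auc = macro_roc_auc(labels, large_scores)
print(f"small={small_auc:.3f} large={large_auc:.3f} delta={large_auc - small_auc:+.3f}")
```

The pairwise formulation is quadratic in the number of clips, which is fine for a sanity check but would normally be replaced by a rank-based library routine on full datasets.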

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/a43992899/MARBLE-Benchmark

https://marble-bm.shef.ac.uk

Data URLs

Dataset links are recorded in the MARBLE toolkit. Original datasets include MTT, MTG-Jamendo, GiantSteps, GTZAN, Nsynth, MedleyDB, VocalSet, GuitarSet, MUSDB18, and MulJam.

Risks & Boundaries

Limitations

Some datasets have commercial licenses or limited size (e.g., GTZAN), causing bias or restricted access.

MARBLE reports only one or two metrics per task, although some tasks are typically evaluated with several.

When Not To Use

When you need end-to-end generative evaluation (music synthesis): MARBLE covers discriminative tasks only.

When your target data is proprietary or out-of-distribution relative to the MARBLE datasets: leaderboard numbers may not translate.

Failure Modes

Frozen universal embeddings can perform poorly on tasks requiring fine temporal detail (e.g., source separation, beat and downbeat tracking) unless frame-level features are available.

Models trained with strong supervised pretraining can overfit to high-level labels and lose pitch/key sensitivity.
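
The first failure mode can be illustrated with a toy case: once frame-level features are averaged into a single clip-level vector, two clips with different event timing can become indistinguishable. This is a hedged sketch with hypothetical one-dimensional "onset" features, not the pooling used by any specific MARBLE baseline.

```python
def mean_pool(frames):
    """Collapse a (time, dim) list of frame embeddings into one clip-level vector."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    return [sum(frame[d] for frame in frames) / n_frames for d in range(n_dims)]


# Hypothetical frame-level features: 1.0 marks an onset, 0.0 marks silence.
clip_a = [[1.0], [0.0], [1.0], [0.0]]  # onsets on frames 0 and 2
clip_b = [[0.0], [1.0], [0.0], [1.0]]  # onsets on frames 1 and 3

pooled_a = mean_pool(clip_a)
pooled_b = mean_pool(clip_b)

print("frame-level features differ:", clip_a != clip_b)
print("clip-level vectors identical:", pooled_a == pooled_b)
```

A downstream head that only sees the pooled vector cannot recover where the onsets fall, which is why timing-sensitive tasks need the frame-level sequence.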

Core Entities

Models

MusiCNN, CLMR, Jukebox-5B, MULE, MAP-Music2Vec, MAP-MERT-v0, MAP-MERT-v1

Metrics

ROC-AUC; PR-AUC / AP; Accuracy; F1 (beat, 20ms tolerance); R2 (valence/arousal); SDR (Source-to-Distortion Ratio); CER (Character Error Rate); WER (Word Error Rate)
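
To make the beat F1 with a 20 ms tolerance concrete, here is a minimal sketch of one common way to compute it: match each reference beat to at most one estimated beat within the tolerance window, then combine precision and recall. The function name and the example beat times are illustrative assumptions; the benchmark's actual implementation may differ in details such as trimming the first seconds of audio.

```python
def beat_f1(reference, estimated, tolerance=0.020):
    """F1 over beat times in seconds, with one-to-one matching inside +/- tolerance."""
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = hits = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1  # matched pair; each beat is used at most once
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimate is too early for this and all later references
        else:
            i += 1  # reference is too early for this and all later estimates
    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical beat times in seconds.
reference_beats = [0.5, 1.0, 1.5, 2.0]
estimated_beats = [0.51, 1.03, 1.49, 2.6]

score = beat_f1(reference_beats, estimated_beats)
print(f"beat F1 at 20 ms tolerance: {score:.2f}")
```

In this toy case two of four estimates fall inside the window, so precision and recall are both 0.5. Loosening the tolerance to 50 ms would also accept the 1.03 s estimate, which shows how strongly the score depends on the chosen window.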

Datasets

MagnaTagATune (MTT), MTG-Jamendo (MTG), GiantSteps, GiantSteps-MTG-keys, GTZAN, Emomusic, MTG-MoodTheme, Nsynth, MedleyDB (MelodyDB), VocalSet, GuitarSet, MUSDB18, MulJam / MulJam2.0, Jamendo

Benchmarks

MARBLE

