Overview
Production Readiness
0.6
Novelty Score
0.6
Cost Impact Score
0.5
Citation Count
6
Why It Matters For Business
MARBLE gives a single, reproducible way to measure how well audio features transfer to many music tasks, helping teams pick pretrained models or prioritize fine-tuning where it's most needed.
Summary TLDR
MARBLE is a community benchmark and toolkit for evaluating music audio representations. It collects 18 downstream music tasks across 12 publicly available datasets, defines a unified evaluation protocol with constrained, semi-constrained, and unconstrained tracks, and publishes baseline results for nine open-source model variants from 7 model families. Large pre-trained music models tend to perform best, but on some important tasks (e.g., source separation, tagging) results still lag the previous state of the art and need more work. The repo and leaderboard are public for reproducible comparisons.
Problem Statement
Music understanding lacks a standard, broad benchmark. Existing evaluations are scattered, use varied setups, and omit sequence tasks. This makes fair comparison and progress tracking for music audio representations hard.
Main Contribution
A taxonomy of music tasks with four levels: acoustic, performance, score, and high-level description.
A unified benchmark (MARBLE) covering 18 discriminative MIR tasks on 12 publicly available datasets, some of which carry commercial licenses.
A clear evaluation protocol with three tracks (constrained, semi-constrained, unconstrained), baseline results, and an open toolkit and leaderboard.
Key Findings
MARBLE unifies 18 tasks across 12 datasets to evaluate music representations.
Larger pre-trained models and more pretraining data generally deliver better downstream performance on MARBLE.
Some tasks remain far from solved with universal frozen features, notably source separation and tagging.
Frame-level (sequence) outputs matter: models that do not expose frame-level features cannot be evaluated on sequence tasks.
MARBLE enforces a reproducible constrained track with compute/time limits.
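The point about frame-level outputs can be illustrated with a minimal sketch. It assumes a model that emits one feature vector per frame; the function name `mean_pool` and the toy values are hypothetical and are not part of the MARBLE toolkit. Averaging over time yields a clip-level embedding that suits clip-level tasks such as tagging, but it discards the time axis that sequence tasks need.

```python
from typing import List

Frames = List[List[float]]  # shape: (num_frames, feature_dim)


def mean_pool(frames: Frames) -> List[float]:
    """Average frame-level features over time into one clip-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


# Hypothetical frame-level output: 4 frames, 2 feature dimensions.
frame_features = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0], [2.0, 3.0]]

# One vector for the whole clip; per-frame timing information is gone.
clip_embedding = mean_pool(frame_features)
print(len(frame_features), "frames ->", len(clip_embedding), "dims:", clip_embedding)
```

A model that only ever returns the pooled vector can feed a clip-level classifier, but there is nothing left to attach a per-frame prediction to, which is why such models drop out of sequence tasks.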
Results
Accuracy
MTT tagging ROC
MUSDB18 SDR (vocals/drums/bass/other)
Lyrics transcription (MulJam WER)
Who Should Care
What To Try In 7 Days
Run the MARBLE constrained track on your model to get a fair, compute-limited comparison.
Compare a small vs large MAP-MERT checkpoint on your key task (e.g., tagging) to test scaling impact.
If you need source separation, run a fine-tuned separation head rather than frozen embeddings.
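For the small-versus-large checkpoint comparison on tagging, a minimal sketch of the scoring step might look like the following. It assumes you already have per-tag prediction scores from each checkpoint; the score matrices below are made-up toy values, not MARBLE results, and the function names are hypothetical. ROC-AUC is computed here from its pairwise-ranking definition and averaged over tags.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_roc_auc(label_matrix: List[List[int]], score_matrix: List[List[float]]) -> float:
    """Average the per-tag ROC-AUC; both matrices are (num_clips, num_tags)."""
    num_tags = len(label_matrix[0])
    per_tag = [
        roc_auc([row[t] for row in label_matrix], [row[t] for row in score_matrix])
        for t in range(num_tags)
    ]
    return sum(per_tag) / num_tags


# Toy data (assumption): 4 clips, 2 tags, scores from two hypothetical checkpoints.
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
small_scores = [[0.6, 0.4], [0.7, 0.5], [0.8, 0.3], [0.2, 0.6]]
large_scores = [[0.9, 0.2], [0.3, 0.8], [0.7, 0.6], [0.1, 0.4]]

print("small:", macro_roc_auc(labels, small_scores))
print("large:", macro_roc_auc(labels, large_scores))
```

The pairwise loop is quadratic in the number of clips, which is fine for a quick check; for full test sets a sort-based implementation from an established library is the more practical choice.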
Reproducibility
Data Urls
- Dataset links are recorded in the MARBLE toolkit; the original datasets include MTT, MTG-Jamendo, GiantSteps, GTZAN, NSynth, MedleyDB, VocalSet, GuitarSet, MUSDB18, and MulJam.
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Some datasets have commercial licenses or limited size (e.g., GTZAN), causing bias or restricted access.
- Each task reports only one or two metrics, even where several metrics are typically used for that task.
- Generative tasks and symbolic music are not included in the first release.
When Not To Use
- If you need end-to-end generative evaluation (music synthesis), MARBLE focuses on discriminative tasks only.
- When your target data is proprietary or out-of-distribution relative to the MARBLE datasets, leaderboard numbers may not translate.
- For production-quality source separation, rely on task-specific specialized pipelines rather than frozen MARBLE features alone.
Failure Modes
- Frozen universal embeddings can perform poorly on tasks requiring fine temporal detail (e.g., source separation, beat and downbeat tracking) unless frame-level features exist.
- Models trained with strong supervised pretraining can overfit to high-level labels and lose pitch/key sensitivity.
- Leaderboard scores can hide dataset licensing caveats: partial submissions are allowed for commercially licensed datasets, which may complicate comparisons.
Core Entities
Models
- MusiCNN
- CLMR
- Jukebox-5B
- MULE
- MAP-Music2Vec
- MAP-MERT-v0
- MAP-MERT-v1
Metrics
- ROC-AUC
- PR-AUC / AP
- Accuracy
- F1 (beat, 20 ms tolerance)
- R² (valence/arousal)
- SDR (Source-to-Distortion Ratio)
- CER (Character Error Rate)
- WER (Word Error Rate)
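Three of the metrics above can be sketched in a few lines, under stated assumptions. WER and CER are computed here as Levenshtein edit distance divided by reference length, and SDR as the plain energy ratio between the reference signal and the estimation error. The function names are hypothetical, and the SDR shown is the basic definition rather than the fuller evaluation procedures used by separation toolkits, so its values may differ from leaderboard numbers.

```python
import math
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def sdr_db(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Basic SDR in dB: 10 * log10(reference energy / error energy)."""
    signal = sum(s * s for s in reference)
    error = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if signal == 0.0:
        raise ValueError("reference signal has zero energy")
    if error == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / error)


print(wer("hold me close tonight", "hold me closer tonight"))
print(cer("abcd", "abed"))
print(sdr_db([1.0, 0.0, -1.0, 0.0], [0.5, 0.0, -0.5, 0.0]))
```

Lower is better for WER and CER, while higher is better for SDR, which is worth keeping in mind when reading results that mix these metrics.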
Datasets
- MagnaTagATune (MTT)
- MTG-Jamendo (MTG)
- GiantSteps
- GiantSteps-MTG-keys
- GTZAN
- Emomusic
- MTG-MoodTheme
- NSynth
- MedleyDB (MelodyDB)
- VocalSet
- GuitarSet
- MUSDB18
- MulJam / MulJam2.0
- Jamendo
Benchmarks
- MARBLE

