Multi-layer trainable pooling + bidirectional attention helps similarity and retrieval; trade-offs exist for clustering/classification

September 4, 2024 · 7 min

Overview

Production Readiness

0.6

Novelty Score

0.45

Cost Impact Score

0.3

Citation Count

0

Authors

Yixuan Tang, Yi Yang

Links

Abstract / PDF

Why It Matters For Business

Small architecture changes in pooling and attention shift embedding quality between search/STS and clustering/classification. Choose the pooling and attention configuration by task: Multi-Layers Trainable Pooling with bidirectional attention for search and similarity, and the simpler EOS-last pooling with causal attention for clustering or classification.

Summary TLDR

This paper runs a controlled study that fine-tunes the same LLMs on the same training data while varying only the pooling and attention choices. Key result: combining a new Multi-Layers Trainable Pooling (which uses hidden states from all layers) with bidirectional attention gives statistically significant improvements on semantic textual similarity (STS) and retrieval on the MTEB benchmark. However, that configuration can hurt clustering and classification. Results are validated on Mistral-7B-v0.1 and Qwen2-0.5B using Wilcoxon tests and public training datasets (≈1.4M examples). The code is released.
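The difference between the two attention settings can be illustrated with a minimal sketch. It assumes a single unpadded sequence and uses NumPy; the function names `attention_mask` and `eos_last_pool` are hypothetical and are not taken from the released code.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Boolean mask: entry [i, j] is True when token i may attend to token j."""
    full = np.ones((seq_len, seq_len), dtype=bool)
    # Causal: keep only the lower triangle (each token sees itself and the past).
    return np.tril(full) if causal else full

def eos_last_pool(hidden: np.ndarray) -> np.ndarray:
    """EOS-last pooling: take the final token's hidden state as the embedding.

    hidden has shape (seq_len, d); assumes no right padding.
    """
    return hidden[-1]

causal = attention_mask(4, causal=True)
bidirectional = attention_mask(4, causal=False)

# Number of tokens each position can attend to under each setting.
visible_causal = causal.sum(axis=1)
visible_bidir = bidirectional.sum(axis=1)
```

Under the causal mask only the last position sees the whole sequence, which is why EOS-last pooling is the natural pairing for causal attention; with the bidirectional mask every position sees every token.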

Problem Statement

Existing LLM-based embedding papers differ in datasets, base models, pooling, and attention, making it unclear which design choices actually drive gains. The paper seeks fair, statistically tested comparisons of pooling and attention for LLM embeddings.

Main Contribution

A controlled, large-scale comparison of five pooling+attention combinations trained on the same 1.4M-example dataset and base LLMs.

Introduction of Multi-Layers Trainable Pooling: a cross-attention pooling layer that combines EOS/mean outputs from all LLM layers with learnable layer weights.

Statistical evaluation (Wilcoxon Signed Rank) across MTEB tasks showing task-dependent trade-offs: gains on STS/retrieval but losses on clustering/classification.

Robustness check on a smaller base model (Qwen2-0.5B) and release of code and models for replication.
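One plausible reading of the Multi-Layers Trainable Pooling idea is sketched below in NumPy. This is not the authors' implementation: the softmax parametrization of the layer weights, the single trainable query, and the projection matrices `w_k` and `w_v` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_layer_pool(layer_eos, layer_logits, query, w_k, w_v):
    """Pool per-layer EOS states into one embedding (illustrative sketch).

    layer_eos:    (num_layers, d) EOS hidden state taken from every layer
    layer_logits: (num_layers,)   trainable per-layer weights (assumed softmaxed)
    query:        (d,)            trainable cross-attention query (assumed single)
    w_k, w_v:     (d, d)          trainable key/value projections (assumed)
    """
    # Step 1: scale each layer's vector by its learned layer weight.
    weights = softmax(layer_logits)
    weighted = layer_eos * weights[:, None]
    # Step 2: cross-attention of the trainable query over the layers.
    keys = weighted @ w_k
    values = weighted @ w_v
    scores = keys @ query / np.sqrt(query.shape[0])
    attn = softmax(scores)
    return attn @ values  # shape (d,)

rng = np.random.default_rng(0)
num_layers, d = 4, 8
layer_eos = rng.normal(size=(num_layers, d))
embedding = multi_layer_pool(
    layer_eos,
    layer_logits=np.zeros(num_layers),
    query=rng.normal(size=d),
    w_k=rng.normal(size=(d, d)),
    w_v=rng.normal(size=(d, d)),
)
```

In a real model all five parameter tensors would be trained jointly with the LoRA adapters; here they are random or zero only so the sketch runs.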

Key Findings

Multi-Layers Trainable Pooling + bidirectional attention (Model 5) gives the best STS and retrieval scores on Mistral-7B.

Numbers: STS +0.0166; Retrieval +0.0226 vs EOS-last + causal (Table 4)

Switching to bidirectional attention consistently helps retrieval but harms clustering.

Numbers: Retrieval gains +0.009 to +0.0111; Clustering drops −0.0229 to −0.0417 (Table 3)

Adding a trainable pooling layer helps STS under causal attention but not other tasks.

Numbers: Last-layer trainable pooling vs EOS-last: STS +0.0129* (Table 2)

Smaller base models show the same directional effects but larger relative variance and smaller overall gains.

Numbers: Qwen2-0.5B: Model5 STS +0.0372* but classification −0.0393* (Table 5)

Results

STS (Mistral-7B avg)

Value: Model5 0.8468 vs Model1 0.8302

Baseline: Model1 (EOS-last + causal) 0.8302

Retrieval (Mistral-7B avg, NDCG@10)

Value: Model5 0.5620 vs Model1 0.5394

Baseline: Model1 (EOS-last + causal) 0.5394

Classification (accuracy)

Value: Model5 0.7101 vs Model1 0.7244

Baseline: Model1 (EOS-last + causal) 0.7244

Clustering (Mistral-7B avg V-measure)

Value: Model5 0.4257 vs Model1 0.4503

Baseline: Model1 (EOS-last + causal) 0.4503
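The size of each trade-off follows directly from the averages reported above. A small sketch of the arithmetic, using only those four pairs of numbers (the helper name `score_deltas` is hypothetical):

```python
# (Model5, Model1) averages copied from the Results entries above.
reported = {
    "STS": (0.8468, 0.8302),
    "Retrieval": (0.5620, 0.5394),
    "Accuracy": (0.7101, 0.7244),
    "Clustering": (0.4257, 0.4503),
}

def score_deltas(scores):
    """Return {task: (absolute delta, relative delta in percent)} for Model5 vs Model1."""
    out = {}
    for task, (model5, model1) in scores.items():
        absolute = model5 - model1
        out[task] = (round(absolute, 4), round(100 * absolute / model1, 2))
    return out

deltas = score_deltas(reported)
for task, (absolute, relative_pct) in deltas.items():
    print(f"{task}: {absolute:+.4f} ({relative_pct:+.2f}%)")
```

The signs make the trade-off explicit: STS and retrieval move up while accuracy and clustering move down for the same configuration change.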

Who Should Care

What To Try In 7 Days

Run a quick A/B: replace EOS-last+causal with Multi-Layers Trainable Pooling + bidirectional on your retrieval index and measure NDCG@k.

If using causal models for similarity only, add a lightweight last-layer trainable pooling and test STS Spearman.

Add Wilcoxon (or paired) tests across your datasets to confirm significance before deploying model changes.
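The paired significance check can be sketched in pure Python as below. This is a minimal exact two-sided Wilcoxon signed-rank test for small numbers of datasets, not the paper's evaluation script; the per-dataset scores are made-up placeholders. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
def wilcoxon_signed_rank(new_scores, old_scores):
    """Exact two-sided Wilcoxon signed-rank test on paired per-dataset scores.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(new_scores, old_scores) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # ranks are stored doubled so half-integer ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # 2 * average of ranks i+1..j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # Null distribution: each rank is positive or negative with probability 1/2.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    w_min2 = min(w_plus2, total2 - w_plus2)
    p_value = min(1.0, 2.0 * sum(counts[: w_min2 + 1]) / 2 ** n)
    return w_plus2 / 2, p_value

# Hypothetical NDCG@10 per dataset for a candidate and a baseline model.
candidate = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74]
baseline = [0.60, 0.66, 0.54, 0.73, 0.62, 0.68]
w_plus, p_value = wilcoxon_signed_rank(candidate, baseline)
```

With only a handful of datasets the smallest achievable p-value is coarse, so a consistent win on every dataset is needed before a small average gain counts as significant.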

Optimization Features

Training Optimization

  • LoRA
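For readers unfamiliar with LoRA, the idea is to freeze the base weight matrix and train only a low-rank update. A minimal NumPy sketch of the forward pass follows; the shapes, rank, and `alpha` value are illustrative assumptions, since this summary does not state the hyperparameters used in the study.

```python
import numpy as np

def lora_forward(x, w, a, b, alpha):
    """Linear layer with a LoRA update: y = x W^T + (alpha / r) * x A^T B^T.

    x: (batch, d_in)   input activations
    w: (d_out, d_in)   frozen base weight
    a: (r, d_in)       trainable down-projection
    b: (d_out, r)      trainable up-projection (commonly initialised to zero)
    """
    r = a.shape[0]
    return x @ w.T + (alpha / r) * (x @ a.T) @ b.T

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4          # illustrative sizes only
x = rng.normal(size=(2, d_in))
w = rng.normal(size=(d_out, d_in))
a = rng.normal(size=(rank, d_in))
b = np.zeros((d_out, rank))            # zero init: the adapter starts as a no-op

y = lora_forward(x, w, a, b, alpha=8.0)
trainable = a.size + b.size            # parameters updated during fine-tuning
frozen = w.size                        # parameters left untouched
```

Because `b` starts at zero, the adapted layer initially reproduces the base model exactly, and only `a` and `b` receive gradients during fine-tuning.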

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Evaluations are limited to the MTEB suite; other real-world datasets may behave differently.
  • Mean pooling was excluded; conclusions do not cover all pooling options.
  • Fine-tuning used LoRA and a fixed training recipe (1,000 steps); different tuning or longer training may change results.
  • Only two base LLM sizes were tested; behavior may differ for much larger or different-architecture models.

When Not To Use

  • When classification or clustering quality is the priority: Multi-Layers Trainable Pooling with bidirectional attention can reduce performance on these tasks.
  • When using small/edge LLMs without capacity to benefit from complex pooling layers.
  • When latency or compute budget prohibits extra cross-attention pooling layers at inference.

Failure Modes

  • Bidirectional attention adds context and can introduce noise, harming clustering coherence.
  • Multi-layer pooling increases compute and parameters; on small models it can overfit and give no win.
  • Reported gains are small; without statistical tests, engineers may mistake noise for improvement.

Core Entities

Models

  • Mistral-7B-v0.1
  • Qwen2-0.5B
  • NV-embed
  • E5-mistral-7b-instruct

Metrics

  • STS: Spearman of cosine
  • Retrieval: NDCG@10
  • Accuracy
  • Clustering: V-measure
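The STS and retrieval metrics listed above can be sketched in pure Python as follows. This is an illustrative version, not the MTEB implementation: the NDCG variant uses linear gain and computes the ideal ranking from the labels passed in, and the toy embeddings and gold scores are made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def average_ranks(values):
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def dcg(relevances, k):
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """ranked_relevances: relevance labels in the order the system ranked them."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# STS: correlate cosine similarities of embedding pairs with gold scores.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 0.0], [1.0, 1.0]), ([1.0, 0.0], [2.0, 0.0])]
gold = [0.0, 2.5, 5.0]
sts_score = spearman([cosine(u, v) for u, v in pairs], gold)

# Retrieval: the only relevant document was ranked second.
retrieval_score = ndcg_at_k([0, 1, 0])
```

Spearman depends only on the ordering of the similarities, so an embedding change that rescales cosine values without reordering pairs leaves the STS score unchanged.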

Datasets

  • Custom 1.4M training mix (MSMARCO, NQ, SQuAD, Quora, AllNLI, HotpotQA, etc.)
  • MTEB (evaluation)

Benchmarks

  • MTEB

Context Entities

Models

  • Llama3-8B

Metrics

  • Wilcoxon Signed Rank p-value (statistical test)

Datasets

  • MSMARCO
  • HotpotQA
  • STSB
  • Quora
  • NQ
  • SQuAD
  • TriviaQA

Benchmarks

  • Hugging Face MTEB leaderboard