Overview
Production Readiness
0.7
Novelty Score
0.6
Cost Impact Score
0.8
Citation Count
6
Why It Matters For Business
GLiNER offers production-ready open-type NER with compact 50M–300M-parameter models that beat zero-shot ChatGPT, cutting cost and latency while keeping accuracy competitive.
Summary TLDR
GLiNER is a compact NER system that treats open-type NER as matching entity-type prompts to span embeddings inside a bidirectional transformer. Trained on Pile-NER (≈45k texts, 240k spans, 13k types), it scores spans in parallel rather than generating tokens autoregressively, is cheap to run (50M–300M parameters), and achieves strong zero-shot results: GLiNER-L (0.3B) reaches an average F1 of 60.9 versus 47.5 for ChatGPT on OOD benchmarks. It generalizes reasonably well to many languages but struggles on noisy social media text and on some non-Latin-script languages (e.g., Bengali F1 0.89). The paper provides pointers to code and data.
Problem Statement
Open-type NER (identifying entities of any type in text) is usually done with large autoregressive LLMs that are costly, slow (they decode token by token), and hard to deploy. The paper asks: can a compact bidirectional model match or beat those LLMs in zero-shot open NER while being faster and cheaper?
Main Contribution
A new architecture (GLiNER) that encodes entity-type prompts and text together and matches entity embeddings to span embeddings in latent space.
Demonstration that compact BiLMs (50M–300M params) can outperform ChatGPT and some fine-tuned LLMs on zero-shot open NER benchmarks.
Training recipe and practical tricks: use Pile-NER for diverse entity types, negative entity sampling, and prompt dropping to improve robustness.
Fast, parallel decoding (span scoring) with O(n log n) selection, allowing multiple entity types to be predicted together.
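The matching idea in the contributions above can be sketched in a few lines. This is a minimal illustration, assuming the match score is a sigmoid over the dot product of a span embedding and an entity-type embedding. The function name `match_scores`, the 0.5 threshold, and all vectors are made up for the example; in the model the embeddings would come from the transformer encoder.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def match_scores(span_embs, type_embs):
    """Return scores[span][label] = sigmoid(dot(span_emb, type_emb))."""
    scores = {}
    for span, s_vec in span_embs.items():
        scores[span] = {
            label: sigmoid(sum(a * b for a, b in zip(s_vec, t_vec)))
            for label, t_vec in type_embs.items()
        }
    return scores

# Toy 3-dimensional embeddings (illustrative numbers, not model outputs).
span_embs = {
    (0, 1): [2.0, 0.1, -1.0],   # span covering tokens 0..1
    (3, 3): [-1.5, 2.0, 0.3],   # span covering token 3
}
type_embs = {
    "person": [1.0, 0.0, -0.5],
    "location": [-0.5, 1.0, 0.0],
}

scores = match_scores(span_embs, type_embs)

# Keep every (span, label) pair whose score clears an assumed threshold.
threshold = 0.5
predictions = [
    (span, label)
    for span, by_label in scores.items()
    for label, p in by_label.items()
    if p > threshold
]
print(predictions)
```

Because every (span, type) pair is scored independently, all entity types in the prompt are handled in one pass.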
Key Findings
GLiNER-L (300M) achieves average F1 60.9 on the OOD NER benchmark, outperforming ChatGPT.
A very small variant (GLiNER-S, 50M) still beats ChatGPT on the same benchmark.
Mid-sized GLiNER (90M) matches much larger UniNER (13B) on average.
Multilingual zero-shot: GLiNER-Multi (mdeBERTa-v3) outperforms ChatGPT on average but lags supervised per-language models.
Negative entity sampling during training materially affects precision/recall balance.
Results
OOD NER average F1
OOD NER average F1 (small)
OOD NER average F1 (mid)
Multilingual average F1
In-domain finetuning avg F1 (pretrained on Pile-NER)
Negative sampling F1
Who Should Care
What To Try In 7 Days
Run GLiNER-S (50M) on a representative NER task to measure cost/latency vs your current LLM API.
Fine-tune GLiNER on a small in-domain sample and compare zero-shot vs few-shot gains.
Adopt 50% negative entity sampling and random entity dropping during finetuning to balance precision and recall.
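For the cost/latency comparison in the first item, a small timing harness may be enough to start. The sketch below is model-agnostic: `measure_latency` and `stub_predict` are hypothetical names, and the stub would be replaced by a call into whichever NER model or LLM API is being compared.

```python
import statistics
import time

def measure_latency(predict, texts, labels, warmup=1):
    """Time predict(text, labels) per text; return summary stats in milliseconds."""
    for text in texts[:warmup]:
        predict(text, labels)  # warm-up calls are not timed
    timings_ms = []
    for text in texts:
        start = time.perf_counter()
        predict(text, labels)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "n": len(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }

def stub_predict(text, labels):
    """Hypothetical stand-in for a real model call; returns no entities."""
    return []

texts = [
    "Ada Lovelace was born in London.",
    "GLiNER was trained on Pile-NER.",
]
stats = measure_latency(stub_predict, texts, ["person", "location"])
print(stats)
```

Running the same harness over the same texts for each candidate system keeps the comparison like-for-like.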
Agent Features
Frameworks
- Span representation + entity embedding matching
Architectures
- Bidirectional transformer encoder (deBERTa / BiLM)
Optimization Features
Token Efficiency
- Predicts spans in parallel so no token-by-token generation
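To make the parallel-span idea concrete, the sketch below enumerates the candidate spans a span-based model would score in one pass. The maximum span width is an assumption of this example (this summary does not state one), and `enumerate_spans` is a hypothetical helper name.

```python
def enumerate_spans(num_tokens, max_width):
    """All (start, end) token spans, end inclusive, with at most max_width tokens."""
    return [
        (start, end)
        for start in range(num_tokens)
        for end in range(start, min(start + max_width, num_tokens))
    ]

# 5 tokens, spans of up to 3 tokens (both values chosen for illustration).
spans = enumerate_spans(num_tokens=5, max_width=3)
print(len(spans), spans)
```

With a fixed maximum width the number of candidates grows linearly with the number of tokens, so all of them can be scored as one batch.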
Model Optimization
- Use deBERTa-v3 backbone for best results
Training Optimization
- Pretrain on Pile-NER for transfer
- Negative entity sampling (≈50%)
- Random entity dropping (prompt drop) as regularization
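The two prompt-level tricks above can be sketched as a single prompt-building step. This is one plausible reading, not the paper's exact procedure: it assumes "≈50%" means negatives make up about half of the types in the prompt, and the drop probability of 0.2 is an arbitrary illustrative value. `build_training_prompt` is a hypothetical helper.

```python
import random

def build_training_prompt(positive_types, all_types, neg_ratio=0.5,
                          drop_prob=0.2, rng=None):
    """Return (prompt_types, kept_positive_types) for one training example."""
    if not 0.0 <= neg_ratio < 1.0:
        raise ValueError("neg_ratio must be in [0, 1)")
    rng = rng or random.Random()

    # Random entity dropping: omit some gold types from the prompt.
    kept = [t for t in positive_types if rng.random() >= drop_prob]
    if not kept and positive_types:
        kept = [rng.choice(positive_types)]  # keep at least one gold type

    # Negative sampling: add types that do not occur in this example,
    # sized so negatives are roughly neg_ratio of the final prompt.
    candidates = [t for t in all_types if t not in positive_types]
    n_neg = min(len(candidates), round(len(kept) * neg_ratio / (1.0 - neg_ratio)))
    negatives = rng.sample(candidates, n_neg)

    prompt = kept + negatives
    rng.shuffle(prompt)
    return prompt, kept

rng = random.Random(0)
prompt, kept = build_training_prompt(
    positive_types=["person", "location"],
    all_types=["person", "location", "date", "organization", "product"],
    rng=rng,
)
print(prompt, kept)
```

The caller would keep gold spans only for the types in `kept`, so dropped types are neither prompted nor treated as targets.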
Inference Optimization
- Parallel span scoring (no autoregressive decoding)
- Greedy priority-queue decoding with O(n log n)
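A greedy highest-score-first selection of non-overlapping spans can be sketched as follows. It assumes flat (non-nested) NER, and for readability it checks overlaps with a linear scan over the spans already selected, so only the sorting step is O(n log n); an interval index would be needed to keep the whole routine at that bound. `greedy_decode` and the candidate scores are made up for the example.

```python
def greedy_decode(scored_spans):
    """Pick non-overlapping spans, highest score first.

    scored_spans: list of (start, end, label, score), end inclusive.
    Returns the selected spans sorted by start position.
    """
    selected = []
    for cand in sorted(scored_spans, key=lambda s: s[3], reverse=True):
        start, end = cand[0], cand[1]
        # Accept only if the candidate lies fully before or after
        # every span selected so far.
        if all(end < s or start > e for s, e, _, _ in selected):
            selected.append(cand)
    return sorted(selected, key=lambda s: s[0])

candidates = [
    (0, 1, "person", 0.92),
    (1, 2, "location", 0.60),      # overlaps (0, 1)
    (4, 4, "location", 0.85),
    (3, 4, "organization", 0.70),  # overlaps (4, 4)
]
decoded = greedy_decode(candidates)
print(decoded)
```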
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Weaker performance on noisy social media (tweet datasets) compared to some baselines.
- Poor results on some non-Latin-script languages (e.g., Bengali F1 0.89) when the model is fine-tuned only on English data.
- Relies on Pile-NER quality (ChatGPT-annotated) which can inherit annotation noise.
When Not To Use
- When you require best-in-class per-language supervised performance (use per-language finetuned models).
- For highly noisy or colloquial text (tweets) without domain-specific finetuning.
- If legal/traceability requirements forbid models trained on crowd/LLM-annotated corpora.
Failure Modes
- High false positives if trained without negative entity sampling (low precision).
- High false negatives if negative sampling is too aggressive (low recall).
- Low performance on unseen scripts and languages when English-only fine-tuning is used.
Core Entities
Models
- GLiNER-S (50M)
- GLiNER-M (90M)
- GLiNER-L (0.3B)
- deBERTa-v3
- mdeBERTa-v3
- BERT
- RoBERTa
- ALBERT
- ELECTRA
- UniNER
- InstructUIE
- GoLLIE
- USM
- ChatGPT
- Vicuna-7B
- Vicuna-13B
Metrics
- F1-score (exact span match)
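Exact-span-match F1 counts a prediction as correct only when start, end, and type all agree with a gold annotation. A minimal sketch with made-up spans (`span_f1` is a hypothetical helper name):

```python
def span_f1(gold, pred):
    """Precision, recall, F1 over (start, end, label) triples with exact matching."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = [(0, 1, "person"), (4, 4, "location"), (7, 8, "organization")]
pred = [
    (0, 1, "person"),
    (4, 5, "location"),        # boundary off by one token: counts as wrong
    (7, 8, "organization"),
    (10, 10, "date"),          # spurious prediction
]
precision, recall, f1 = span_f1(gold, pred)
print(precision, recall, f1)
```

Here two of four predictions match exactly, so precision is 0.5, recall is 2/3, and F1 is 4/7.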
Datasets
- Pile-NER (Pile-derived; ~44.9k passages, 240k spans, 13k types)
- OOD NER Benchmark (CrossNER + MIT)
- 20 NER datasets (diverse domains)
- MultiCoNER (Multilingual Complex NER)
Benchmarks
- OOD NER Benchmark
- 20 NER datasets
- MultiCoNER (multilingual)
Context Entities
Models
- LLM prompting baselines (ChatGPT, Vicuna)
- Large instruct tuned models (InstructUIE, GoLLIE)
- UniNER (finetuned LLaMA)
Datasets
- Pile (source of Pile-NER)
- CrossNER, MIT datasets (components of OOD benchmark)

