Overview
Production Readiness
0.55
Novelty Score
0.65
Cost Impact Score
0.6
Citation Count
5
Why It Matters For Business
LLM4Jobs offers a practical unsupervised way to map job text to standard occupation codes more accurately than off-the-shelf rule-based tools, which lowers annotation cost and supports downstream analytics and recommendation systems.
Summary TLDR
LLM4Jobs is an unsupervised two-phase pipeline that uses decoder-only LLMs to embed standardized occupation descriptions and job texts, then ranks codes by vector similarity. An optional LLM-based summarization step helps with long, noisy job postings. Evaluated on two GPT-4–generated synthetic datasets (GenEasy, GenHard) and a 100-sample manually annotated real-world dataset, LLM4Jobs (Vicuna-33B backbone) beats unsupervised baselines (CASCOT and GPT-4 zero-shot) across most metrics and granularities. Code and datasets are publicly released.
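The retrieval step described above can be illustrated with a minimal sketch. It assumes embeddings have already been computed; the short vectors below are toy stand-ins for LLM embeddings, the code strings are illustrative placeholders, and `cosine` and `rank_codes` are hypothetical helper names, not functions from the released code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_codes(query_vec, code_vecs, k=5):
    """Return the k (code, score) pairs most similar to the query, best first."""
    scored = [(code, cosine(query_vec, vec)) for code, vec in code_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy stand-ins: in the real pipeline these would be LLM embeddings of
# standardized occupation descriptions (assumption for illustration only).
code_vecs = {
    "2512": [0.9, 0.1, 0.0],
    "2221": [0.1, 0.9, 0.1],
    "5120": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # stand-in for an embedded job posting

top = rank_codes(query_vec, code_vecs, k=2)
for code, score in top:
    print(code, round(score, 3))
```

Because the occupation embeddings depend only on the taxonomy, they can be computed once and reused for every query; only the job text needs to be embedded at request time.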
Problem Statement
Mapping free-text job postings and resumes to standardized occupation codes is hard because titles are noisy, descriptions are long and promotional, and labeled data is scarce. Existing unsupervised tools are rule- or keyword-based and perform poorly at fine-grained coding.
Main Contribution
LLM4Jobs: an unsupervised, two-phase method that embeds occupation definitions and queries with a decoder-only LLM and retrieves codes by vector similarity.
Showed that optional LLM-based summarization improves accuracy on long, noisy real job postings.
Released two synthetic datasets (GenEasy, GenHard) generated with GPT-4 and a 100-item manually annotated real-world dataset, plus open-source code.
Key Findings
LLM4Jobs (Vicuna-33B backbone) outperforms the unsupervised baselines, CASCOT and GPT-4 zero-shot, on most metrics and granularities across the evaluated datasets.
Summarization helps on real-world, long job postings.
Performance drops at finer code granularity.
Larger, human-aligned models work better but cost more.
Results
HR@1 (GenEasy, Level 3)
HR@1 (GenHard, Level 3)
HR@1 (Real-world, Level 4)
Who Should Care
What To Try In 7 Days
Run LLM4Jobs with an open Vicuna model on a small sample of your postings and compare its top-5 codes against those from your existing rule-based tool.
Enable adaptive summarization for postings longer than 300 words and measure the change in top-1 accuracy.
In production, start with truncation mapping so the system returns higher-level codes, which reduces misclassification risk.
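A minimal sketch of two of the steps above, under stated assumptions. The 300-word threshold comes from the suggestion above; `needs_summary` and `truncate_code` are hypothetical helper names, and the truncation helper assumes purely positional codes where the first N characters identify the ancestor at level N (as with four-digit ISCO codes). Codes with other formats would need a lookup table instead.

```python
def needs_summary(text, max_words=300):
    """Return True when a posting is long enough to route through summarization."""
    return len(text.split()) > max_words

def truncate_code(code, level):
    """Map a fine-grained code to its ancestor by keeping the first `level` characters.

    Assumes a positional hierarchy (one character per level); this is an
    illustrative assumption, not a rule taken from the released code.
    """
    if not 1 <= level <= len(code):
        raise ValueError(f"level must be between 1 and {len(code)}")
    return code[:level]

posting = "word " * 301                   # synthetic 301-word posting
print(needs_summary(posting))             # long posting: summarize first
print(needs_summary("Short posting text"))
print(truncate_code("2512", 3))           # level-4 code reported at level 3
```

Reporting the truncated code trades detail for reliability, which matches the finding that accuracy drops at finer granularity.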
Reproducibility
Code Urls
Code Available
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Real-world evaluation set is small (100 samples), limiting generality.
- Synthetic datasets were generated with GPT-4 and may not fully capture real-world nuance.
- Fine-tuning and broader prompt engineering were not explored and might improve results.
- Some taxonomic ambiguities remain where semantic embeddings conflate distinct codes.
When Not To Use
- When you need guaranteed high accuracy on very fine-grained (level 5+) codes without human review.
- Where legal or safety-critical decisions require certified coding and audited deterministic rules.
- If you cannot host larger LLMs and need very low-latency, low-cost inference without cloud GPUs.
Failure Modes
- Hallucination: zero-shot LLMs (GPT-4 baseline) sometimes suggest non-existent or incorrect detailed codes.
- Semantic overlap: embeddings can place different high-level codes close together, producing top-k confusion.
- Poor summarization: noisy summaries can fail to capture core duties and mislead retrieval.
Core Entities
Models
- Vicuna-33B
- Vicuna-13B
- Vicuna-7B
- LLaMA-2-13B
- LLaMA-2-13B-chat
- GPT-4
- CASCOT
Metrics
- HR@1
- HR@5
- HR@10
- MRR@5
- NDCG@5
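For readers unfamiliar with these ranking metrics, the sketch below computes them for a single query using the standard definitions. It assumes exactly one gold code per posting, so the ideal DCG is 1; the paper's exact formulation may differ, and the function names are illustrative. Dataset-level scores are the mean of these per-query values.

```python
import math

def hit_rate_at_k(ranked, gold, k):
    """HR@k: 1.0 if the gold code appears in the top k, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0

def mrr_at_k(ranked, gold, k):
    """MRR@k: reciprocal of the gold code's rank within the top k, else 0.0."""
    for rank, code in enumerate(ranked[:k], start=1):
        if code == gold:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gold, k):
    """NDCG@k with a single relevant item, so the ideal DCG equals 1."""
    for rank, code in enumerate(ranked[:k], start=1):
        if code == gold:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranked = ["2221", "2512", "5120"]  # illustrative ranked predictions
gold = "2512"                      # illustrative gold label, at rank 2

print(hit_rate_at_k(ranked, gold, 1))
print(hit_rate_at_k(ranked, gold, 5))
print(mrr_at_k(ranked, gold, 5))
print(round(ndcg_at_k(ranked, gold, 5), 4))
```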
Datasets
- GenEasy (synthetic, 1000 samples)
- GenHard (synthetic, 1000 samples, hard titles)
- Real-world Indeed dataset (100 annotated job postings)
- ESCO/ISCO taxonomies
Benchmarks
- HR@k
- MRR@k
- NDCG@k

