Use LLM token embeddings plus optional summarization to map job text to standardized occupation codes

September 18, 2023 · 6 min

Overview

Decision Snapshot: Needs Validation

The method is simple to run with open LLMs and shows consistent gains on two synthetic datasets and one small real-world dataset. The evidence is limited by the size of the real-world set (100 samples) and by the fact that the synthetic data was generated with GPT-4.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 55%

Novelty: 65%

Authors

Nan Li, Bo Kang, Tijl De Bie

Links

Abstract / PDF / Code / Data

Why It Matters For Business

LLM4Jobs gives a practical unsupervised route to map job text to standard occupation codes with better accuracy than off-the-shelf rule-based tools, lowering annotation cost and enabling downstream analytics and recommendation systems.

Who Should Care

Summary TLDR

LLM4Jobs is an unsupervised two-phase pipeline that uses decoder-only LLMs to embed standardized occupation descriptions and job texts, then ranks codes by vector similarity. An optional LLM-based summarization step helps with long, noisy job postings. Evaluated on two GPT-4–generated synthetic datasets (GenEasy, GenHard) and a 100-sample manually annotated real-world dataset, LLM4Jobs (Vicuna-33B backbone) beats unsupervised baselines (CASCOT and GPT-4 zero-shot) across most metrics and granularities. Code and datasets are publicly released.

Problem Statement

Mapping free-text job postings and resumes to standardized occupation codes is hard because titles are noisy, descriptions are long and promotional, and labeled data is scarce. Existing unsupervised tools are rule- or keyword-based and perform poorly at fine-grained coding.

Main Contribution

LLM4Jobs: an unsupervised, two-phase method that embeds occupation definitions and queries with a decoder-only LLM and retrieves codes by vector similarity.

Evidence that an optional LLM-based summarization step improves accuracy on long, noisy real job postings.
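The two-phase idea (embed the occupation definitions once, then embed each query and rank codes by vector similarity) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for pooled LLM token embeddings, and the three occupation entries are illustrative examples rather than data from the paper.

```python
import math
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an LLM embedding: sparse word counts."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_codes(query: str, occupations: dict[str, str], top_k: int = 5):
    # Phase 1: embed every standardized occupation description.
    index = {code: toy_embed(desc) for code, desc in occupations.items()}
    # Phase 2: embed the query and score it against each occupation.
    q = toy_embed(query)
    scored = [(code, cosine(q, vec)) for code, vec in index.items()]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:top_k]

# Illustrative taxonomy entries (assumed for this example).
occupations = {
    "2512": "software developers design write and test software applications",
    "5120": "cooks plan prepare and cook meals in restaurants",
    "2221": "nursing professionals provide care and treatment for patients",
}
ranking = rank_codes(
    "backend engineer to design and test software applications", occupations
)
print(ranking[0][0])
```

In a real run, the index from phase 1 would be computed once and cached, since the occupation descriptions do not change between queries.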

Key Findings

LLM4Jobs outperforms unsupervised baselines on evaluated datasets.

Numbers: GenEasy (Level 3) HR@1: LLM4Jobs 0.724 vs CASCOT 0.380 vs GPT-4 0.476

Practical Use: Use LLM4Jobs instead of CASCOT or GPT-4 zero-shot for unsupervised occupation coding when labeled data is limited.

Evidence Ref: Table 2

Summarization helps on real-world, long job postings.

Numbers: Real-world Level 4 HR@1: adaptive summary truncation 0.390 vs no-summary 0.270

Practical Use: Add an LLM summarization step for long postings (>=300 words) to boost accuracy.

Evidence Ref: Table 1
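One way to apply summarization only to long postings is a simple word-count gate in front of the embedding step. The sketch below assumes that reading of "adaptive"; the 300-word threshold comes from the guidance above, while `prepare_query` and `first_sentences` are hypothetical names, and `first_sentences` is only a placeholder for a real LLM summarization prompt.

```python
from typing import Callable

WORD_THRESHOLD = 300  # taken from the ">=300 words" guidance; tune on your data

def prepare_query(
    posting: str,
    summarize: Callable[[str], str],
    threshold: int = WORD_THRESHOLD,
) -> str:
    """Summarize long postings; pass short ones through unchanged."""
    if len(posting.split()) >= threshold:
        return summarize(posting)
    return posting

def first_sentences(text: str, n: int = 2) -> str:
    """Placeholder summarizer (assumption): keep the first n sentences.
    A real run would call an LLM here instead."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(sentences[:n]) + "."

short_posting = "Line cook wanted. Evening shifts."
long_posting = "We are hiring a data analyst. " + "Great perks and a fun team. " * 80

short_query = prepare_query(short_posting, first_sentences)
long_query = prepare_query(long_posting, first_sentences)
print(len(long_posting.split()), "->", len(long_query.split()), "words")
```

Keeping the summarizer as an injected callable makes it easy to measure top-1 accuracy with and without the step on the same sample.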

Results

Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref
HR@1 (GenEasy, Level 3) | 0.724 | CASCOT 0.380, GPT-4 0.476 | LLM4Jobs +0.344 vs CASCOT | GenEasy | Table 2, LLM4Jobs truncation | Table 2
HR@1 (GenHard, Level 3) | 0.576 | CASCOT 0.224, GPT-4 0.354 | LLM4Jobs +0.352 vs CASCOT | GenHard | Table 3, LLM4Jobs truncation | Table 3

What To Try In 7 Days

Run LLM4Jobs with an open Vicuna model on a small sample of your postings and compare its top-5 codes against those from your existing rule-based tool.

Enable adaptive summarization for postings >300 words and measure change in top-1 accuracy.

In production, start with truncation mapping, which returns higher-level (coarser) codes, to reduce misclassification risk.
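The truncation step in the last item can be sketched as a prefix operation on a fine-grained code. This assumes ESCO-style codes in which the first four digits are the ISCO levels 1 to 4 and deeper levels are dot-separated suffixes (for example `2512.3`); check that assumption against the taxonomy files you actually use. `truncate_code` and `truncate_ranking` are hypothetical helper names.

```python
def truncate_code(code: str, level: int) -> str:
    """Cut a fine-grained code back to the requested level.
    Assumed layout: four ISCO digits (levels 1-4), then dot-separated parts."""
    if level < 1:
        raise ValueError("level must be >= 1")
    head, *tail = code.split(".")
    if level <= 4:
        return head[:level]
    return ".".join([head, *tail[: level - 4]])

def truncate_ranking(codes: list[str], level: int) -> list[str]:
    """Truncate a ranked list and drop duplicates, keeping the best rank."""
    seen: list[str] = []
    for code in codes:
        short = truncate_code(code, level)
        if short not in seen:
            seen.append(short)
    return seen

ranked = ["2512.3", "2512.1", "2513.2"]
print(truncate_ranking(ranked, 4))
print(truncate_ranking(ranked, 3))
```

Several fine-grained candidates often collapse into the same coarse code, which is why returning the higher level can be the safer default.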

Risks & Boundaries

Limitations

Real-world evaluation set is small (100 samples), limiting generality.

Synthetic datasets were generated with GPT-4 and may not fully capture real-world nuance.

When Not To Use

When you need guaranteed high accuracy on very fine-grained (level 5+) codes without human review.

Where legal or safety-critical decisions require certified coding and audited deterministic rules.

Failure Modes

Hallucination: zero-shot LLMs (GPT-4 baseline) sometimes suggest non-existent or incorrect detailed codes.

Semantic overlap: embeddings can place different high-level codes close together, producing top-k confusion.
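Both failure modes can be guarded against with cheap post-checks. The sketch below is one possible approach, not something described in the paper: it drops predicted codes that do not exist in the taxonomy, and flags a ranking for human review when the top two similarity scores are nearly tied. The function names and the `margin` value are hypothetical and would need tuning.

```python
def split_valid_codes(
    predicted: list[str], taxonomy: set[str]
) -> tuple[list[str], list[str]]:
    """Separate codes that exist in the taxonomy from unknown ones."""
    valid = [c for c in predicted if c in taxonomy]
    unknown = [c for c in predicted if c not in taxonomy]
    return valid, unknown

def needs_review(scored: list[tuple[str, float]], margin: float = 0.02) -> bool:
    """Flag a ranking (sorted by descending score) whose top two
    candidates are closer than `margin` (an assumed threshold)."""
    if len(scored) < 2:
        return False
    return scored[0][1] - scored[1][1] < margin

taxonomy = {"2512", "2513", "5120"}
valid, unknown = split_valid_codes(["2512", "9999", "5120"], taxonomy)
close_call = needs_review([("2512", 0.81), ("5120", 0.80)])
clear_call = needs_review([("2512", 0.81), ("5120", 0.60)])
print(valid, unknown, close_call, clear_call)
```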

Core Entities

Models

Vicuna-33B, Vicuna-13B, Vicuna-7B, LLaMA-2-13B, LLaMA-2-13B-chat, GPT-4, CASCOT

Metrics

HR@1, HR@5, HR@10, MRR@5, NDCG@5
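For readers who want to reproduce these metrics on their own sample, here is a minimal sketch using the standard definitions of hit rate, mean reciprocal rank, and NDCG at cutoff k. It assumes exactly one gold code per posting, so the ideal DCG is 1; the paper's exact evaluation code may differ, and the predictions in the example are made up for illustration.

```python
import math

def hit_at_k(ranked: list[str], gold: str, k: int) -> float:
    """1.0 if the gold code appears in the top k, else 0.0."""
    return 1.0 if gold in ranked[:k] else 0.0

def reciprocal_rank_at_k(ranked: list[str], gold: str, k: int) -> float:
    """1 / position of the gold code within the top k, else 0.0."""
    for pos, code in enumerate(ranked[:k], start=1):
        if code == gold:
            return 1.0 / pos
    return 0.0

def ndcg_at_k(ranked: list[str], gold: str, k: int) -> float:
    """With a single relevant item, NDCG reduces to 1 / log2(position + 1)."""
    for pos, code in enumerate(ranked[:k], start=1):
        if code == gold:
            return 1.0 / math.log2(pos + 1)
    return 0.0

def evaluate(predictions: list[list[str]], golds: list[str], k: int) -> dict[str, float]:
    """Average each metric over all postings."""
    n = len(golds)
    pairs = list(zip(predictions, golds))
    return {
        "HR": sum(hit_at_k(p, g, k) for p, g in pairs) / n,
        "MRR": sum(reciprocal_rank_at_k(p, g, k) for p, g in pairs) / n,
        "NDCG": sum(ndcg_at_k(p, g, k) for p, g in pairs) / n,
    }

# Made-up rankings for three postings (illustration only).
predictions = [["251", "252", "213"], ["512", "251", "222"], ["222", "251", "512"]]
golds = ["251", "251", "311"]
at_1 = evaluate(predictions, golds, k=1)
at_3 = evaluate(predictions, golds, k=3)
print(at_1, at_3)
```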

Datasets

GenEasy (synthetic, 1000 samples); GenHard (synthetic, 1000 samples, hard titles); Real-world Indeed dataset (100 annotated job postings); ESCO/ISCO taxonomies

Benchmarks

HR@k, MRR@k, NDCG@k