Use LLM token embeddings plus optional summarization to map job text to standardized occupation codes

Overview

Decision Snapshot: Needs Validation

The method is simple to run with open LLMs and shows consistent gains on two synthetic datasets and one small real-world dataset. However, the evidence is limited by the size of the real-world set (100 samples) and by quirks of the GPT-4-generated synthetic data.

Citations: 5

Evidence Strength: 0.70

Confidence: 0.78

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 4/4

Findings with evidence refs: 4/4

Results with explicit delta: 3/3

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 55%

Novelty: 65%

Authors

Nan Li, Bo Kang, Tijl De Bie

Why It Matters For Business

LLM4Jobs gives a practical unsupervised route to map job text to standard codes with better accuracy than off-the-shelf rule tools, lowering annotation cost and enabling downstream analytics and recommendation systems.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Founder

Summary TLDR

LLM4Jobs is an unsupervised two-phase pipeline that uses decoder-only LLMs to embed standardized occupation descriptions and job texts, then ranks codes by vector similarity. An optional LLM-based summarization step helps with long, noisy job postings. Evaluated on two GPT-4–generated synthetic datasets (GenEasy, GenHard) and a 100-sample manually annotated real-world dataset, LLM4Jobs (Vicuna-33B backbone) beats unsupervised baselines (CASCOT and GPT-4 zero-shot) across most metrics and granularities. Code and datasets are publicly released.

Problem Statement

Mapping free-text job postings and resumes to standardized occupation codes is hard because titles are noisy, descriptions are long and promotional, and labeled data is scarce. Existing unsupervised tools are rule- or keyword-based and perform poorly at fine-grained coding.

Main Contribution

LLM4Jobs: an unsupervised, two-phase method that embeds occupation definitions and queries with a decoder-only LLM and retrieves codes by vector similarity.

Showed that optional LLM-based summarization improves accuracy on long, noisy real job postings.
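
The two-phase retrieval described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the LLM embedding step (the paper uses a decoder-only LLM such as Vicuna-33B), and the codes and descriptions in `TAXONOMY` are made-up placeholders rather than real ESCO entries.

```python
import math
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an LLM embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(taxonomy, embed=toy_embed):
    """Phase 1: embed every occupation description once, ahead of time."""
    return {code: embed(desc) for code, desc in taxonomy.items()}


def rank_codes(job_text, index, embed=toy_embed, k=5):
    """Phase 2: embed the query and return the k most similar codes."""
    query = embed(job_text)
    scored = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [code for code, _ in scored[:k]]


# Placeholder taxonomy: codes and descriptions are illustrative, not ESCO data.
TAXONOMY = {
    "OCC-A": "develop and maintain software applications and write code",
    "OCC-B": "prepare and cook meals in a restaurant kitchen",
    "OCC-C": "teach students in primary school classrooms",
}

index = build_index(TAXONOMY)
top = rank_codes("We need an engineer to write code for web applications", index, k=2)
print(top)
```

Swapping `toy_embed` for a function that returns real LLM embeddings would leave the rest of the sketch unchanged, provided `cosine` is adapted to dense vectors.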

Key Findings

LLM4Jobs outperforms unsupervised baselines on evaluated datasets.

Numbers: GenEasy (Level 3) HR@1: LLM4Jobs 0.724 vs CASCOT 0.380 vs GPT-4 0.476

Practical Use: Use LLM4Jobs instead of CASCOT or GPT-4 zero-shot for unsupervised occupation coding when labeled data is limited.

Evidence Ref: Table 2

Summarization helps on real-world, long job postings.

Numbers: Real-world Level 4 HR@1: adaptive summary truncation 0.390 vs no-summary 0.270

Practical Use: Add an LLM summarization step for long postings (>=300 words) to boost accuracy.

Evidence Ref: Table 1
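
The length-gated summarization step can be sketched as a small wrapper. This is a rough illustration, not the paper's code: `first_sentences` is a hypothetical placeholder for the LLM summarizer, and treating exactly 300 words as "long" is an assumption about the boundary, so `max_words` is left as a parameter.

```python
def first_sentences(text, n=2):
    """Hypothetical placeholder summarizer: keep the first n sentences."""
    parts = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(parts[:n]) + "."


def prepare_query(job_text, summarize=first_sentences, max_words=300):
    """Summarize only long postings; short ones pass through unchanged."""
    if len(job_text.split()) >= max_words:
        return summarize(job_text)
    return job_text


short_posting = "Line cook wanted for a busy kitchen."
long_posting = "Great company. " * 200  # 400 words of filler

print(prepare_query(short_posting))
print(prepare_query(long_posting))
```

In a real setup, `summarize` would call an LLM with a prompt asking for the core duties of the role, and the result would then be embedded in place of the raw posting.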

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
HR@1 (GenEasy, Level 3)	0.724	CASCOT 0.380, GPT-4 0.476	LLM4Jobs +0.344 vs CASCOT	GenEasy	Table 2, LLM4Jobs truncation	Table 2
HR@1 (GenHard, Level 3)	0.576	CASCOT 0.224, GPT-4 0.354	LLM4Jobs +0.352 vs CASCOT	GenHard	Table 3, LLM4Jobs truncation	Table 3
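
The Delta column reports the gap to CASCOT only. The short sketch below recomputes it from the HR@1 values in the table and derives the corresponding gap to GPT-4 in the same way; the dictionary layout and the `deltas` helper are illustrative choices, not part of the source.

```python
# HR@1 at Level 3, copied from the results table above.
RESULTS = {
    "GenEasy": {"LLM4Jobs": 0.724, "CASCOT": 0.380, "GPT-4": 0.476},
    "GenHard": {"LLM4Jobs": 0.576, "CASCOT": 0.224, "GPT-4": 0.354},
}


def deltas(scores, system="LLM4Jobs"):
    """Absolute HR@1 gain of `system` over every other entry."""
    return {
        name: round(scores[system] - value, 3)
        for name, value in scores.items()
        if name != system
    }


for dataset, scores in RESULTS.items():
    print(dataset, deltas(scores))
```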

What To Try In 7 Days

Run LLM4Jobs with an open Vicuna model on a small sample of your postings and compare its top-5 codes against the output of your existing rule-based tool.

Enable adaptive summarization for postings >300 words and measure change in top-1 accuracy.

In production, start with truncation mapping, which returns higher-level (coarser) codes, to reduce misclassification risk.
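
One plausible reading of truncation mapping is to predict at the finest level and then cut each code back to the desired level. The sketch below assumes ISCO-style numeric codes in which each leading digit corresponds to one hierarchy level, and assumes any finer sub-code follows a dot. Both function names are hypothetical, and the input codes are arbitrary examples.

```python
def truncate_code(code, level):
    """Keep the first `level` digits of a digit-based hierarchical code."""
    digits = code.split(".")[0]  # drop any sub-code after a dot (assumed format)
    if not digits.isdigit():
        raise ValueError(f"expected a numeric code, got {code!r}")
    return digits[:level]


def truncate_ranking(ranked_codes, level):
    """Map a fine-grained ranking to a coarser level, keeping first occurrences."""
    seen = []
    for code in ranked_codes:
        parent = truncate_code(code, level)
        if parent not in seen:
            seen.append(parent)
    return seen


fine_ranking = ["2512", "2511", "2166"]
print(truncate_ranking(fine_ranking, 3))  # ['251', '216']
```

De-duplicating in rank order means two fine-grained candidates under the same parent reinforce a single coarse prediction instead of occupying two slots in the top-k list.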

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://github.com/aida-ugent/SkillGPT

https://github.com/aida-ugent/Occupation coding datasets

Data URLs

https://github.com/aida-ugent/Occupation coding datasets

Risks & Boundaries

Limitations

Real-world evaluation set is small (100 samples), limiting generality.

Synthetic datasets were generated with GPT-4 and may not fully capture real-world nuance.

When Not To Use

When you need guaranteed high accuracy on very fine-grained (level 5+) codes without human review.

Where legal or safety-critical decisions require certified coding and audited deterministic rules.

Failure Modes

Hallucination: zero-shot LLMs (GPT-4 baseline) sometimes suggest non-existent or incorrect detailed codes.

Semantic overlap: embeddings can place different high-level codes close together, producing top-k confusion.

Core Entities

Models

Vicuna-33B, Vicuna-13B, Vicuna-7B, LLaMA-2-13B, LLaMA-2-13B-chat, GPT-4, CASCOT

Metrics

HR@1, HR@5, HR@10, MRR@5, NDCG@5
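
For reference, these ranking metrics can be computed as in the sketch below. It uses the standard definitions and assumes each sample has exactly one correct code, in which case the ideal DCG is 1 and NDCG reduces to 1/log2(rank + 1). The function names are illustrative and the evaluation script used by the authors may differ.

```python
import math


def rank_of(gold, ranked):
    """1-based rank of the gold code, or None if it is absent."""
    return ranked.index(gold) + 1 if gold in ranked else None


def hit_rate_at_k(gold, ranked, k):
    """1.0 if the gold code appears in the top k, else 0.0."""
    return 1.0 if rank_of(gold, ranked[:k]) is not None else 0.0


def mrr_at_k(gold, ranked, k):
    """Reciprocal rank of the gold code within the top k, else 0.0."""
    r = rank_of(gold, ranked[:k])
    return 1.0 / r if r is not None else 0.0


def ndcg_at_k(gold, ranked, k):
    """NDCG with a single relevant item: 1 / log2(rank + 1) within the top k."""
    r = rank_of(gold, ranked[:k])
    return 1.0 / math.log2(r + 1) if r is not None else 0.0


def evaluate(samples, k):
    """Average each metric over (gold_code, ranked_codes) pairs."""
    n = len(samples)
    return {
        f"HR@{k}": sum(hit_rate_at_k(g, r, k) for g, r in samples) / n,
        f"MRR@{k}": sum(mrr_at_k(g, r, k) for g, r in samples) / n,
        f"NDCG@{k}": sum(ndcg_at_k(g, r, k) for g, r in samples) / n,
    }


# Toy predictions: gold code ranked first, ranked second, and missing.
samples = [
    ("A", ["A", "B", "C"]),
    ("B", ["A", "B", "C"]),
    ("Z", ["A", "B", "C"]),
]
print(evaluate(samples, k=5))
```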

Datasets

GenEasy (synthetic, 1000 samples); GenHard (synthetic, 1000 samples, hard titles); Real-world Indeed dataset (100 annotated job postings); ESCO/ISCO taxonomies

Benchmarks

HR@k, MRR@k, NDCG@k
