Use LLM token embeddings plus optional summarization to map job text to standardized occupation codes

September 18, 2023 · 6 min

Overview

Production Readiness

0.55

Novelty Score

0.65

Cost Impact Score

0.6

Citation Count

5

Authors

Nan Li, Bo Kang, Tijl De Bie

Links

Abstract / PDF

Why It Matters For Business

LLM4Jobs offers a practical unsupervised way to map job text to standard occupation codes with better accuracy than off-the-shelf rule-based tools, which lowers annotation cost and supports downstream analytics and recommendation systems.

Summary TLDR

LLM4Jobs is an unsupervised two-phase pipeline that uses decoder-only LLMs to embed standardized occupation descriptions and job texts, then ranks codes by vector similarity. An optional LLM-based summarization step helps with long, noisy job postings. Evaluated on two GPT-4–generated synthetic datasets (GenEasy, GenHard) and a 100-sample manually annotated real-world dataset, LLM4Jobs (Vicuna-33B backbone) beats unsupervised baselines (CASCOT and GPT-4 zero-shot) across most metrics and granularities. Code and datasets are publicly released.
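The retrieval step described above can be sketched as a nearest-neighbor lookup over precomputed embeddings. This is a minimal illustration, not the authors' implementation: the summary only says "vector similarity", so cosine similarity is an assumption here, and the code labels and 3-dimensional vectors are hypothetical stand-ins for real LLM embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_codes(query_vec, code_vecs, k=5):
    """Return the k (code, score) pairs most similar to the query, best first."""
    scored = [(code, cosine(query_vec, vec)) for code, vec in code_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy data: in practice each vector would be the LLM embedding
# of an occupation description (code_vecs) or of a job posting (query_vec).
code_vecs = {
    "CODE_A": [0.9, 0.1, 0.0],
    "CODE_B": [0.0, 1.0, 0.1],
    "CODE_C": [0.1, 0.0, 1.0],
}
query_vec = [0.8, 0.2, 0.1]
top = rank_codes(query_vec, code_vecs, k=2)
```

Because the occupation descriptions are fixed, their embeddings only need to be computed once; each new posting then costs one embedding call plus a similarity scan.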

Problem Statement

Mapping free-text job postings and resumes to standardized occupation codes is hard because titles are noisy, descriptions are long and promotional, and labeled data is scarce. Existing unsupervised tools are rule- or keyword-based and perform poorly at fine-grained coding.

Main Contribution

LLM4Jobs: an unsupervised, two-phase method that embeds occupation definitions and queries with a decoder-only LLM and retrieves codes by vector similarity.

Showed that optional LLM-based summarization improves accuracy on long, noisy real job postings.

Released two synthetic datasets (GenEasy, GenHard) generated with GPT-4 and a 100-item manually annotated real-world dataset, plus open-source code.

Key Findings

LLM4Jobs outperforms unsupervised baselines on evaluated datasets.

Numbers: GenEasy (Level 3) HR@1: LLM4Jobs 0.724 vs CASCOT 0.380 vs GPT-4 0.476

Summarization helps on real-world, long job postings.

Numbers: Real-world Level 4 HR@1: adaptive summary truncation 0.390 vs no-summary 0.270
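One way to make summarization optional is to gate it on posting length, so short texts are embedded as-is. The sketch below is an assumption about how such a gate could look, not the paper's code: the 300-word cutoff, the `prepare_query` name, and the placeholder `first_sentences` summarizer (which stands in for an LLM call) are all illustrative.

```python
from typing import Callable

def prepare_query(text: str, summarize: Callable[[str], str], max_words: int = 300) -> str:
    """Summarize only when the posting exceeds max_words; otherwise pass it through."""
    if len(text.split()) > max_words:
        return summarize(text)
    return text

def first_sentences(text: str, n: int = 2) -> str:
    # Placeholder summarizer: keeps the first n sentences.
    # A real setup would call an LLM with a summarization prompt here.
    parts = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(parts[:n]) + "."

short_posting = "Backend developer. Builds APIs."
long_posting = "We build payment APIs. " + "Join our great team. " * 200

short_query = prepare_query(short_posting, first_sentences)
long_query = prepare_query(long_posting, first_sentences)
```

Passing the summarizer in as a function keeps the gate independent of any particular model, which makes it easy to compare top-1 accuracy with and without the step.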

Performance drops at finer code granularity.

Numbers: Across methods, HR@1 decreases from Level 3 to Level 5+ (e.g., GenHard LLM4Jobs L3 0.576 -> L5+ 0.378)

Larger, human-aligned models work better but cost more.

Numbers: Vicuna-33B outperforms the Vicuna-13B and LLaMA-13B variants across multiple plots (Figures 2–4)

Results

HR@1 (GenEasy, Level 3)

Value: 0.724

Baseline: CASCOT 0.380, GPT-4 0.476

HR@1 (GenHard, Level 3)

Value: 0.576

Baseline: CASCOT 0.224, GPT-4 0.354

HR@1 (Real-world, Level 4)

Value: 0.390

Baseline: CASCOT 0.100, GPT-4 0.320

Who Should Care

What To Try In 7 Days

Run LLM4Jobs with an open Vicuna model on a small sample of your postings and compare its top-5 codes against those from your existing rule-based tool.

Enable adaptive summarization for postings >300 words and measure change in top-1 accuracy.

Start with truncation mapping in production: return higher-level codes rather than the finest-grained ones to reduce misclassification risk.
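Truncation mapping can be sketched as cutting a fine-grained predicted code back to a coarser level. The code format is an assumption here: a four-digit block followed by dot-separated extensions, where levels 1 to 4 take leading digits and each further level adds one dotted segment. The example codes and both function names are hypothetical.

```python
def truncate_code(code: str, level: int) -> str:
    """Truncate a hierarchical code to the given 1-based level."""
    if level < 1:
        raise ValueError("level must be >= 1")
    head, *tail = code.split(".")
    if level <= len(head):
        return head[:level]
    # Levels beyond the digit block keep one dotted segment per extra level.
    extra = tail[: level - len(head)]
    return ".".join([head, *extra])

def truncate_ranking(ranked: list[str], level: int) -> list[str]:
    """Truncate every code in a ranked list, keeping the first occurrence of each."""
    seen: set[str] = set()
    result: list[str] = []
    for code in ranked:
        short = truncate_code(code, level)
        if short not in seen:
            seen.add(short)
            result.append(short)
    return result

# Hypothetical fine-grained predictions, best first.
ranked = ["2512.1", "2512.3", "2513.1"]
coarse = truncate_ranking(ranked, level=4)
```

Deduplicating after truncation matters because several fine-grained candidates often collapse into the same parent, which would otherwise waste top-k slots.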

Reproducibility

Code Available

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Real-world evaluation set is small (100 samples), which limits how far the results generalize.
  • Synthetic datasets were generated with GPT-4 and may not fully capture real-world nuance.
  • Fine-tuning and broader prompt engineering were not explored and might improve results.
  • Some taxonomic ambiguities remain where semantic embeddings conflate distinct codes.

When Not To Use

  • When you need guaranteed high accuracy on very fine-grained (level 5+) codes without human review.
  • Where legal or safety-critical decisions require certified coding and audited deterministic rules.
  • If you cannot host larger LLMs and need very low-latency, low-cost inference without cloud GPUs.

Failure Modes

  • Hallucination: zero-shot LLMs (GPT-4 baseline) sometimes suggest non-existent or incorrect detailed codes.
  • Semantic overlap: embeddings can place different high-level codes close together, producing top-k confusion.
  • Poor summarization: noisy summaries can fail to capture core duties and mislead retrieval.

Core Entities

Models

  • Vicuna-33B
  • Vicuna-13B
  • Vicuna-7B
  • LLaMA-2-13B
  • LLaMA-2-13B-chat
  • GPT-4
  • CASCOT

Metrics

  • HR@1
  • HR@5
  • HR@10
  • MRR@5
  • NDCG@5
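For readers who want to reproduce these metrics on their own sample, the sketch below uses the standard definitions under one stated assumption: each posting has exactly one gold code with binary relevance, so the ideal DCG is 1 and NDCG reduces to 1/log2(rank + 1). The function names and toy rankings are illustrative and not taken from the released code.

```python
import math

def rank_of(gold, ranked, k):
    """1-based rank of the gold code within the top k, or None if absent."""
    top = ranked[:k]
    return top.index(gold) + 1 if gold in top else None

def hit_rate(gold, ranked, k):
    return 1.0 if rank_of(gold, ranked, k) else 0.0

def mrr(gold, ranked, k):
    r = rank_of(gold, ranked, k)
    return 1.0 / r if r else 0.0

def ndcg(gold, ranked, k):
    # With a single relevant item the ideal DCG is 1, so no extra normalization is needed.
    r = rank_of(gold, ranked, k)
    return 1.0 / math.log2(r + 1) if r else 0.0

def mean_metric(metric, golds, rankings, k):
    """Average a per-sample metric over a dataset."""
    return sum(metric(g, r, k) for g, r in zip(golds, rankings)) / len(golds)

# Toy example: two postings with hypothetical gold codes and ranked predictions.
golds = ["A", "B"]
rankings = [["A", "C", "B"], ["C", "B", "A"]]
hr1 = mean_metric(hit_rate, golds, rankings, k=1)
mrr5 = mean_metric(mrr, golds, rankings, k=5)
ndcg5 = mean_metric(ndcg, golds, rankings, k=5)
```

In this toy case the first posting is a hit at rank 1 and the second at rank 2, so HR@1 is 0.5 while MRR@5 and NDCG@5 give partial credit for the near miss.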

Datasets

  • GenEasy (synthetic, 1000 samples)
  • GenHard (synthetic, 1000 samples, hard titles)
  • Real-world Indeed dataset (100 annotated job postings)
  • ESCO/ISCO taxonomies

Benchmarks

  • HR@k
  • MRR@k
  • NDCG@k