UrbanKGent: an LLM agent that builds city-scale knowledge graphs cheaper and more accurately using geospatial tools

Overview

Decision Snapshot: Ready For Pilot

The method is practical: distilled GPT-4 reasoning plus tool calls yield reproducible gains and large cost savings, but expect domain-specific cleanup and human validation before full production.

Citations: 3

Evidence Strength: 0.85

Confidence: 0.82

Risk Signals: 9

Trust Signals

Findings with numeric evidence: 3/3

Findings with evidence refs: 3/3

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Yes

At A Glance

Cost impact: 80%

Production readiness: 70%

Novelty: 65%

Authors

Yansong Ning, Hao Liu

Links

Abstract / PDF / Code / Data

Why It Matters For Business

UrbanKGent lets teams build large, practical city knowledge graphs with small open models, cutting inference costs roughly 20× and lowering data needs, so you can deploy KG-driven city apps faster and cheaper.

Who Should Care

Product Manager, ML Engineer, Data Scientist, CTO, Founder

Summary TLDR

UrbanKGent is an LLM-agent pipeline that turns raw urban text and geo-data into large urban knowledge graphs. It builds city-scale graphs by: (1) creating heterogeneity-aware, geospatial-infused instructions; (2) distilling GPT-4 chain-of-thought trajectories and refining them with external geospatial tools; (3) fine-tuning Llama-family models with LoRA. The result: fine-tuned 7/8/13B agents that match or beat GPT-4 on urban triplet extraction and relation completion, cut inference cost ≈20×, and construct comparable UrbanKGs using ≈20% of the original data volume.

Problem Statement

Building urban knowledge graphs is labor-intensive and brittle: prior pipelines need hand-crafted rules or expensive annotated corpora. Off‑the‑shelf LLMs struggle with heterogeneous urban relations (spatial, temporal, functional) and with geospatial computation (distance, containment). The paper asks: can a domain-tailored LLM agent combine knowledge-aware prompts, geospatial tools, and distilled reasoning to automate UrbanKG construction cost-effectively?

Main Contribution

UrbanKGent: an end-to-end LLM agent framework that combines heterogeneity-aware instructions, geospatial tool calls, trajectory refinement, and hybrid fine-tuning to build UrbanKGs.

A geospatial-infused instruction set plus a tool-augmented iterative trajectory refinement method that distills GPT-4 chains-of-thought into faithful training trajectories.

Key Findings

Fine-tuned UrbanKGent-13B outperforms GPT-4 on UrbanKGC accuracy on evaluated datasets.

Numbers: on NYC, roughly +15% (RTE) and +14% (KGC) accuracy vs GPT-4 on the evaluated splits

Practical Use: Small open models, when instruction-tuned with distilled and refined GPT-4 trajectories, can beat GPT-4 on urban relation extraction, so prioritize domain fine-tuning over always calling larger APIs.

Evidence Ref: Table 3 (UrbanKGent-13B vs GPT-4)

UrbanKGent constructs UrbanKGs at similar scale using far less input data.

Numbers: Constructed an UrbanKG with a similar number of entities and triplets using ~20% of the data used by the prior benchmark

Practical Use: You can reduce data collection effort: run the UrbanKGent pipeline on a smaller curated corpus and still recover large, rich UrbanKGs.

Evidence Ref: Table 4 (NYC-Large/CHI-Large vs UUKG)

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
RTE accuracy (human-evaluated)	0.55	0.42 (zero-shot, ZSL)	+0.13	NYC-Large	UrbanKGent-13B 0.55 vs ZSL 0.42	Table 9
KGC accuracy (human-evaluated)	0.46	0.31 (zero-shot, ZSL)	+0.15	NYC-Large	UrbanKGent-13B 0.46 vs ZSL 0.31	Table 9

What To Try In 7 Days

Run a quick pilot: fine-tune a Llama-7B model with LoRA on a few hundred domain instructions distilled from GPT-4.

Wrap simple geospatial utilities (distance, contains, intersects) and call them from your prompts to handle geometry accurately.

Validate outputs on 200 samples via human labels and GPT-4 evaluation to measure accuracy and calibrate filters.
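The validation step above can be sketched as follows: score the sampled outputs against human labels, score them against the automatic judge, and check how often the two raters agree before trusting the judge at scale. The `Judged` record and its field names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    """One sampled output with two independent verdicts (hypothetical schema)."""
    sample_id: str
    human_ok: bool  # human annotator marks the extracted triplet as correct
    judge_ok: bool  # automatic judge (e.g. GPT-4) marks it as correct


def accuracy(samples, use_human=True):
    """Fraction of samples marked correct by the chosen rater."""
    if not samples:
        raise ValueError("no samples to score")
    hits = sum(s.human_ok if use_human else s.judge_ok for s in samples)
    return hits / len(samples)


def agreement(samples):
    """Fraction of samples where the human and the automatic judge agree."""
    if not samples:
        raise ValueError("no samples to score")
    return sum(s.human_ok == s.judge_ok for s in samples) / len(samples)
```

If agreement is low on the pilot sample, the automatic judge's accuracy figure should be treated with caution and more human labels collected.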

Agent Features

Memory

trajectory distillation (saved CoT steps used as instruction targets)
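As a rough illustration of saving chain-of-thought steps as instruction targets, the sketch below packs one reasoning trajectory into a single training record and writes records as JSON Lines. The `instruction`/`output` keys follow a common Alpaca-style layout and are an assumption, not the schema used in the UrbanKGent repository.

```python
import json


def to_instruction_record(instruction, steps, answer):
    """Pack one reasoning trajectory into an instruction-tuning record.

    The key names are illustrative assumptions, not the repo's schema.
    """
    if not steps:
        raise ValueError("trajectory needs at least one reasoning step")
    numbered = [f"Step {i}: {s.strip()}" for i, s in enumerate(steps, start=1)]
    target = "\n".join(numbered + [f"Answer: {answer.strip()}"])
    return {"instruction": instruction.strip(), "output": target}


def write_jsonl(records, path):
    """Write one JSON object per line, a common input format for fine-tuning."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Keeping the reasoning steps inside the target text means the fine-tuned model learns to reproduce the intermediate reasoning, not only the final triplet.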

Planning

iterative self-refinement (verifier + updater)

multi-turn, multi-view instruction dialogs
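A verifier-plus-updater loop of this kind can be sketched generically. In the sketch below, `verify` and `update` are placeholder callables standing in for an LLM call or a tool check; their signatures and the `max_rounds` cap are assumptions for illustration, not the paper's exact procedure.

```python
def refine(trajectory, verify, update, max_rounds=3):
    """Iteratively repair a trajectory until the verifier accepts it.

    verify(trajectory) -> (ok: bool, feedback: str)
    update(trajectory, feedback) -> revised trajectory
    Returns (trajectory, accepted).
    """
    for _ in range(max_rounds):
        ok, feedback = verify(trajectory)
        if ok:
            return trajectory, True
        trajectory = update(trajectory, feedback)
    # One final check so the last update is not returned unverified.
    ok, _ = verify(trajectory)
    return trajectory, ok
```

Capping the number of rounds bounds the cost of each example, and returning the acceptance flag lets the caller drop trajectories that never pass verification instead of training on them.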

Tool Use

external geospatial toolkit (distance, containment, geohash, intersection)

self-programmed tool interfaces generated via GPT-4
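To make the toolkit idea concrete, here is a minimal standard-library sketch of three such utilities: great-circle distance, point-in-box containment, and box intersection. The function names and the bounding-box simplification are assumptions for illustration. A real deployment would likely use polygon geometry from a GIS library, and geohash encoding is omitted here.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bbox_contains(bbox, lat, lon):
    """True if the point lies inside bbox = (min_lat, min_lon, max_lat, max_lon)."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def bbox_intersects(a, b):
    """True if two bounding boxes overlap or touch.

    Boxes that cross the antimeridian are not handled by this sketch.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

Delegating these computations to deterministic functions means the model only has to decide which tool to call and with which arguments, not do the arithmetic itself.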

Frameworks

FireAct-style reasoning distillation

LoRA

Is Agentic

Yes

Architectures

LLM agent pipeline (Llama-family fine-tuned)

chain-of-thought distillation

Collaboration

uses GPT-4 both as trajectory teacher and as an automatic evaluator

Optimization Features

Token Efficiency

reduces GPT API dependence by using local LLMs

Infra Optimization

counts GPU runtime to estimate cost vs GPT API
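A comparison of this kind reduces to simple arithmetic. The sketch below normalizes both sides to USD per 1,000 tasks. Every price, token count, and runtime passed in is a placeholder the reader must supply; none of the values come from the paper.

```python
def gpu_cost_per_1k(runtime_hours, gpu_count, usd_per_gpu_hour, n_tasks):
    """Self-hosted cost in USD per 1,000 tasks, from measured GPU runtime."""
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    total_usd = runtime_hours * gpu_count * usd_per_gpu_hour
    return total_usd / n_tasks * 1000


def api_cost_per_1k(in_tokens, out_tokens, usd_per_1k_in, usd_per_1k_out):
    """API cost in USD per 1,000 tasks, from average tokens per task."""
    per_task = (in_tokens / 1000 * usd_per_1k_in
                + out_tokens / 1000 * usd_per_1k_out)
    return per_task * 1000


def savings_ratio(api_cost, gpu_cost):
    """How many times cheaper the self-hosted model is than the API."""
    if gpu_cost <= 0:
        raise ValueError("gpu_cost must be positive")
    return api_cost / gpu_cost
```

A ratio of 20 corresponds to a 95% lower per-task cost, since 1 - 1/20 = 0.95. This estimate covers inference only and leaves out fine-tuning and engineering time.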

Model Optimization

LoRA

System Optimization

A800 GPUs for batch inference

Training Optimization

hybrid instruction fine-tuning on distilled and refined GPT-4 trajectories

multi-view instruction mixture training

Inference Optimization

deploy smaller fine-tuned models (7B/8B/13B) instead of calling GPT-4

use task-specific prompts to reduce unnecessary tokens

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Yes

License: Unknown

Code URLs

https://github.com/usail-hkust/UrbanKGent

https://huggingface.co/usail-hkust/UrbanKGent-7B

https://huggingface.co/usail-hkust/UrbanKGent-8B

https://huggingface.co/usail-hkust/UrbanKGent-13B

Data URLs

NYC/CHI data collected from public sources (NYC.gov, Chicago.gov, OpenStreetMap, Google Maps, Wikipedia, C4); dataset files referenced in repo

Risks & Boundaries

Limitations

Evaluation relies heavily on GPT-4 as an automatic judge, which, while correlated with human ratings, is not flawless.

Applications shown are limited to two cities; generality to other urban contexts needs testing.

When Not To Use

When you need provable, traceable geospatial decisions without LLM ambiguity.

In safety-critical deployments before human-in-the-loop validation and legal review.

Failure Modes

LLM hallucinations producing incorrect triplets from noisy web text.

Tool invocation errors or mis-integration leading to wrong geospatial inferences.

Core Entities

Models

UrbanKGent-7B, UrbanKGent-8B, UrbanKGent-13B, Llama-2-7B, Llama-2-13B, Llama-3-8B, Llama-2-70B, Llama-3-70B, GPT-3.5, GPT-4, Vicuna-7B, Alpaca-7B, Mistral-7B

Metrics

Accuracy; GPT-4 confidence; inference latency (minutes per dataset); inference cost (USD per 1,000 tasks)

Datasets

NYC-Instruct, CHI-Instruct, NYC, CHI, NYC-Large, CHI-Large, UUKG (benchmark)

Benchmarks

UUKG
