Fine-tune LLMs with map-context prompts to predict population and economic indicators

October 10, 20238 min

Overview

Decision SnapshotReady For Pilot

GeoLLM is practical now for many regional prediction tasks: it needs map-data prompts, modest labels, and benefits from larger LLMs; however, expect variation with jittered labels and when local features are only visible in imagery.

Citations28

Evidence Strength0.80

Confidence0.85

Risk Signals10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 55%

Authors

Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David Lobell, Stefano Ermon

Links

Abstract / PDF / Code / Data

Why It Matters For Business

GeoLLM provides a low-cost geospatial signal from pretrained LLMs that can match or beat satellite nightlight baselines and work with hundreds to thousands of labels, making it useful where imagery is costly or missing.

Who Should Care

Summary TLDR

GeoLLM is a simple, prompt-driven fine-tuning recipe that extracts location knowledge already stored in large language models (LLMs). By adding a reverse-geocoded address and a short list of nearby places (from OpenStreetMap) to a prompt, fine-tuned LLMs (GPT-3.5, Llama 2, RoBERTa) predict population density, asset wealth, home value and related indicators with high accuracy. GPT-3.5 achieves Pearson r^2 up to ~0.78 on population tasks and outperforms classical baselines and a nightlight satellite baseline on multiple real datasets. The method is sample-efficient, geographically consistent, and works best when prompts include nearby places.

Problem Statement

Geospatial ML often needs expensive or incomplete covariates (satellite imagery, phone data). The paper asks: can knowledge already stored inside LLMs be extracted and used as low-cost geospatial covariates to predict location-level outcomes like population density and asset wealth?

Main Contribution

Show that pre-trained LLMs contain usable geospatial knowledge and that careful prompting unlocks it.

Introduce GeoLLM: fine-tune LLMs on prompts that include coordinates, reverse-geocoded address, and closest nearby places from OpenStreetMap.

Key Findings

GeoLLM yields large gains over prompt-based and classic baselines on real geospatial tasks.

Numbers70% improvement in Pearson's r^2 vs nearest-neighbor/XGBoost baselines (paper claim)

Practical UseIf you fine-tune an LLM with map-context prompts, expect substantially better correlation with ground truth than simple geospatial baselines on similar tasks.

Evidence RefAbstract; Conclusions

GPT-3.5 is the best model tested and outperforms smaller LLMs.

NumbersGPT-3.5 > Llama 2 by 19% and > RoBERTa by 51% in mean r^2 across evaluated tasks

Practical UseUse larger, better-pretrained LLMs when possible — performance scales with model/pretraining size.

Evidence RefAbstract; Section 4.3; Fig.3

Results

MetricValueBaselineDeltaSplit / DatasetEvidenceEvidence Ref
Pearson's r^2 (Population, 10k samples)0.78 (GPT-3.5)k-NN 0.36; Nightlight 0.68≈ +1540 points vs baselinesWorldPop / globalTable 1: Population 10,000 samples rowTable 1
Pearson's r^2 (Asset Wealth, 10k samples)0.75 (GPT-3.5)Nightlight 0.55; XGBoost-FT 0.64≈ +1020 points vs strong baselinesDHS (SustainBench)Table 1 asset wealth rowTable 1

What To Try In 7 Days

Fine-tune a hosted LLM (e.g., GPT-3.5) on 1,000 labeled points using prompts that include reverse-geocoded address + 10 nearby places.

Compare predictions to a nightlight baseline and a simple XGBoost on prompt fields to check uplift.

Run an ablation: drop nearby-places from prompts to confirm sensitivity and measure error change.

Optimization Features

Token Efficiency
Use concise prompt template to limit token length
Model Optimization
LoRA
Training Optimization
LoRAMixed precision and gradient checkpointing for Llama 2

Reproducibility

Risks & Boundaries

Limitations

Performance drops when dataset labels are spatially 'jittered' (DHS coordinate noise).

Method relies on quality of map data; OpenStreetMap is free but noisier than paid maps.

When Not To Use

When you need per-building or street-level visual detail only visible in imagery.

When exact, auditable data provenance is required from primary sensors.

Failure Modes

Hallucination or incorrect mapping of coordinates if prompt context is poor.

Bias: worse absolute error in underrepresented or sparsely populated regions.

Core Entities

Models

GPT-3.5Llama 2 (7B)RoBERTa (base)GPT-2LoRA

Metrics

Pearson's r^2Mean absolute error (MAE)

Datasets

WorldPopDHS (SustainBench)USCB (Census/ACS)Zillow (ZHVI)

Benchmarks

GeoLLM benchmark (this paper)SustainBenchWorldPop population mosaic