Fine-tune LLMs with map-context prompts to predict population and economic indicators

Overview

Decision Snapshot: Ready For Pilot

GeoLLM is practical now for many regional prediction tasks: it needs only map-data prompts and a modest number of labels, and it benefits from larger LLMs. Expect weaker results when labels are spatially jittered or when the relevant local features are visible only in imagery.

Citations: 28

Evidence Strength: 0.80

Confidence: 0.85

Risk Signals: 10

Trust Signals

Findings with numeric evidence: 5/5

Findings with evidence refs: 5/5

Results with explicit delta: 4/4

Reproducibility

Status: Code + data available

Open source: Partial

At A Glance

Cost impact: 65%

Production readiness: 70%

Novelty: 55%

Authors

Rohin Manvi, Samar Khanna, Gengchen Mai, Marshall Burke, David Lobell, Stefano Ermon


Why It Matters For Business

GeoLLM provides a low-cost geospatial signal from pretrained LLMs that can match or beat satellite nightlight baselines and work with hundreds to thousands of labels, making it useful where imagery is costly or missing.

Who Should Care

CTO, ML Engineer, Data Scientist, Product Manager, Founder

Summary TLDR

GeoLLM is a simple, prompt-driven fine-tuning recipe that extracts location knowledge already stored in large language models (LLMs). The prompt for each location adds a reverse-geocoded address and a short list of nearby places (from OpenStreetMap); models fine-tuned on these prompts (GPT-3.5, Llama 2, RoBERTa) then predict population density, asset wealth, home value and related indicators. GPT-3.5 achieves Pearson's r^2 up to about 0.78 on population tasks and outperforms classical baselines and a nightlight satellite baseline on multiple real datasets. The method is sample-efficient, geographically consistent, and works best when prompts include nearby places.

Problem Statement

Geospatial ML often needs expensive or incomplete covariates (satellite imagery, phone data). The paper asks: can knowledge already stored inside LLMs be extracted and used as low-cost geospatial covariates to predict location-level outcomes like population density and asset wealth?

Main Contribution

Show that pre-trained LLMs contain usable geospatial knowledge and that careful prompting unlocks it.

Introduce GeoLLM: fine-tune LLMs on prompts that include coordinates, reverse-geocoded address, and closest nearby places from OpenStreetMap.
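The prompt structure described above can be sketched as follows. This is a minimal illustration, not the authors' template: the field labels, the `NearbyPlace` type, the `build_prompt` helper, and the example location are all hypothetical, and the reverse geocoding and OpenStreetMap lookup are assumed to have been done elsewhere.

```python
from dataclasses import dataclass


@dataclass
class NearbyPlace:
    """One nearby place, assumed to come from an OpenStreetMap lookup."""
    name: str
    distance_km: float
    direction: str  # compass direction from the query point, e.g. "North"


def build_prompt(lat, lon, address, places, task, max_places=10):
    """Assemble a map-context prompt for one location (hypothetical layout)."""
    # Keep only the closest places so the prompt stays short.
    closest = sorted(places, key=lambda p: p.distance_km)[:max_places]
    lines = [
        f"Coordinates: ({lat:.5f}, {lon:.5f})",
        f"Address: {address}",
        "Nearby Places:",
    ]
    for place in closest:
        lines.append(f"{place.distance_km:.1f} km {place.direction}: {place.name}")
    # The fine-tuned model is trained to complete the line after the task name.
    lines.append(f"{task}:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    example_places = [
        NearbyPlace("Example Market", 1.2, "South"),
        NearbyPlace("Example School", 0.4, "North-East"),
    ]
    print(build_prompt(1.0, 2.0, "Example Road, Example Town",
                       example_places, "Population Density"))
```

Sorting by distance before truncating keeps the closest places when a location has more candidates than `max_places`.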

Key Findings

GeoLLM yields large gains over prompt-based and classic baselines on real geospatial tasks.

Numbers: 70% improvement in Pearson's r^2 vs nearest-neighbor/XGBoost baselines (paper claim)

Practical Use: If you fine-tune an LLM with map-context prompts, expect substantially better correlation with ground truth than simple geospatial baselines on similar tasks.

Evidence Ref: Abstract; Conclusions

GPT-3.5 is the best model tested and outperforms smaller LLMs.

Numbers: GPT-3.5 beats Llama 2 by 19% and RoBERTa by 51% in mean r^2 across evaluated tasks

Practical Use: Use larger, better-pretrained LLMs when possible, because performance scales with model and pretraining size.

Evidence Ref: Abstract; Section 4.3; Fig. 3

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Pearson's r^2 (Population, 10k samples)	0.78 (GPT-3.5)	k-NN 0.36; Nightlight 0.68	+0.42 vs k-NN; +0.10 vs Nightlight	WorldPop / global	Table 1: Population 10,000 samples row	Table 1
Pearson's r^2 (Asset Wealth, 10k samples)	0.75 (GPT-3.5)	Nightlight 0.55; XGBoost-FT 0.64	+0.20 vs Nightlight; +0.11 vs XGBoost-FT	DHS (SustainBench)	Table 1 asset wealth row	Table 1

What To Try In 7 Days

Fine-tune a hosted LLM (e.g., GPT-3.5) on 1,000 labeled points using prompts that include reverse-geocoded address + 10 nearby places.

Compare predictions to a nightlight baseline and a simple XGBoost on prompt fields to check uplift.

Run an ablation: drop nearby-places from prompts to confirm sensitivity and measure error change.
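For the comparison and ablation steps above, a small scoring helper is enough. The sketch below computes the two metrics this page names, Pearson's r^2 and mean absolute error, using only the standard library. The function names and the `compare_models` helper are illustrative assumptions, not part of the GeoLLM code.

```python
import math


def pearson_r2(y_true, y_pred):
    """Squared Pearson correlation between labels and predictions."""
    n = len(y_true)
    if n != len(y_pred) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_t = math.fsum(y_true) / n
    mean_p = math.fsum(y_pred) / n
    cov = math.fsum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = math.fsum((t - mean_t) ** 2 for t in y_true)
    var_p = math.fsum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return (cov * cov) / (var_t * var_p)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two equal-length, non-empty sequences")
    return math.fsum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def compare_models(y_true, preds_by_model):
    """Score several prediction sets (e.g. full prompt, ablated prompt, baseline)."""
    return {
        name: {"r2": pearson_r2(y_true, preds),
               "mae": mean_absolute_error(y_true, preds)}
        for name, preds in preds_by_model.items()
    }
```

Pearson's r^2 is unchanged by any linear rescaling or offset of the predictions, so reporting MAE alongside it catches models that rank locations well but predict on the wrong scale.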

Optimization Features

Token Efficiency

Use a concise prompt template to limit token length

Model Optimization

LoRA

Training Optimization

LoRA

Mixed precision and gradient checkpointing for Llama 2

Reproducibility

Code Available: Yes

Data Available: Yes

Open Source Status: Partial

License: Unknown

Code URLs

https://rohinmanvi.github.io/GeoLLM

Data URLs

https://www.worldpop.org

https://www.dhsprogram.com

https://data.census.gov

https://www.zillow.com/research/data/

Risks & Boundaries

Limitations

Performance drops when dataset labels are spatially 'jittered' (DHS coordinate noise).

Method relies on quality of map data; OpenStreetMap is free but noisier than paid maps.
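One way to gauge the jitter limitation on your own data is to perturb the coordinates of clean labels before building prompts and then re-measure error. The sketch below displaces a point by a random distance in a random direction. The DHS Program documents displacement of up to 2 km for urban clusters and up to 5 km for most rural clusters, but the uniform sampling, the `jitter` helper, and the small-displacement approximation here are assumptions for illustration, not the DHS procedure.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0088


def jitter(lat, lon, max_km, rng):
    """Move a point up to max_km in a random direction (flat-earth approximation)."""
    distance = rng.uniform(0.0, max_km)
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    # Convert the north and east offsets from kilometres to degrees.
    dlat = (distance * math.cos(bearing)) / EARTH_RADIUS_KM
    dlon = (distance * math.sin(bearing)) / (
        EARTH_RADIUS_KM * math.cos(math.radians(lat))
    )
    return lat + math.degrees(dlat), lon + math.degrees(dlon)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a sensitivity run is repeatable
    new_lat, new_lon = jitter(10.0, 20.0, max_km=5.0, rng=rng)
    print(haversine_km(10.0, 20.0, new_lat, new_lon))
```

The approximation is adequate for displacements of a few kilometres away from the poles; it is not suitable for large offsets or for latitudes near 90 degrees.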

When Not To Use

When you need per-building or street-level visual detail only visible in imagery.

When exact, auditable data provenance is required from primary sensors.

Failure Modes

Hallucination or incorrect mapping of coordinates if prompt context is poor.

Bias: worse absolute error in underrepresented or sparsely populated regions.

Core Entities

Models

GPT-3.5, Llama 2 (7B), RoBERTa (base), GPT-2, LoRA

Metrics

Pearson's r^2, Mean absolute error (MAE)

Datasets

WorldPop, DHS (SustainBench), USCB (Census/ACS), Zillow (ZHVI)

Benchmarks

GeoLLM benchmark (this paper), SustainBench, WorldPop population mosaic
