Overview
Production Readiness
0.6
Novelty Score
0.4
Cost Impact Score
0.6
Citation Count
4
Why It Matters For Business
Model choice and prompt style strongly change outcomes on spatial tasks; careful model selection plus prompt tuning can turn unusable answers into usable outputs.
Summary TLDR
The authors built a 900-question spatial benchmark across 12 GIS task types and evaluated six LLMs (gpt-3.5-turbo, gpt-4-turbo-2024-04-09, gpt-4o, claude-3-sonnet-20240229, moonshot-v1-8k, glm-4). Zero-shot results show gpt-4o best (WA 71.3%), gpt-4-turbo 69.7%, glm-4 62.4%, claude 62.1%, moonshot 53.2%, gpt-3.5 43.8%. Models do well on factual and code-explanation tasks but fail on complex reasoning like route planning (gpt-4o zero-shot 12.4%). Prompt tuning fixes specific failures: CoT raised gpt-4o route-planning to 87.5%; one-shot raised moonshot NL2API mapping from 10.1% to 76.3%. The dataset is text-only and expert-validated. Use this benchmark to choose models and prompt strategies by
Problem Statement
We lack a systematic, multi-task benchmark that measures LLM abilities on practical spatial/GIS problems. The paper builds a 900-question dataset across 12 spatial task types and tests several leading models and prompt strategies to reveal capability gaps and prompt sensitivity.
Main Contribution
A 900-question, expert-validated spatial dataset covering 12 task types (foundational, analysis, application)
Two-phase evaluation of six LLMs: zero-shot then targeted prompt tuning (One-shot, Combined, CoT, Zero-shot-CoT)
A difficulty-based split (easy/medium/difficult) derived from model performance
A weighted accuracy (WA) metric and per-task scoring (0/1/2) for consistent comparison
Concrete examples showing prompt strategies can convert failure into usable results
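The difficulty split listed above can be illustrated with a minimal sketch. The summary does not state how the authors derived the split, so the rule below (mean normalized score across models) and both cut-offs are assumptions chosen for illustration only; `difficulty`, `EASY_MIN`, and `MEDIUM_MIN` are hypothetical names.

```python
from statistics import mean

# Hypothetical cut-offs: the source does not state the thresholds it used.
EASY_MIN = 2 / 3
MEDIUM_MIN = 1 / 3


def difficulty(scores_by_model: dict[str, int]) -> str:
    """Bucket one question by the mean normalized score across models.

    scores_by_model maps a model name to that model's 0/1/2 score
    on the question.
    """
    if not scores_by_model:
        raise ValueError("need at least one model score")
    # Map each 0/1/2 score onto the 0..1 range, then average over models.
    normalized = mean(score / 2 for score in scores_by_model.values())
    if normalized >= EASY_MIN:
        return "easy"
    if normalized >= MEDIUM_MIN:
        return "medium"
    return "difficult"
```

A question that most models answer fully lands in "easy", while one that nearly all models miss lands in "difficult"; moving the cut-offs changes the bucket sizes, so they would need to be matched to whatever rule the paper actually applies.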
Key Findings
gpt-4o leads in zero-shot overall accuracy across the 900-question benchmark (WA 71.3%, ahead of gpt-4-turbo at 69.7%).
All models excel at conceptual and code-explanation tasks.
Models generally fail at simple route planning and other high-level spatial reasoning in zero-shot; gpt-4o scores only 12.4% on route planning.
Chain-of-Thought (CoT) prompting can convert route-planning failure into success for some models: it raised gpt-4o from 12.4% to 87.5%.
One-shot prompting can massively improve NL2API mapping for weaker models: moonshot-v1-8k rose from 10.1% to 76.3%.
Toponym (place-name) recognition varies by model; lightweight models can excel.
Dataset difficulty split highlights where models improve or fail.
Results
Accuracy
Simple route planning WA (zero-shot vs tuned)
NL2API Mapping (first type: Mapbox link) - one-shot gains
Per-category extremes
Who Should Care
What To Try In 7 Days
Run gpt-4o on your text-only GIS QA and code-review tasks and check whether your baseline WA lands near the reported ~70%
For sequential spatial reasoning, test Chain-of-Thought prompts and validate outputs automatically
If using a weaker model for API-generation, give one clear example (one-shot) to align outputs with your API format
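To try the prompt strategies named above, a small prompt builder is enough. This is a sketch under assumptions: the exact prompt wording used in the paper is not given in this summary, the function name `build_prompt` and the "Example question / Example answer" layout are hypothetical, and the zero-shot-CoT suffix is the commonly used "Let's think step by step." trigger rather than a confirmed detail of the paper.

```python
# Commonly used zero-shot-CoT trigger phrase (assumed, not confirmed by the source).
ZERO_SHOT_COT_SUFFIX = "Let's think step by step."


def build_prompt(question, strategy="zero-shot", example=None):
    """Return a single-turn prompt string for one prompting strategy.

    example is an optional (question, answer) pair. For "one-shot" the
    answer shows only the target output format; for "cot" the answer
    should spell out the intermediate reasoning steps.
    """
    if strategy == "zero-shot":
        return question
    if strategy == "zero-shot-cot":
        return f"{question}\n{ZERO_SHOT_COT_SUFFIX}"
    if strategy in ("one-shot", "cot"):
        if example is None:
            raise ValueError(f"strategy {strategy!r} needs an example pair")
        example_question, example_answer = example
        return (
            f"Example question: {example_question}\n"
            f"Example answer: {example_answer}\n\n"
            f"Question: {question}\n"
            "Answer:"
        )
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Keeping the strategy as a parameter makes it easy to run the same question set under each strategy and compare scores per model, since the findings above suggest the best strategy differs by model and task.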
Reproducibility
Data Available
Open Source Status
- partial
Risks & Boundaries
Limitations
- Dataset is text-only; no multimodal (image/map) test cases.
- Task categories are not exhaustive (e.g., POI recommendation, vector analysis omitted).
- Single-turn testing with one sample per question at temperature 1, so run-to-run variability is not measured.
- Prompt strategies tested are limited; more advanced sampling or decomposition methods were not tried.
When Not To Use
- Do not use zero-shot outputs for live route planning or navigation without verification.
- Do not treat this dataset as a full multimodal GIS benchmark.
- Do not assume results transfer to image/map inputs or multi-turn agent settings.
Failure Modes
- Hallucinated or non-executable code in code-generation tasks if not validated.
- Incorrect sequential reasoning for route planning without CoT prompts.
- Inconsistent toponym disambiguation across models and local place names.
- Model-specific negative responses to some prompt strategies (performance can drop).
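The first failure mode above can be screened cheaply before any generated code is run. The sketch below assumes the model output is Python source; it checks only that the code parses and that its top-level imports resolve in the current environment, not that the code is correct. The helper names `is_valid_python` and `missing_imports` are hypothetical.

```python
import ast
import importlib.util


def is_valid_python(source: str) -> bool:
    """Return True if the text parses as Python source."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def missing_imports(source: str) -> list[str]:
    """List top-level modules imported by the code but not installed here.

    A non-empty result can indicate a hallucinated library name.
    """
    tree = ast.parse(source)
    names = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    # find_spec returns None when a top-level module cannot be located.
    return sorted(n for n in names if importlib.util.find_spec(n) is None)
```

Both checks are static, so they are safe to run on untrusted output; executing the generated code would still need a sandbox and task-specific assertions.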
Core Entities
Models
- gpt-3.5-turbo
- gpt-4-turbo-2024-04-09
- gpt-4o
- claude-3-sonnet-20240229
- moonshot-v1-8k
- glm-4
Metrics
- Accuracy
- Count S2/S1/S0 (score counts)
- Per-category WA
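The metrics above can be computed from per-question scores with a few lines. This summary does not give the WA formula, so the definition below (points earned divided by the maximum possible, with 2 = fully correct, 1 = partially correct, 0 = wrong) is an assumption, and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def weighted_accuracy(scores):
    """Assumed WA: points earned over the maximum possible (2 per question)."""
    if not scores:
        raise ValueError("no scores given")
    return sum(scores) / (2 * len(scores))


def score_counts(scores):
    """Count how many questions received each score (S2, S1, S0)."""
    counts = Counter(scores)
    return {"S2": counts[2], "S1": counts[1], "S0": counts[0]}


def per_category_wa(records):
    """Compute WA per category from (category, score) pairs."""
    by_category = defaultdict(list)
    for category, score in records:
        by_category[category].append(score)
    return {cat: weighted_accuracy(vals) for cat, vals in by_category.items()}
```

Reporting the S2/S1/S0 counts alongside WA is useful because two models with the same WA can differ in how often they are partially versus fully correct.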
Datasets
- 900-question spatial dataset (this paper, 12 categories)
- CALVIN (used as source data)
- PPNL (path planning benchmark referenced)
- GRASP (referenced)
- STBench (referenced)
- CityBench (referenced)
Benchmarks
- This paper's 900-question spatial benchmark
- CALVIN
- PPNL
- GRASP
- STBench
- CityBench

