900-question spatial benchmark finds gpt-4o leads; Chain-of-Thought and one-shot prompts can sharply boost performance

August 26, 2024 · 8 min

Overview

Production Readiness

0.6

Novelty Score

0.4

Cost Impact Score

0.6

Citation Count

4

Authors

Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du

Why It Matters For Business

Model choice and prompt style strongly change outcomes on spatial tasks; careful selection plus prompt tuning turns unusable answers into operable outputs.

Summary TLDR

The authors built a 900-question spatial benchmark across 12 GIS task types and evaluated six LLMs (gpt-3.5-turbo, gpt-4-turbo-2024-04-09, gpt-4o, claude-3-sonnet-20240229, moonshot-v1-8k, glm-4). Zero-shot results show gpt-4o best (WA 71.3%), gpt-4-turbo 69.7%, glm-4 62.4%, claude 62.1%, moonshot 53.2%, gpt-3.5 43.8%. Models do well on factual and code-explanation tasks but fail on complex reasoning like route planning (gpt-4o zero-shot 12.4%). Prompt tuning fixes specific failures: CoT raised gpt-4o route-planning to 87.5%; one-shot raised moonshot NL2API mapping from 10.1% to 76.3%. The dataset is text-only and expert-validated. Use this benchmark to choose models and prompt strategies by

Problem Statement

We lack a systematic, multi-task benchmark that measures LLM abilities on practical spatial/GIS problems. The paper builds a 900-question dataset across 12 spatial task types and tests several leading models and prompt strategies to reveal capability gaps and prompt sensitivity.

Main Contribution

A 900-question, expert-validated spatial dataset covering 12 task types (foundational, analysis, application)

Two-phase evaluation of six LLMs: zero-shot then targeted prompt tuning (One-shot, Combined, CoT, Zero-shot-CoT)

A difficulty-based split (easy/medium/difficult) derived from model performance

A weighted accuracy (WA) metric and per-task scoring (0/1/2) for consistent comparison

Concrete examples showing prompt strategies can convert failure into usable results
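
The weighted accuracy metric and 0/1/2 scoring listed above can be sketched as follows. This summary does not give the exact WA formula, so the sketch assumes that a score of 2 is full credit, 1 is half credit, and 0 is no credit, normalized by the maximum attainable score. The function names and sample scores are hypothetical.

```python
from collections import Counter


def weighted_accuracy(scores):
    """Assumed WA: sum of 0/1/2 scores over the maximum possible (2 per question)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("each score must be 0, 1, or 2")
    return sum(scores) / (2 * len(scores))


def score_counts(scores):
    """Count how many answers received each score (S2/S1/S0)."""
    counts = Counter(scores)
    return {"S2": counts[2], "S1": counts[1], "S0": counts[0]}


# Hypothetical scores for five questions in one task category.
example_scores = [2, 2, 1, 0, 2]
print(weighted_accuracy(example_scores))
print(score_counts(example_scores))
```

Under this assumption a partially correct answer contributes half as much as a fully correct one, which is one plausible reading of "weighted"; the paper's definition should be checked before comparing numbers.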

Key Findings

gpt-4o leads in zero-shot overall accuracy across the 900-question benchmark.

Numbers: gpt-4o WA = 71.3% (Table 1)

All models excel at conceptual and code-explanation tasks.

Numbers: Code explanation WA = 100% for gpt-4o/gpt-4-turbo/claude (Table 2)

Models generally fail at simple route planning and other high-level spatial reasoning in zero-shot.

Numbers: Simple route planning WA: gpt-4o 12.4%, average ≈ 4–12% (Table 2)

Chain-of-Thought (CoT) prompting can convert route-planning failure into success for some models.

Numbers: gpt-4o route-planning: 12.4% → 87.5% with CoT (Section 4.3)

One-shot prompting can massively improve NL2API mapping for weaker models.

Numbers: moonshot NL2API mapping: 10.1% → 76.3% with One-shot (Section 4.3)

Toponym (place-name) recognition varies by model; lightweight models can excel.

Numbers: Toponym WA: moonshot 98%, gpt-4o 92%, glm-4 72% (Table 2)

Dataset difficulty split highlights where models improve or fail.

Numbers: Easy/Medium/Difficult counts: 395/275/230; gpt-4o accuracy on difficult ≈ 0.13 (Section 4.2, Table 3)
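
The CoT and one-shot findings above depend on how the prompt is assembled. The following is a minimal sketch of three of the prompt styles, not the authors' templates: the strategy labels, field names, wording, and sample question are all assumptions made for illustration. The "Combined" strategy is left out because this summary does not define it.

```python
def build_prompt(question, strategy="zero-shot", example=None):
    """Assemble a prompt string for a hypothetical strategy label."""
    if strategy == "zero-shot":
        return f"Question: {question}\nAnswer:"
    if strategy == "zero-shot-cot":
        # Appends a generic reasoning trigger instead of a worked example.
        return f"Question: {question}\nAnswer: Let's think step by step."
    if strategy in ("one-shot", "cot"):
        if example is None:
            raise ValueError(f"{strategy} needs a worked example")
        demo = f"Question: {example['question']}\n"
        if strategy == "cot":
            # CoT shows the intermediate reasoning, not only the final answer.
            demo += f"Reasoning: {example['reasoning']}\n"
        demo += f"Answer: {example['answer']}\n\n"
        return demo + f"Question: {question}\nAnswer:"
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical worked example for a route-planning question.
worked = {
    "question": "Roads: A-B, B-D. How do I get from A to D?",
    "reasoning": "A connects to B, and B connects to D, so go through B.",
    "answer": "Route: A -> B -> D",
}
print(build_prompt("Roads: X-Y, Y-Z. How do I get from X to Z?", "cot", worked))
```

Keeping the strategies behind one function makes it straightforward to run the same question set under each style and compare scores per model.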

Results

Accuracy

Value: gpt-4o 71.3%; gpt-4-turbo 69.7%; glm-4 62.4%; claude 62.1%; moonshot 53.2%; gpt-3.5 43.8%

Simple route planning WA (zero-shot vs tuned)

Value: gpt-4o: 12.4% → 87.5% with CoT; gpt-4-turbo: 8.7% → higher with CoT (Section 4.3)

Baseline: zero-shot per Table 2

NL2API Mapping (first type: Mapbox link) - one-shot gains

Value: moonshot: 10.1% → 76.3% with One-shot; gpt-4o zero-shot 61.9%, one-shot/combined up to 84.7%

Baseline: zero-shot WA (Table 2 / Table 4)

Per-category extremes

Value: Code explanation WA = 100% (gpt-4o/gpt-4-turbo/claude); Spatial understanding is low for many models in zero-shot (gpt-3.5 1.8%)

Who Should Care

What To Try In 7 Days

Run gpt-4o on your text-only GIS QA and code-review tasks to check baseline WA ~70%

For sequential spatial reasoning, test Chain-of-Thought prompts and validate outputs automatically

If using a weaker model for API-generation, give one clear example (one-shot) to align outputs with your API format
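
The advice to validate Chain-of-Thought outputs automatically can be sketched as a simple route checker. This assumes the prompt asks the model to end with a line of the form "Route: A -> B -> D" and that a trusted road graph is available; the graph, the output format, and the function names are hypothetical.

```python
import re

# Hypothetical road graph: node -> set of directly connected nodes.
ROADS = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}


def parse_route(answer):
    """Extract the last 'Route: X -> Y -> Z' line from a model answer."""
    matches = re.findall(r"Route:\s*(.+)", answer)
    if not matches:
        return None
    return [stop.strip() for stop in matches[-1].split("->")]


def is_valid_route(route, graph, start, goal):
    """Check the endpoints and that every consecutive pair is a real edge."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    return all(b in graph.get(a, set()) for a, b in zip(route, route[1:]))


# Hypothetical CoT-style model output.
model_output = (
    "Step 1: from A take the road to B.\n"
    "Step 2: from B take the road to D.\n"
    "Route: A -> B -> D"
)
route = parse_route(model_output)
print(route, is_valid_route(route, ROADS, "A", "D"))
```

A check like this only confirms that the route is traversable, not that it is the shortest or best one; stricter criteria would need to be added for production use.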

Reproducibility

Data Available

Open Source Status

  • partial

Risks & Boundaries

Limitations

  • Dataset is text-only; no multimodal (image/map) test cases.
  • Task categories are not exhaustive (e.g., POI recommendation, vector analysis omitted).
  • Single-turn testing with one sample per question at temperature 1, so reported scores may understate run-to-run variability.
  • Prompt strategies tested are limited; more advanced sampling or decomposition methods were not tried.
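
One way to probe the single-sample limitation is to score several independent runs and report the spread. The sketch below is not part of the paper's protocol: it assumes the same half-credit WA formula (sum of 0/1/2 scores over the maximum), and the per-run scores are made up for illustration.

```python
from statistics import mean, stdev


def weighted_accuracy(scores):
    """Assumed WA: sum of 0/1/2 scores over the maximum possible score."""
    return sum(scores) / (2 * len(scores))


def summarize_runs(runs):
    """Return (mean WA, sample standard deviation) across repeated runs."""
    was = [weighted_accuracy(run) for run in runs]
    spread = stdev(was) if len(was) > 1 else 0.0
    return mean(was), spread


# Hypothetical scores: three runs over the same four questions.
runs = [
    [2, 1, 0, 2],
    [2, 2, 0, 2],
    [1, 1, 0, 2],
]
avg, spread = summarize_runs(runs)
print(f"mean WA = {avg:.3f}, std = {spread:.3f}")
```

If the spread across runs is comparable to the gap between two models or two prompt styles, a single-sample comparison between them should be treated with caution.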

When Not To Use

  • Do not use zero-shot outputs for live route planning or navigation without verification.
  • Do not treat this dataset as a full multimodal GIS benchmark.
  • Do not assume results transfer to image/map inputs or multi-turn agent settings.

Failure Modes

  • Hallucinated or non-executable code in code-generation tasks if not validated.
  • Incorrect sequential reasoning for route planning without CoT prompts.
  • Inconsistent toponym disambiguation across models and local place names.
  • Model-specific negative responses to some prompt strategies (performance can drop).
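
For the first failure mode, a cheap static check can catch some non-executable code before anything is run. This is a minimal sketch that assumes the generated code is Python; it only detects syntax errors and imports of modules that are not installed, and the function name is hypothetical.

```python
import ast
import importlib.util


def check_generated_code(source):
    """Return a list of problems found without executing the code."""
    try:
        tree = ast.parse(source)
    except SyntaxError as err:
        return [f"syntax error on line {err.lineno}: {err.msg}"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            # find_spec returns None when the top-level module cannot be found.
            if importlib.util.find_spec(top_level) is None:
                problems.append(f"module not installed: {top_level}")
    return problems


print(check_generated_code("import math\nprint(math.pi)"))
print(check_generated_code("import surely_not_a_real_module_xyz"))
```

This does not catch hallucinated functions inside a real library or logic errors, so running the code in a sandbox against known inputs remains necessary.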

Core Entities

Models

  • gpt-3.5-turbo
  • gpt-4-turbo-2024-04-09
  • gpt-4o
  • claude-3-sonnet-20240229
  • moonshot-v1-8k
  • glm-4

Metrics

  • Accuracy
  • Count S2/S1/S0 (score counts)
  • Per-category WA

Datasets

  • 900-question spatial dataset (this paper, 12 categories)
  • CALVIN (used as source data)
  • PPNL (path planning benchmark referenced)
  • GRASP (referenced)
  • STBench (referenced)
  • CityBench (referenced)

Benchmarks

  • This paper's 900-question spatial benchmark
  • CALVIN
  • PPNL
  • GRASP
  • STBench
  • CityBench