900-question spatial benchmark finds gpt-4o leads; Chain-of-Thought and one-shot prompts can sharply boost performance

August 26, 2024 · 8 min

Overview

Decision Snapshot: Needs Validation

The study gives actionable comparisons and clear numeric results; prompt effects are large and reproducible in the paper's settings.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du

Links

Abstract / PDF / Data

Why It Matters For Business

Model choice and prompt style strongly affect outcomes on spatial tasks; careful model selection plus prompt tuning can turn unusable answers into usable ones.

Who Should Care

Summary TLDR

The authors built a 900-question spatial benchmark across 12 GIS task types and evaluated six LLMs (gpt-3.5-turbo, gpt-4-turbo-2024-04-09, gpt-4o, claude-3-sonnet-20240229, moonshot-v1-8k, glm-4). Zero-shot results, reported as weighted accuracy (WA), show gpt-4o best at 71.3%, followed by gpt-4-turbo 69.7%, glm-4 62.4%, claude 62.1%, moonshot 53.2%, and gpt-3.5 43.8%. Models do well on factual and code-explanation tasks but fail on complex reasoning such as route planning (gpt-4o zero-shot 12.4%). Prompt tuning fixes specific failures: CoT raised gpt-4o route planning to 87.5%; one-shot raised moonshot NL2API mapping from 10.1% to 76.3%. The dataset is text-only and expert-validated. Use this benchmark to choose models and prompt strategies by

Problem Statement

We lack a systematic, multi-task benchmark that measures LLM abilities on practical spatial/GIS problems. The paper builds a 900-question dataset across 12 spatial task types and tests several leading models and prompt strategies to reveal capability gaps and prompt sensitivity.

Main Contribution

A 900-question, expert-validated spatial dataset covering 12 task types (foundational, analysis, application)

Two-phase evaluation of six LLMs: zero-shot then targeted prompt tuning (One-shot, Combined, CoT, Zero-shot-CoT)
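The two-phase protocol can be pictured with a small harness. This is a minimal sketch under stated assumptions, not the authors' evaluation code: `ask_model`, the record keys, the exact-match scoring, and the 0.5 threshold for "weak" categories are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Question = Dict[str, str]        # hypothetical keys: "category", "text", "gold"
Model = Callable[[str], str]     # hypothetical: prompt in, answer text out
Wrapper = Callable[[str], str]   # turns a question into a tuned prompt


def accuracy_by_category(questions: List[Question], ask_model: Model,
                         wrap: Wrapper = lambda text: text) -> Dict[str, float]:
    """Fraction correct per category, using exact match (an assumption)."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for q in questions:
        answer = ask_model(wrap(q["text"]))
        totals[q["category"]] += 1
        hits[q["category"]] += int(answer.strip() == q["gold"])
    return {cat: hits[cat] / totals[cat] for cat in totals}


def two_phase(questions: List[Question], ask_model: Model,
              strategies: Dict[str, Wrapper],
              threshold: float = 0.5) -> Tuple[Dict[str, float], Dict[str, Dict[str, float]]]:
    """Phase 1: zero-shot on all questions. Phase 2: retry weak categories per strategy."""
    zero_shot = accuracy_by_category(questions, ask_model)
    weak = {cat for cat, acc in zero_shot.items() if acc < threshold}
    retry = [q for q in questions if q["category"] in weak]
    tuned = {name: accuracy_by_category(retry, ask_model, wrap)
             for name, wrap in strategies.items()}
    return zero_shot, tuned
```

Passing a stub function as `ask_model` lets the harness be checked offline before any real model client is wired in.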

Key Findings

gpt-4o leads in zero-shot overall accuracy across the 900-question benchmark.

Numbers: gpt-4o WA = 71.3% (Table 1)

Practical Use: For general spatial text tasks, prefer gpt-4o over older models when cost permits; expect roughly 70% weighted accuracy out of the box.

Evidence Ref: Table 1, Section 4

All models excel at conceptual and code-explanation tasks.

Numbers: Code explanation WA = 100% for gpt-4o/gpt-4-turbo/claude (Table 2)

Practical Use: Use modern LLMs for GIS concept Q&A and code review tasks; minimal prompt engineering needed for high-quality answers.

Evidence Ref: Table 2, Section 4.1

Results

| Metric | Value | Baseline | Delta | Split / Dataset | Evidence | Evidence Ref |
| --- | --- | --- | --- | --- | --- | --- |
| Accuracy | gpt-4o 71.3%; gpt-4-turbo 69.7%; glm-4 62.4%; claude 62.1%; moonshot 53.2%; gpt-3.5 43.8% | | | 900-question spatial dataset (all tasks) | Table 1; Section 4 | Table 1 |
| Simple route planning WA (zero-shot vs tuned) | gpt-4o: 12.4% → 87.5% with CoT; gpt-4-turbo: 8.7% → higher with CoT (Section 4.3) | zero-shot per Table 2 | gpt-4o +75.1pp | Simple route planning category | Section 4.3, Figure 6 | Figure 6 |
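The deltas follow from plain subtraction of the reported percentages. The sketch below reproduces that arithmetic from the figures quoted in this summary; the helper name `delta_pp` and the dictionary are illustrative. The moonshot one-shot gain and the top-to-bottom spread are derived here rather than quoted from the paper.

```python
# Zero-shot weighted accuracy (WA, %) as reported in the Results rows of this summary.
zero_shot_wa = {
    "gpt-4o": 71.3,
    "gpt-4-turbo": 69.7,
    "glm-4": 62.4,
    "claude": 62.1,
    "moonshot": 53.2,
    "gpt-3.5": 43.8,
}


def delta_pp(tuned: float, baseline: float) -> float:
    """Difference between two percentages, in percentage points (1 decimal)."""
    return round(tuned - baseline, 1)


# Models ordered from best to worst zero-shot WA.
ranking = sorted(zero_shot_wa, key=zero_shot_wa.get, reverse=True)

route_gain = delta_pp(87.5, 12.4)    # gpt-4o simple route planning: CoT vs zero-shot
nl2api_gain = delta_pp(76.3, 10.1)   # moonshot NL2API mapping: one-shot vs zero-shot
spread = delta_pp(zero_shot_wa["gpt-4o"], zero_shot_wa["gpt-3.5"])

print(ranking)
print(route_gain, nl2api_gain, spread)
```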

What To Try In 7 Days

Run gpt-4o on your text-only GIS QA and code-review tasks and check whether baseline WA lands near 70%

For sequential spatial reasoning, test Chain-of-Thought prompts and validate outputs automatically

If using a weaker model for API-generation, give one clear example (one-shot) to align outputs with your API format
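The prompt styles mentioned above can be sketched as plain string templates. This is an illustration only: the exact prompt wording used in the paper is not given in this summary, the "Combined" strategy is omitted because its definition is not stated here, and the trigger sentence in `zero_shot_cot` is the commonly used zero-shot CoT phrase rather than a quotation from the authors.

```python
def zero_shot(question: str) -> str:
    """Send the question unchanged."""
    return question


def zero_shot_cot(question: str) -> str:
    """Append a reasoning trigger without providing any worked example."""
    return f"{question}\nLet's think step by step."


def one_shot(question: str, example_q: str, example_a: str) -> str:
    """Show one solved example so the model copies the answer format."""
    return f"Q: {example_q}\nA: {example_a}\n\nQ: {question}\nA:"


def cot(question: str, example_q: str, example_reasoning: str, example_a: str) -> str:
    """Show one example that includes intermediate reasoning before the answer."""
    return (
        f"Q: {example_q}\n"
        f"Reasoning: {example_reasoning}\n"
        f"A: {example_a}\n\n"
        f"Q: {question}\nReasoning:"
    )
```

For API-generation tasks, the one-shot example answer would be a single call in your own API's format, so the model sees the exact function name and argument order it is expected to produce.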

Reproducibility

Code Available: No
Data Available: Yes
Open Source Status: Partial
License: Unknown

Risks & Boundaries

Limitations

Dataset is text-only; no multimodal (image/map) test cases.

Task categories are not exhaustive (e.g., POI recommendation, vector analysis omitted).

When Not To Use

Do not use zero-shot outputs for live route planning or navigation without verification.

Do not treat this dataset as a full multimodal GIS benchmark.
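One way to verify a model-proposed route before using it is to check it against a known road graph. The sketch below assumes the route has already been parsed into a list of node names and that the network is available as an adjacency mapping; both representations are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set


def is_valid_route(route: List[str], graph: Dict[str, Set[str]],
                   start: str, goal: str) -> bool:
    """True if the route starts and ends correctly and every hop is an edge in the graph."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    # Every consecutive pair of nodes must be directly connected.
    return all(b in graph.get(a, set()) for a, b in zip(route, route[1:]))


# Hypothetical three-node network: A - B - C
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(is_valid_route(["A", "B", "C"], graph, "A", "C"))
print(is_valid_route(["A", "C"], graph, "A", "C"))
```

This check only confirms that a route is traversable; it does not confirm that the route is the shortest or otherwise optimal.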

Failure Modes

Hallucinated or non-executable code in code-generation tasks if not validated.

Incorrect sequential reasoning for route planning without CoT prompts.
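For the code-generation failure mode, a cheap first filter is to confirm that generated Python parses and only calls functions you actually expose. The sketch below uses the standard-library `ast` module; the allow-list approach and the function names in the example are assumptions for illustration, and passing this check does not prove the code is correct at runtime.

```python
import ast
from typing import Set


def python_syntax_ok(source: str) -> bool:
    """True if the source parses as Python."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


def called_names(source: str) -> Set[str]:
    """Names of all functions or methods called anywhere in the source."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):
                names.add(func.id)
            elif isinstance(func, ast.Attribute):
                names.add(func.attr)
    return names


def uses_only_allowed_calls(source: str, allowed: Set[str]) -> bool:
    """Reject code that fails to parse or calls anything outside the allow-list."""
    return python_syntax_ok(source) and called_names(source) <= allowed


# Hypothetical allow-list containing a single geometry helper.
print(uses_only_allowed_calls("buffer(geom, 10)", {"buffer"}))
print(uses_only_allowed_calls("magic_overlay(a, b)", {"buffer"}))
```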

Core Entities

Models

gpt-3.5-turbo
gpt-4-turbo-2024-04-09
gpt-4o
claude-3-sonnet-20240229
moonshot-v1-8k
glm-4

Metrics

Accuracy
Count S2/S1/S0 (score counts)
Per-category WA
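This summary does not state how WA is derived from the score counts. One plausible reading, shown here purely as an assumption, is that S2, S1, and S0 count answers scored 2 (correct), 1 (partially correct), and 0 (incorrect), and that WA is the points earned divided by the maximum possible points. Check the paper's own definition before relying on this formula.

```python
def weighted_accuracy(s2: int, s1: int, s0: int) -> float:
    """Assumed formula: (2*S2 + 1*S1 + 0*S0) / (2 * total answers)."""
    total = s2 + s1 + s0
    if total == 0:
        raise ValueError("no scored answers")
    return (2 * s2 + s1) / (2 * total)


# Illustrative counts only, not taken from the paper.
print(weighted_accuracy(1, 2, 1))
```

Under this assumed scoring, a partially correct answer is worth half of a fully correct one, so WA is always at least as high as strict accuracy computed from S2 alone.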

Datasets

900-question spatial dataset (this paper, 12 categories)
CALVIN (used as source data)
PPNL (path planning benchmark referenced)
GRASP (referenced)
STBench (referenced)
CityBench (referenced)

Benchmarks

This paper's 900-question spatial benchmark
CALVIN
PPNL
GRASP
STBench
CityBench