900-question spatial benchmark finds gpt-4o leads; Chain-of-Thought and one-shot prompts can sharply boost performance

Overview

Decision Snapshot: Needs Validation

The study gives actionable comparisons and clear numeric results; prompt effects are large and reproducible in the paper's settings.

Citations: 4

Evidence Strength: 0.80

Confidence: 0.80

Risk Signals: 11

Trust Signals

Findings with numeric evidence: 7/7

Findings with evidence refs: 7/7

Results with explicit delta: 2/4

Reproducibility

Status: Partial assets available

Open source: Partial

At A Glance

Cost impact: 60%

Production readiness: 60%

Novelty: 40%

Authors

Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du

Links

Abstract / PDF / Data

Why It Matters For Business

Model choice and prompt style strongly change outcomes on spatial tasks; careful selection plus prompt tuning turns unusable answers into operable outputs.

Who Should Care

Product Manager, ML Engineer, Data Scientist, Engineering Lead, CTO

Summary TLDR

The authors built a 900-question spatial benchmark across 12 GIS task types and evaluated six LLMs (gpt-3.5-turbo, gpt-4-turbo-2024-04-09, gpt-4o, claude-3-sonnet-20240229, moonshot-v1-8k, glm-4). Zero-shot results show gpt-4o best (WA 71.3%), gpt-4-turbo 69.7%, glm-4 62.4%, claude 62.1%, moonshot 53.2%, gpt-3.5 43.8%. Models do well on factual and code-explanation tasks but fail on complex reasoning like route planning (gpt-4o zero-shot 12.4%). Prompt tuning fixes specific failures: CoT raised gpt-4o route-planning to 87.5%; one-shot raised moonshot NL2API mapping from 10.1% to 76.3%. The dataset is text-only and expert-validated. Use this benchmark to choose models and prompt strategies by

Problem Statement

We lack a systematic, multi-task benchmark that measures LLM abilities on practical spatial/GIS problems. The paper builds a 900-question dataset across 12 spatial task types and tests several leading models and prompt strategies to reveal capability gaps and prompt sensitivity.

Main Contribution

A 900-question, expert-validated spatial dataset covering 12 task types (foundational, analysis, application)

Two-phase evaluation of six LLMs: zero-shot then targeted prompt tuning (One-shot, Combined, CoT, Zero-shot-CoT)

Key Findings

gpt-4o leads in zero-shot overall accuracy across the 900-question benchmark.

Numbers: gpt-4o WA = 71.3% (Table 1)

Practical Use: For general spatial text tasks, prefer gpt-4o over older models when cost permits; expect roughly 70% weighted accuracy out of the box.

Evidence Ref: Table 1, Section 4

All models excel at conceptual and code-explanation tasks.

Numbers: Code explanation WA = 100% for gpt-4o/gpt-4-turbo/claude (Table 2)

Practical Use: Use modern LLMs for GIS concept Q&A and code review tasks; minimal prompt engineering needed for high-quality answers.

Evidence Ref: Table 2, Section 4.1

Results

Metric	Value	Baseline	Delta	Split / Dataset	Evidence	Evidence Ref
Overall WA (zero-shot)	gpt-4o 71.3%; gpt-4-turbo 69.7%; glm-4 62.4%; claude 62.1%; moonshot 53.2%; gpt-3.5 43.8%	—	—	900-question spatial dataset (all tasks)	Table 1; Section 4	Table 1
Simple route planning WA (zero-shot vs tuned)	gpt-4o: 12.4% → 87.5% with CoT; gpt-4-turbo: 8.7% → higher with CoT (Section 4.3)	zero-shot per Table 2	gpt-4o +75.1pp	Simple route planning category	Section 4.3, Figure 6	Figure 6
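The deltas are plain percentage-point differences between tuned and zero-shot WA. A minimal sketch of that arithmetic, using only figures quoted on this page (the moonshot NL2API numbers come from the Summary, not the table); the helper name `pp_delta` is illustrative and not from the paper:

```python
def pp_delta(baseline_pct: float, tuned_pct: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(tuned_pct - baseline_pct, 1)


# Figures quoted in this summary.
gpt4o_route = pp_delta(12.4, 87.5)   # gpt-4o, simple route planning, CoT
moonshot_api = pp_delta(10.1, 76.3)  # moonshot, NL2API mapping, one-shot

print(f"gpt-4o route planning: +{gpt4o_route}pp")
print(f"moonshot NL2API mapping: +{moonshot_api}pp")
```

Reporting the change in percentage points rather than as a ratio keeps results comparable when the zero-shot baseline is very low.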

What To Try In 7 Days

Run gpt-4o on your text-only GIS QA and code-review tasks to check baseline WA ~70%

For sequential spatial reasoning, test Chain-of-Thought prompts and validate outputs automatically

If using a weaker model for API-generation, give one clear example (one-shot) to align outputs with your API format
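The prompt strategies named above can be sketched as plain string templates. This is an illustration under stated assumptions, not the paper's prompts: the `Q:`/`A:` layout, the function name `build_prompt`, and the example route question are all hypothetical. The cue "Let's think step by step." is the usual zero-shot-CoT trigger; the paper's exact wording may differ. The "Combined" strategy is not covered here because this page does not describe it.

```python
from typing import Optional, Tuple

# Common zero-shot-CoT cue; the paper's exact wording may differ.
ZERO_SHOT_COT_CUE = "Let's think step by step."


def build_prompt(question: str,
                 strategy: str = "zero-shot",
                 example: Optional[Tuple[str, str]] = None) -> str:
    """Return a prompt string for one of the named strategies.

    For "one-shot", `example` is a (question, answer) pair.
    For "cot", `example` is a (question, worked reasoning + answer) pair.
    """
    if strategy == "zero-shot":
        return f"Q: {question}\nA:"
    if strategy == "zero-shot-cot":
        return f"Q: {question}\nA: {ZERO_SHOT_COT_CUE}"
    if strategy in ("one-shot", "cot"):
        if example is None:
            raise ValueError(f"{strategy} needs a worked example")
        ex_q, ex_a = example
        return f"Q: {ex_q}\nA: {ex_a}\n\nQ: {question}\nA:"
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical route question and worked example, for illustration only.
demo = (
    "From A, go north 2 blocks, then east 1 block. Where are you relative to A?",
    "North 2 puts me 2 blocks north of A. East 1 then puts me 1 block east. "
    "Answer: 2 blocks north and 1 block east of A.",
)
prompt = build_prompt("From B, go south 1 block, then west 3 blocks. "
                      "Where are you relative to B?", "cot", demo)
print(prompt)
```

Keeping every strategy behind one function makes it easy to run the same question set through each variant and compare WA per category.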

Reproducibility

Code Available: No

Data Available: Yes

Open Source Status: Partial

License: Unknown

Data URLs

https://figshare.com/s/be55522f22bf761cfcab

Risks & Boundaries

Limitations

Dataset is text-only; no multimodal (image/map) test cases.

Task categories are not exhaustive (e.g., POI recommendation, vector analysis omitted).

When Not To Use

Do not use zero-shot outputs for live route planning or navigation without verification.

Do not treat this dataset as a full multimodal GIS benchmark.

Failure Modes

Hallucinated or non-executable code in code-generation tasks if not validated.

Incorrect sequential reasoning for route planning without CoT prompts.
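One way to catch the route-planning failure mode automatically is to check each model-proposed route against a known road graph before using it. A minimal sketch, assuming the model's answer has already been parsed into an ordered list of node names; the graph, the node names, and `is_valid_route` are hypothetical and not part of the paper:

```python
from typing import Dict, List, Set


def is_valid_route(graph: Dict[str, Set[str]],
                   route: List[str],
                   start: str,
                   goal: str) -> bool:
    """True if the route starts at `start`, ends at `goal`,
    and every consecutive pair of nodes is a directed edge in `graph`."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    return all(b in graph.get(a, set()) for a, b in zip(route, route[1:]))


# Hypothetical toy road network (directed adjacency sets).
roads = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
}

print(is_valid_route(roads, ["A", "B", "C", "D"], "A", "D"))  # connected path
print(is_valid_route(roads, ["A", "C", "D"], "A", "D"))       # skips B
```

This only checks that a route is feasible, not that it is the shortest; an optimality check would need a reference path from a routing engine.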

Core Entities

Models

gpt-3.5-turbo, gpt-4-turbo-2024-04-09, gpt-4o, claude-3-sonnet-20240229, moonshot-v1-8k, glm-4

Metrics

Accuracy, Count S2/S1/S0 (score counts), Per-category WA
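This page does not give the WA formula. The sketch below assumes one plausible reading: S2, S1 and S0 count fully correct, partially correct and incorrect answers, with partial answers earning half credit. The weights are an assumption exposed as parameters, and the counts are hypothetical; check the paper's definition before relying on it.

```python
def weighted_accuracy(s2: int, s1: int, s0: int,
                      w2: float = 1.0, w1: float = 0.5, w0: float = 0.0) -> float:
    """Weighted accuracy in percent from score counts.

    ASSUMPTION: default weights (1.0 / 0.5 / 0.0) are illustrative,
    not confirmed by this summary.
    """
    total = s2 + s1 + s0
    if total == 0:
        raise ValueError("no scored answers")
    return 100.0 * (w2 * s2 + w1 * s1 + w0 * s0) / total


# Hypothetical counts for one category (not from the paper).
wa = weighted_accuracy(s2=50, s1=20, s0=5)
print(f"WA = {wa:.1f}%")
```

Computing WA per category as well as overall matters here, since the overall figure hides very low scores on categories such as route planning.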

Datasets

900-question spatial dataset (this paper, 12 categories); CALVIN (used as source data); PPNL (path planning benchmark referenced); GRASP (referenced); STBench (referenced); CityBench (referenced)

Benchmarks

This paper's 900-question spatial benchmark, CALVIN, PPNL, GRASP, STBench, CityBench
