Li, Fangjun ORCID: 0000-0002-1109-6285
(2025)
Benchmarking and enhancing spatial reasoning in large language models.
PhD thesis, University of Leeds.
Abstract
Spatial reasoning is essential for both human cognition and machine intelligence in understanding and navigating spatial relationships between objects. Despite significant advances in large language models (LLMs) such as ChatGPT, spatial reasoning remains a challenging area. This thesis contributes to addressing this challenge.
Firstly, we analyze the existing benchmarks bAbI, StepGame, SpartQA, and SpaRTUN, providing initial LLM evaluations and examining their limitations. Results on StepGame demonstrate LLMs' proficiency in mapping natural language to spatial relations, while also highlighting challenges in multi-hop reasoning. As an alternative approach, the thesis investigates using LLMs to translate spatial reasoning tasks into a logical format suitable for an answer set programming reasoner. Experiments show that this neuro-symbolic approach achieves near-perfect accuracy on StepGame.
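To make the neuro-symbolic idea concrete, the minimal sketch below approximates the symbolic composition step in plain Python: StepGame-style relations are mapped to grid offsets and chained to answer a multi-hop query. The relation names, the offset convention, and the use of Python rather than the answer set programming encoding described in the thesis are illustrative assumptions only.

```python
# Illustrative sketch only: the thesis translates LLM-extracted facts into a logic
# program for an answer set programming reasoner; here the same multi-hop
# composition is approximated by summing unit offsets on a grid.
from typing import Dict, List, Tuple

# Assumed mapping from qualitative relations to (dx, dy) grid offsets.
OFFSETS: Dict[str, Tuple[int, int]] = {
    "left": (-1, 0), "right": (1, 0), "above": (0, 1), "below": (0, -1),
    "upper-left": (-1, 1), "upper-right": (1, 1),
    "lower-left": (-1, -1), "lower-right": (1, -1), "overlap": (0, 0),
}

def compose(chain: List[str]) -> str:
    """Compose a chain of relations (from the head object towards the tail object)
    by summing offsets, then map the net displacement back to a relation name."""
    dx = sum(OFFSETS[r][0] for r in chain)
    dy = sum(OFFSETS[r][1] for r in chain)
    sx, sy = (dx > 0) - (dx < 0), (dy > 0) - (dy < 0)  # sign of each axis
    for name, off in OFFSETS.items():
        if off == (sx, sy):
            return name
    return "unknown"

# Example: A is left of B, and B is above C  ->  A is upper-left of C.
print(compose(["left", "above"]))  # upper-left
```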
Secondly, the thesis investigates advanced prompting strategies, specifically Chain-of-Thought (CoT) and Tree-of-Thought (ToT) methods, to enhance LLMs' spatial reasoning capabilities. These strategies decompose complex reasoning tasks into manageable steps and yield substantial accuracy improvements on spatial reasoning benchmarks, particularly on complex, multi-hop tasks.
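As an illustration of the CoT idea (not the exact prompts used in the thesis), the sketch below builds a few-shot prompt that shows a worked, step-by-step spatial example before the target question. The story wording, relation vocabulary, and prompt phrasing are assumptions for illustration.

```python
# Minimal sketch of few-shot Chain-of-Thought prompting for a StepGame-style
# question; the worked example and wording below are illustrative only.
COT_EXAMPLE = """Story: The triangle is to the left of the circle.
The circle is above the square.
Question: Where is the triangle relative to the square?
Reasoning: Step 1: the triangle is one step left of the circle.
Step 2: the circle is one step above the square.
Step 3: combining both steps, the triangle is left of and above the square.
Answer: upper-left"""

def build_cot_prompt(story: str, question: str) -> str:
    """Prepend a worked example so the model reasons step by step
    before committing to a final answer."""
    return (
        f"{COT_EXAMPLE}\n\n"
        f"Story: {story}\nQuestion: {question}\n"
        "Reasoning: Let's think step by step."
    )

prompt = build_cot_prompt(
    "The star is below the moon. The moon is to the right of the sun.",
    "Where is the star relative to the sun?",
)
print(prompt)  # send to an LLM of choice; the expected answer is lower-right
```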
Thirdly, the thesis introduces a novel benchmark based on realistic 3D simulation data, featuring diverse room layouts with various objects and their spatial relationships. The benchmark encompasses a wide range of qualitative spatial relations, including topological, directional, and distance relations, and presents scenarios from different viewpoints to reflect real-world complexity. The benchmark-generation code is also available online, allowing further versions of the benchmark to be created. A further contribution is a logic-based consistency-checking tool that evaluates multiple plausible solutions, reflecting real-world scenarios where spatial relationships often have several valid interpretations.
This thesis advances the spatial reasoning abilities of LLMs by identifying deficiencies in current benchmarks and proposing practical enhancements. The combined approach of refining evaluation benchmarks and employing advanced prompting techniques paves the way for future advances in LLM-based spatial reasoning.
Metadata
| Supervisors: | Cohn, Anthony and Hogg, David |
|---|---|
| Related URLs: | |
| Keywords: | Spatial Reasoning, Large Language Models, LLM, Benchmark Evaluation, Neuro-Symbolic Reasoning, Multi-hop Reasoning, Chain-of-Thought, Tree-of-Thoughts |
| Awarding institution: | University of Leeds |
| Academic Units: | The University of Leeds > Faculty of Engineering (Leeds) > School of Computing (Leeds) |
| Depositing User: | Dr Fangjun Li |
| Date Deposited: | 17 Jul 2025 09:13 |
| Last Modified: | 17 Jul 2025 09:13 |
| Open Archives Initiative ID (OAI ID): | oai:etheses.whiterose.ac.uk:36677 |
Download
Final eThesis - complete (pdf)
Filename: Li_F_Computing_PhD_2025.pdf
Licence:
This work is licensed under a Creative Commons Attribution NonCommercial ShareAlike 4.0 International License