Using Large Language Models to Generate Counterexamples in Graph Theory
Published in Recent Advances in Natural Language Processing, 2025
Abstract
Automated mathematical reasoning has been a popular research topic in natural language processing (NLP) recently. Recent results are focused on constructive tasks like theorem proving, with less attention on other tasks such as counterexample generation, especially using large language models (LLMs). This task is overlooked, as counterexample generation plays a crucial role in many mathematical reasoning tasks. In this work, we construct and analyze two datasets, one of simple, and another of challenging, false graph-theoretic claims to evaluate the capabilities of LLMs on this task. A large proprietary model (GPT-4.1) falsified 85% of claims in the simple set, and 39.5% of claims in the challenge set. On the other hand, a leading open-source model (Qwen Math 2.5 7B) falsified less than 5% of claims in the simple set. We find that while LLMs excel at generating false claims, the primary bottleneck in creating such synthetic benchmarks at scale is claim verification. Guided search methods, including the deep cross-entropy method and evolutionary algorithms, proved computationally infeasible for disproving these claims. This work establishes a baseline for LLM-based counterexample generation in graph theory, highlighting that the verification bottleneck and significant disparities in reasoning capabilities between models are key challenges for the field.
