LLMs can plan. They just can’t reliably recover.

A new benchmark, LLM-WikiRace, reveals that the bottleneck is no longer knowledge. It’s adaptive planning.

The setup

  • Models must navigate from one Wikipedia page to another using only hyperlinks.
  • No global graph view.
  • 50 link options per step.
  • 30-step limit.
  • A real knowledge graph of 549,232 pages.
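To make the setup concrete, here is a minimal sketch of what such an environment could look like. It is not the benchmark's actual code: the class name `WikiRaceEnv`, the helper `run_episode`, the toy graph, and the choice to show only the first 50 links are all assumptions made for illustration. Only the 30-step limit, the 50-link cap, and the local-only view come from the description above.

```python
# Hypothetical toy link graph; the real benchmark uses Wikipedia pages.
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "E"],
    "E": [],
}


class WikiRaceEnv:
    """Partially observable navigation: the agent only sees local links."""

    def __init__(self, graph, start, target, max_steps=30, max_links=50):
        self.graph = graph
        self.target = target
        self.max_steps = max_steps   # step limit from the post
        self.max_links = max_links   # link options per step from the post
        self.current = start
        self.steps = 0
        self.path = [start]

    def observe(self):
        """Return the current page and its visible outgoing links.

        Assumption: when a page has more links than max_links, we simply
        keep the first ones. The benchmark may select them differently.
        """
        links = self.graph.get(self.current, [])[: self.max_links]
        return self.current, links

    def step(self, link):
        """Follow one visible link. Returns (done, success)."""
        _, links = self.observe()
        if link not in links:
            raise ValueError(f"{link!r} is not a visible link on {self.current!r}")
        self.current = link
        self.path.append(link)
        self.steps += 1
        success = self.current == self.target
        done = success or self.steps >= self.max_steps
        return done, success


def run_episode(env, policy):
    """Run one episode; policy(page, links, target) picks the next link."""
    done, success = False, False
    while not done:
        page, links = env.observe()
        if not links:  # dead end: no outgoing links to follow
            break
        done, success = env.step(policy(page, links, env.target))
    return success, env.path
```

In this sketch, a policy that always clicks the first link bounces between "A" and "B" until the step limit runs out, while one that always clicks the last link reaches "E" in three steps. The agent never sees the whole graph, only the links in front of it.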

This is open-domain, partially observable planning, the kind real AI systems face.

What happens?

Easy paths (3–4 steps): Top models exceed 90%. Gemini 3 hits 95%.

Medium paths (5–6 steps): Performance drops to ~50–66%.

Hard paths (7–8 steps): Best model reaches 23%. Most fall below 20%.

No model crosses 25% on the hardest split.
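One way to picture the splits: assuming the step counts above refer to the shortest link path between start and target (the post does not say so explicitly), a breadth-first search can bucket a start/target pair. The function names and the bucketing helper below are hypothetical, a sketch rather than the benchmark's procedure.

```python
from collections import deque


def shortest_path_length(graph, start, target):
    """Breadth-first search over outgoing links; returns None if unreachable."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in graph.get(page, []):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def difficulty(length):
    """Map a path length to the post's splits (assumed to be shortest paths)."""
    if length is None:
        return None
    if 3 <= length <= 4:
        return "easy"
    if 5 <= length <= 6:
        return "medium"
    if 7 <= length <= 8:
        return "hard"
    return None  # outside the ranges the post reports
```

Note that the searcher here sees the full graph. The models do not, which is exactly what makes the longer splits hard.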

Extend the horizon, and performance collapses.

Knowledge helps, but only to a point. After that, planning and adaptive control dominate.

In one trace, a model correctly infers that the village “Sječevač” is in the Balkans. It navigates through Belgium → Serbia, confidently reasoning that it is heading toward the right region. The problem? The village is actually in Croatia. The model spends the rest of the episode circling Serbian cities before time runs out.

  • It knew geography.
  • It reasoned step by step.
  • It never corrected course.

That’s the Planning Gap.

The real failure mode: looping

On hard tasks:

  • Loop frequency >80%
  • Recovery rates near zero
  • Some pages revisited 6–7 times
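Metrics like these can be computed from the visited-page trace alone. The sketch below assumes that an episode "loops" if it visits any page more than once, and that loop frequency is the share of episodes that do. Both definitions are assumptions, the benchmark's own may differ, and the function names are hypothetical.

```python
from collections import Counter


def loop_stats(path):
    """Summarize revisits in one episode's list of visited pages."""
    visits = Counter(path)
    return {
        # True if any page appears more than once in the trace
        "has_loop": any(n > 1 for n in visits.values()),
        # Highest number of visits to any single page
        "max_visits": max(visits.values(), default=0),
        # Steps that landed on an already-seen page
        "revisit_steps": len(path) - len(visits),
    }


def loop_frequency(paths):
    """Fraction of episodes that contain at least one revisit."""
    if not paths:
        return 0.0
    return sum(loop_stats(p)["has_loop"] for p in paths) / len(paths)
```

For a trace such as `["A", "B", "A", "B", "C"]`, this reports a loop, a maximum of 2 visits to a single page, and 2 steps spent on pages already seen.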

There’s a strong negative correlation between looping and success.

Even more striking: models often recognize in their reasoning that they are looping, and still fail to adapt.

Awareness ≠ correction.

What this reveals:

  • LLMs are strong at generating plans.
  • They often move toward broad “hub” pages to expand options.
  • They articulate forward-looking strategies.

But when a strategy fails, they struggle to pivot.

  • They recommit.
  • They oscillate.
  • They stay stuck.

Reinforcement fine-tuning dramatically improves success on easy tasks (22% → 67%).

It barely moves medium tasks. It does nothing for hard ones.

This benchmark mirrors:

  • Multi-step enterprise workflows
  • Legal reasoning chains
  • IT troubleshooting
  • Supply chain coordination
  • Autonomous agents in production

In all of these, initial planning is rarely the hardest part. Recovery is.

An AI that knows it is wrong but cannot recover is fragile.

Bottom line:

LLMs are brilliant short-horizon planners. They are not yet reliable long-horizon operators.

Reference: LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

#AI #AgenticAI #Planning #WomenInTech #LeanIn