
How Much Can You Ask an LLM to Track? Finding the Working Memory Cliff

Posted on: December 8, 2025

When building applications with LLMs, one of the most practical questions is: how much data can I ask the model to work with before it starts making mistakes?

This is not a criticism of LLMs; if you actually need data sorted, we already have sorting algorithms for that. The goal is to understand the model’s working memory limits so we can design better systems.

The Experiment

I asked GPT-5.1 to sort arrays of random integers (1–10,000) at various sizes, with 10 trials per size. The task is simple: return the array in ascending order. I then checked whether the output exactly matched the correctly sorted array.

List sorting is a nice proxy for “can the model keep track of all these items and manipulate them consistently,” and similar list-sorting tasks are used in research on language model reasoning and code execution.
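
For concreteness, here is a minimal sketch of the kind of trial harness this implies. It assumes the OpenAI Python client, a JSON-only reply, and the model identifier above; the prompt wording and helper names are illustrative, and the actual code is in the repository linked at the end of the post.

```python
# Minimal sketch of one trial: generate a random array, ask the model to sort
# it, and check for an exact match. Prompt wording and names are illustrative.
import json
import random

from openai import OpenAI

client = OpenAI()

def run_trial(size: int, model: str = "gpt-5.1") -> bool:
    data = [random.randint(1, 10_000) for _ in range(size)]
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Sort this array in ascending order and reply with only the "
                f"sorted JSON array, nothing else: {json.dumps(data)}"
            ),
        }],
    )
    try:
        answer = json.loads(response.choices[0].message.content)
    except (TypeError, json.JSONDecodeError):
        return False  # an unparseable reply counts as a failure
    return answer == sorted(data)

def pass_rate(size: int, trials: int = 10) -> float:
    """Fraction of trials that return an exactly sorted array."""
    return sum(run_trial(size) for _ in range(trials)) / trials
```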

Here are the results:

Size    Pass Rate    Pattern
10      100%         ✅✅✅✅✅✅✅✅✅✅
20      100%         ✅✅✅✅✅✅✅✅✅✅
30      90%          ✅✅✅✅✅✅❌✅✅✅
40      60%          ✅✅✅✅❌❌❌✅❌✅
50      50%          ❌❌✅✅✅✅❌❌✅❌
60      60%          ✅✅❌❌✅❌✅✅❌✅
70      20%          ❌✅❌❌❌❌❌✅❌❌
80      20%          ❌❌❌✅❌❌❌❌❌✅
90      30%          ✅❌❌❌❌❌❌✅❌✅
100     10%          ✅❌❌❌❌❌❌❌❌❌
110     10%          ❌✅❌❌❌❌❌❌❌❌
120     0%           ❌❌❌❌❌❌❌❌❌❌
130     0%           ❌❌❌❌❌❌❌❌❌❌
140     0%           ❌❌❌❌❌❌❌❌❌❌
150     0%           ❌❌❌❌❌❌❌❌❌❌

The Pattern

The “cliff” appears around 30–40 items. Beyond that, accuracy degrades rapidly.

This is consistent with broader evidence from long context evaluations. Studies like Lost in the Middle show that even when models can technically accept long inputs, their ability to reliably use information across many items or positions decays, especially when relevant information is buried in the middle of a long sequence. Benchmarks such as BABILong and similar suites also report that tasks involving long lists or many facts expose limits in how much structure a model can maintain at once.

The Human Comparison

Psychologist George Miller famously argued that human short-term memory holds about 7 ± 2 items. Later work by Nelson Cowan and others refined that estimate, suggesting a core capacity closer to 4 ± 1 chunks once rehearsal and grouping strategies are controlled for.

If you asked a person to look at 100 random numbers once and then write them in sorted order, they would likely fail unless they used external tools or very clever chunking.

By that standard, a model that can reliably sort lists of 30 random numbers, and still manage 50 or more about half the time, performs remarkably well.

This is not a failure. It is an impressive working memory capacity for a purely text-based model that is not explicitly designed as a symbolic working memory system. Recent work on “LLM working memory” reaches a similar conclusion: models can juggle several interacting pieces of information, but tasks that require consistent tracking of many distinct items or time intervals push them into error quite quickly.

The real question is how to design systems that work with these limits rather than against them.

Practical Recommendations

When building LLM-powered applications, consider these guidelines:

≤30 items: Direct processing

You can usually ask the model to work with this data directly in a single prompt; a hypothetical example follows below.

This is where the model’s “native” working memory is strongest and where experiments like the one above show near perfect behaviour.
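
As an illustration (not part of the experiment), a direct single-prompt call for a short list might look like this; the task, prompt, and model identifier are assumptions:

```python
# Hypothetical direct processing of a small list (~20 items) in one prompt.
import json

from openai import OpenAI

client = OpenAI()

def dedupe_and_alphabetize(items: list[str]) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-5.1",  # assumed model identifier, as in the experiment above
        messages=[{
            "role": "user",
            "content": (
                "Remove duplicates from this list, sort it alphabetically, and "
                f"reply with only a JSON array: {json.dumps(items)}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```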

30–100 items: Expect errors, consider chunking

The model will still work, but you should expect occasional mistakes and subtle inconsistencies. For critical tasks, verify the model’s output and consider splitting the data into chunks it can handle reliably, merging the results in code (a sketch follows below).

Long context research suggests that context windows are not “flat RAM.” Position and salience matter; models tend to use information at the beginning and end of a long prompt more reliably than details buried in the middle. Designing prompts and chunking schemes with that in mind often matters more than simply increasing context length.
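
Here is a sketch of what chunk-and-verify could look like for the sorting task above: the model handles chunks of at most ~30 items, each chunk is verified deterministically, and the merge happens in code. The function names are illustrative, not from the benchmark repository.

```python
# Chunk-and-verify sketch: keep each model call inside the reliable range,
# check its work, and do the merge deterministically.
import heapq
from collections import Counter
from typing import Callable

CHUNK_SIZE = 30  # roughly where the experiment above was still near perfect

def sort_with_chunking(
    data: list[int],
    llm_sort_chunk: Callable[[list[int]], list[int]],  # e.g. a run_trial-style prompt
) -> list[int]:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    sorted_chunks = []
    for chunk in chunks:
        result = llm_sort_chunk(chunk)
        # Verify the model's work: same multiset of elements, correctly ordered.
        if Counter(result) != Counter(chunk) or result != sorted(result):
            result = sorted(chunk)  # deterministic fallback for critical tasks
        sorted_chunks.append(result)
    # Merge the verified chunks in code, not in the model.
    return list(heapq.merge(*sorted_chunks))
```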

100+ items: Use tools or code execution

At this point you should treat the LLM as a planner and orchestrator, not as the thing that manipulates every item directly.

Instead, have the model plan the work and delegate the item-level manipulation to code or tool calls (a sketch follows below).

There is a growing body of work on tool use and code execution that explicitly follows this pattern. Toolformer trains language models to decide when to call external tools such as calculators and search APIs instead of trying to carry out those operations in tokens. Other work integrates dedicated calculator or code execution modules into LLM systems to get reliable arithmetic and list operations where plain prompting is brittle. Numeric benchmarks like NumericBench and related studies repeatedly find that even strong LLMs are surprisingly error prone on arithmetic and numerical reasoning, which further supports the case for tool use rather than pure text based computation.
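
A minimal sketch of that pattern, assuming the OpenAI chat-completions tool-calling interface; the `sort_numbers` tool name and schema are made up for illustration:

```python
# The model plans the call; deterministic code performs the computation.
import json

from openai import OpenAI

client = OpenAI()

SORT_TOOL = {
    "type": "function",
    "function": {
        "name": "sort_numbers",
        "description": "Sort a list of integers in ascending order.",
        "parameters": {
            "type": "object",
            "properties": {
                "numbers": {"type": "array", "items": {"type": "integer"}},
            },
            "required": ["numbers"],
        },
    },
}

def sort_via_tool(data: list[int], model: str = "gpt-5.1") -> list[int]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Sort these numbers: {data}"}],
        tools=[SORT_TOOL],
        tool_choice={"type": "function", "function": {"name": "sort_numbers"}},
    )
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    return sorted(args["numbers"])  # the actual sort never happens in tokens
```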

The Bigger Picture

This experiment reinforces a key insight about LLM system design:

The model should describe the computation, not perform it.

You can think of the model as an expert that knows what needs to be done, how to decompose the problem, and how to wire together tools that will execute the plan reliably.
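
Concretely, “describe the computation” can mean having the model emit a small program that the host then executes. The sketch below assumes the model returns bare Python; the plain exec() is only to keep the example short, and in a real system the generated code should run in a sandboxed interpreter.

```python
# The model describes the computation as code; the host performs it.
from openai import OpenAI

client = OpenAI()

def generate_transform(model: str = "gpt-5.1") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Write a Python function transform(data: list[int]) -> list[int] "
                "that returns the list sorted in ascending order. "
                "Reply with only the code, no markdown."
            ),
        }],
    )
    return response.choices[0].message.content or ""

def run_generated(code: str, data: list[int]) -> list[int]:
    namespace: dict = {}
    exec(code, namespace)  # illustration only; use a sandbox in practice
    return namespace["transform"](data)
```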

This is exactly the direction that research systems have been moving in: LLMs are increasingly used as controllers or “central executives” that orchestrate external tools, rather than as monolithic black boxes that must do everything internally.

The best LLM applications reflect this architecture and use the model for reasoning and orchestration, while delegating data-heavy operations to code execution, databases, search, and specialized services.

Conclusion

Understanding LLM working memory limits is not about finding failure cases. It is about building better systems.

Just as we do not criticize calculators for not writing poetry, we should not expect LLMs to be high-throughput data processing engines.

The sweet spot is clear: handle small lists (≤30 items) directly, chunk and verify mid-sized ones, and hand anything larger to tools or code execution.

When you structure your applications this way, you get the best of both worlds: the flexibility of natural language reasoning and the reliability of deterministic computation.


The code for this experiment is available at github.com/irbull/llm-codegen-benchmark.

Further Reading

A few of the references that inform the discussion above: