
How Much Can You Ask an LLM to Track? Finding the Working Memory Cliff

Posted on: December 8, 2025

When building applications with LLMs, one of the most practical questions is: how much data can I ask the model to work with before it starts making mistakes?

This is not a criticism of LLMs; if you actually need data sorted, we already have sorting algorithms for that. The goal is to understand the model’s working memory limits so we can design better systems.

The Experiment

I asked GPT-5.1 to sort arrays of random integers (1–10,000) at various sizes, with 10 trials per size. The task is simple: return the array in ascending order. I then checked whether the output exactly matched the correctly sorted array.

List sorting is a nice proxy for “can the model keep track of all these items and manipulate them consistently,” and similar list-sorting tasks are used in research on language model reasoning and code execution.
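
For concreteness, here is a minimal sketch of the kind of trial harness this implies. It assumes the OpenAI Python client, a JSON-only reply, and the model identifier above; the prompt wording and helper names are illustrative, and the actual code is in the repository linked at the end of the post.

```python
# Minimal sketch of one trial: generate a random array, ask the model to sort
# it, and check for an exact match. Prompt wording and names are illustrative.
import json
import random

from openai import OpenAI

client = OpenAI()

def run_trial(size: int, model: str = "gpt-5.1") -> bool:
    data = [random.randint(1, 10_000) for _ in range(size)]
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Sort this array in ascending order and reply with only the "
                f"sorted JSON array, nothing else: {json.dumps(data)}"
            ),
        }],
    )
    try:
        answer = json.loads(response.choices[0].message.content)
    except (TypeError, json.JSONDecodeError):
        return False  # an unparseable reply counts as a failure
    return answer == sorted(data)

def pass_rate(size: int, trials: int = 10) -> float:
    """Fraction of trials that return an exactly sorted array."""
    return sum(run_trial(size) for _ in range(trials)) / trials
```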

Here are the results:

Size    Pass Rate    Pattern
10      100%         ✅✅✅✅✅✅✅✅✅✅
20      100%         ✅✅✅✅✅✅✅✅✅✅
30      90%          ✅✅✅✅✅✅❌✅✅✅
40      60%          ✅✅✅✅❌❌❌✅❌✅
50      50%          ❌❌✅✅✅✅❌❌✅❌
60      60%          ✅✅❌❌✅❌✅✅❌✅
70      20%          ❌✅❌❌❌❌❌✅❌❌
80      20%          ❌❌❌✅❌❌❌❌❌✅
90      30%          ✅❌❌❌❌❌❌✅❌✅
100     10%          ✅❌❌❌❌❌❌❌❌❌
110     10%          ❌✅❌❌❌❌❌❌❌❌
120     0%           ❌❌❌❌❌❌❌❌❌❌
130     0%           ❌❌❌❌❌❌❌❌❌❌
140     0%           ❌❌❌❌❌❌❌❌❌❌
150     0%           ❌❌❌❌❌❌❌❌❌❌

The Pattern

The “cliff” appears around 30–40 items. Beyond that, accuracy degrades rapidly.

This is consistent with broader evidence from long context evaluations. Studies like Lost in the Middle show that even when models can technically accept long inputs, their ability to reliably use information across many items or positions decays, especially when relevant information is buried in the middle of a long sequence. Benchmarks such as BABILong and similar suites also report that tasks involving long lists or many facts expose limits in how much structure a model can maintain at once.

The Human Comparison

Psychologist George Miller famously argued that human short-term memory holds about 7 ± 2 items. Later work by Nelson Cowan and others refined that estimate, suggesting a core capacity closer to 4 ± 1 chunks once rehearsal and grouping strategies are controlled for.

If you asked a person to look at 100 random numbers once and then write them in sorted order, they would likely fail unless they used external tools or very clever chunking.

By that standard, a model that can reliably sort lists of 30 random numbers, and still manage 50 or more about half the time, performs remarkably well.

This is not a failure. It is an impressive working memory capacity for a purely text-based model that is not explicitly designed as a symbolic working memory system. Recent work on “LLM working memory” reaches a similar conclusion: models can juggle several interacting pieces of information, but tasks that require consistent tracking of many distinct items or time intervals push them into error quite quickly.

The real question is how to design systems that work with these limits rather than against them.

Practical Recommendations

When building LLM-powered applications, consider these guidelines:

≤30 items: Direct processing

You can usually ask the model to work with this data directly in a single prompt; a hypothetical example follows below.

This is where the model’s “native” working memory is strongest and where experiments like the one above show near perfect behaviour.
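
As an illustration (not part of the experiment), a direct single-prompt call for a short list might look like this; the task, prompt, and model identifier are assumptions:

```python
# Hypothetical direct processing of a small list (~20 items) in one prompt.
import json

from openai import OpenAI

client = OpenAI()

def dedupe_and_alphabetize(items: list[str]) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-5.1",  # assumed model identifier, as in the experiment above
        messages=[{
            "role": "user",
            "content": (
                "Remove duplicates from this list, sort it alphabetically, and "
                f"reply with only a JSON array: {json.dumps(items)}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)
```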

30–100 items: Expect errors, consider chunking

The model will still work, but you should expect occasional mistakes and subtle inconsistencies. For critical tasks, verify the model’s output and consider splitting the data into chunks it can handle reliably, merging the results in code (a sketch follows below).

Long context research suggests that context windows are not “flat RAM.” Position and salience matter; models tend to use information at the beginning and end of a long prompt more reliably than details buried in the middle. Designing prompts and chunking schemes with that in mind often matters more than simply increasing context length.
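
Here is a sketch of what chunk-and-verify could look like for the sorting task above: the model handles chunks of at most ~30 items, each chunk is verified deterministically, and the merge happens in code. The function names are illustrative, not from the benchmark repository.

```python
# Chunk-and-verify sketch: keep each model call inside the reliable range,
# check its work, and do the merge deterministically.
import heapq
from collections import Counter
from typing import Callable

CHUNK_SIZE = 30  # roughly where the experiment above was still near perfect

def sort_with_chunking(
    data: list[int],
    llm_sort_chunk: Callable[[list[int]], list[int]],  # e.g. a run_trial-style prompt
) -> list[int]:
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    sorted_chunks = []
    for chunk in chunks:
        result = llm_sort_chunk(chunk)
        # Verify the model's work: same multiset of elements, correctly ordered.
        if Counter(result) != Counter(chunk) or result != sorted(result):
            result = sorted(chunk)  # deterministic fallback for critical tasks
        sorted_chunks.append(result)
    # Merge the verified chunks in code, not in the model.
    return list(heapq.merge(*sorted_chunks))
```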

100+ items: Use tools or code execution

At this point you should treat the LLM as a planner and orchestrator, not as the thing that manipulates every item directly.

Instead, have the model plan the work and delegate the item-level manipulation to code or tool calls (a sketch follows below).

There is a growing body of work on tool use and code execution that explicitly follows this pattern. Toolformer trains language models to decide when to call external tools such as calculators and search APIs instead of trying to carry out those operations in tokens. Other work integrates dedicated calculator or code execution modules into LLM systems to get reliable arithmetic and list operations where plain prompting is brittle. Numeric benchmarks like NumericBench and related studies repeatedly find that even strong LLMs are surprisingly error prone on arithmetic and numerical reasoning, which further supports the case for tool use rather than pure text based computation.
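
A minimal sketch of that pattern, assuming the OpenAI chat-completions tool-calling interface; the `sort_numbers` tool name and schema are made up for illustration:

```python
# The model plans the call; deterministic code performs the computation.
import json

from openai import OpenAI

client = OpenAI()

SORT_TOOL = {
    "type": "function",
    "function": {
        "name": "sort_numbers",
        "description": "Sort a list of integers in ascending order.",
        "parameters": {
            "type": "object",
            "properties": {
                "numbers": {"type": "array", "items": {"type": "integer"}},
            },
            "required": ["numbers"],
        },
    },
}

def sort_via_tool(data: list[int], model: str = "gpt-5.1") -> list[int]:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Sort these numbers: {data}"}],
        tools=[SORT_TOOL],
        tool_choice={"type": "function", "function": {"name": "sort_numbers"}},
    )
    call = response.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)
    return sorted(args["numbers"])  # the actual sort never happens in tokens
```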

The Bigger Picture

This experiment reinforces a key insight about LLM system design:

The model should describe the computation, not perform it.

You can think of the model as an expert that knows what needs to be done, how to decompose the problem, and how to wire together tools that will execute the plan reliably.
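
Concretely, “describe the computation” can mean having the model emit a small program that the host then executes. The sketch below assumes the model returns bare Python; the plain exec() is only to keep the example short, and in a real system the generated code should run in a sandboxed interpreter.

```python
# The model describes the computation as code; the host performs it.
from openai import OpenAI

client = OpenAI()

def generate_transform(model: str = "gpt-5.1") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Write a Python function transform(data: list[int]) -> list[int] "
                "that returns the list sorted in ascending order. "
                "Reply with only the code, no markdown."
            ),
        }],
    )
    return response.choices[0].message.content or ""

def run_generated(code: str, data: list[int]) -> list[int]:
    namespace: dict = {}
    exec(code, namespace)  # illustration only; use a sandbox in practice
    return namespace["transform"](data)
```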

This is exactly the direction that research systems have been moving in: LLMs are increasingly used as controllers or “central executives” that orchestrate external tools, rather than as monolithic black boxes that must do everything internally.

The best LLM applications reflect this architecture and use the model for reasoning and orchestration, while delegating data-heavy operations to code execution, databases, search, and specialized services.

Conclusion

Understanding LLM working memory limits is not about finding failure cases. It is about building better systems.

Just as we do not criticize calculators for not writing poetry, we should not expect LLMs to be high-throughput data processing engines.

The sweet spot is clear: handle small lists (≤30 items) directly, chunk and verify mid-sized ones, and hand anything larger to tools or code execution.

When you structure your applications this way, you get the best of both worlds: the flexibility of natural language reasoning and the reliability of deterministic computation.


The code for this experiment is available at github.com/irbull/llm-codegen-benchmark.

Further Reading

A few of the references that inform the discussion above: