Anthropic recently announced that it has acquired Bun, the ultra-fast JavaScript and TypeScript runtime. On the surface this looks like an investment in developer tooling for Claude Code. Under the surface it signals something much larger.
LLMs are about to gain first-class, built-in computation.
This shift has major implications for how we design agentic systems, how we reason about tool calling, and how we build AI-powered workflows.
In this post I explain why having a JavaScript runtime inside the inference engine will fundamentally change the way we build with AI. I also explain why generating and executing small TypeScript snippets is often better than making multiple tool calls through MCP or similar mechanisms.
1. Tool Calls Are Useful, but They Are Not Always the Right Abstraction
The Model Context Protocol (MCP) is currently the standard way to allow AI models to interact with external systems. Tools are essential for reaching out to APIs, databases, ticketing systems, and any other environment where real-world effects matter.
However, many everyday tasks do not need external systems. Examples include:
- Filtering a dataset
- Computing aggregates
- Running a quick simulation
- Transforming nested JSON
- Reconciling intermediate results
Tool calls feel heavy for these types of operations. Each tool requires a schema, validation logic, serialization, a round trip to a server, and error handling. In practice this can turn simple computational steps into a full RPC workflow.
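To make that overhead concrete, here is a rough sketch of what wrapping even a simple filter-and-average operation in a dedicated tool can look like. The tool name, schema shape, and handler are hypothetical and are not any particular MCP SDK's API:

// Hypothetical tool definition: the schema, validation, and transport all
// have to exist before the model can use the operation.
const filterEventsTool = {
  name: "filter_and_average_events",
  description: "Filter events by a minimum duration and average them per user",
  inputSchema: {
    type: "object",
    properties: {
      events: { type: "array", items: { type: "object" } },
      minDuration: { type: "number" }
    },
    required: ["events", "minDuration"]
  }
};

// Hypothetical server-side handler the tool call is routed to.
async function handleFilterEvents(args: {
  events: { userId: string; duration: number }[];
  minDuration: number;
}) {
  const kept = args.events.filter(e => e.duration >= args.minDuration);
  // ...aggregate, serialize the result, and send it back over the wire
  return JSON.stringify(kept);
}

All of this machinery exists only to move a few lines of computation across a process boundary.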
Meanwhile the model already knows how to express these operations in JavaScript and TypeScript. There is no need to force it through a tool interface for something it could express directly.
This is where a built-in JavaScript runtime changes everything.
2. Code Snippets as Core Reasoning Primitives
Suppose that, as part of a larger task, an LLM needs to:
“Filter out all events where duration < 50ms, group by userId, and compute averages.”
Instead of orchestrating several tool calls, the model could simply write:
const filtered = events.filter(e => e.duration >= 50);
const grouped = Object.groupBy(filtered, e => e.userId);
const result = Object.entries(grouped).map(([id, group]) => ({
id,
avg: group.reduce((a, b) => a + b.duration, 0) / group.length
}));
A sandboxed runtime executes this snippet, the model reads the result, and the reasoning continues.
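What does "executes this snippet" mean in practice? A minimal sketch of the host side, assuming the snippet arrives as a string and prints its result as JSON; the function name and conventions are illustrative, not Claude Code's actual mechanism, and sandboxing, timeouts, and error handling are omitted:

import { tmpdir } from "node:os";
import { join } from "node:path";

// Minimal sketch: persist the model-generated snippet and run it in a
// separate Bun process, then hand the output back to the model.
async function runSnippet(snippet: string): Promise<unknown> {
  const file = join(tmpdir(), `snippet-${crypto.randomUUID()}.ts`);
  await Bun.write(file, snippet);

  const proc = Bun.spawn(["bun", "run", file], { stdout: "pipe", stderr: "pipe" });
  const output = await new Response(proc.stdout).text();
  await proc.exited;

  // The snippet is expected to end with console.log(JSON.stringify(result)).
  return JSON.parse(output);
}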
This approach is often:
- Faster, because no network hop is required
- Simpler, because there is no schema to maintain
- More flexible, because the model is already fluent in TS
- More composable, because snippets can be chained together in a single environment
For many tasks it is more natural to let the model generate a small piece of code and run it directly. In effect, JavaScript becomes the computation tool by default.
3. The Context Window Is Not for Data
One might ask: “Why not just load the data into the model’s context window and let it compute directly?”
Modern models have context windows of 100K, 200K, or even 1M+ tokens. But this does not mean we should fill them with raw data. Here is why.
The math does not work
Consider our event filtering example with real-world scale:
| Events | ~Tokens (at 50 per event) | Fits in 128K context? |
|---|---|---|
| 100 | 5,000 | ✅ Yes |
| 1,000 | 50,000 | ⚠️ Barely |
| 10,000 | 500,000 | ❌ No |
| 1,000,000 | 50,000,000 | ❌ Impossible |
Even if the data fits, the model struggles to perform reliable arithmetic across thousands of items. LLMs are not calculators—they are pattern matchers trained on text.
Empirical evidence
I tested this directly using GPT-5.1 with the same filter-group-average task:
| Events | Tokens Used | Latency | Accuracy | Status |
|---|---|---|---|---|
| 10 | 282 | 1.7s | 100% | ✅ Correct |
| 100 | 1,511 | 3.8s | 90% | ⚠️ Partial errors |
| 1,000 | 10,614 | 7.1s | 0% | ❌ Wrong |
At just 100 events, accuracy begins to degrade. At 1,000 events, the model produces completely wrong results. Even the latest GPT-5.1 cannot reliably perform arithmetic across large datasets.
The code-generation alternative
Compare this to the code-generation approach, where the model writes a TypeScript snippet and a runtime like Bun executes it:
| Events | Tokens Used | Latency | Accuracy |
|---|---|---|---|
| 100 | 360* | 44ms | 100% |
| 1,000 | — | 37ms | 100% |
| 10,000 | — | 41ms | 100% |
| 100,000 | — | 58ms | 100% |
| 1,000,000 | — | 228ms | 100% |
*Code generation is a one-time cost (~2.5s). The same code executes on all dataset sizes.
The model generates 360 tokens of code once. The runtime handles the rest with perfect accuracy, regardless of dataset size.
Here is the actual code GPT-5.1 generated:
const result = Object.entries(
events
.filter(event => event.duration >= 50)
.reduce<Record<string, { sum: number; count: number }>>((acc, { userId, duration }) => {
if (!acc[userId]) {
acc[userId] = { sum: 0, count: 0 };
}
acc[userId].sum += duration;
acc[userId].count += 1;
return acc;
}, {})
).map(([id, { sum, count }]) => ({
id,
avg: Math.round((sum / count) * 100) / 100
})).sort((a, b) => a.id.localeCompare(b.id));
This is the key insight: the model should describe the computation, not perform it.
4. Streaming Data Formats and the Runtime
Real-world data often lives in streaming-friendly formats like Apache Avro, Parquet, or NDJSON. These formats are designed for high-throughput data pipelines, not for loading into LLM context windows.
A runtime like Bun can read these formats directly:
import avro from "avsc";

// Shape of the records stored in the Avro file (assumed for this example)
interface Event {
  userId: string;
  duration: number;
}

// Stream events from an Avro file
const decoder = avro.createFileDecoder("events.avro");
const events: Event[] = [];

decoder.on("data", (record: Event) => events.push(record));
decoder.on("end", () => {
  // Process with the generated code
  const filtered = events.filter(e => e.duration >= 50);
  const grouped = Object.groupBy(filtered, e => e.userId);
  // ...
});
This pattern is powerful for several reasons:
- Separation of concerns: The LLM generates processing logic; the runtime handles I/O
- Scalability: Streaming processes data incrementally without loading everything into memory
- Compatibility: Works with the same data formats used in production pipelines (Kafka, Spark, etc.)
The model does not need to see your data. It just needs to describe how to process it. The runtime handles the rest.
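On the scalability point above, the same aggregation can also run record by record while the stream is still arriving, instead of buffering every event first, so memory stays flat even for very large files. A sketch of that incremental variant, assuming the same Event shape as above:

import avro from "avsc";

// Incremental variant: aggregate while streaming, never holding all events in memory.
const totals: Record<string, { sum: number; count: number }> = {};

const decoder = avro.createFileDecoder("events.avro");
decoder.on("data", (event: { userId: string; duration: number }) => {
  if (event.duration < 50) return;                        // filter
  const t = (totals[event.userId] ??= { sum: 0, count: 0 });
  t.sum += event.duration;                                // accumulate per user
  t.count += 1;
});
decoder.on("end", () => {
  const result = Object.entries(totals).map(([id, { sum, count }]) => ({
    id,
    avg: sum / count
  }));
  console.log(result);
});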
5. Why Bun Fits This Future Better Than Other Runtimes
Node or Deno could have served as the execution layer, but Bun brings several characteristics that make it ideal for LLM integration.
✔ Very fast startup time
Agent loops often involve many short snippets. Cold-start cost matters a great deal.
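A rough way to see this for yourself, as a sketch that assumes bun is on the PATH and a writable /tmp (absolute numbers will vary by machine):

// Rough cold-start check: time a fresh Bun process running a trivial snippet.
await Bun.write("/tmp/noop.ts", "console.log(1 + 1);");

const start = performance.now();
Bun.spawnSync(["bun", "run", "/tmp/noop.ts"]);
console.log(`spawn + run: ${(performance.now() - start).toFixed(1)} ms`);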
✔ A single binary that includes runtime, bundler, package manager, and test runner
This creates a small, predictable surface area for embedding in an AI system.
✔ Excellent TypeScript support
Since LLMs are already strong at TypeScript, the feedback loop is smooth and efficient.
✔ Used internally by Claude Code today
Anthropic confirmed that Bun already powers Claude Code’s execution engine. Acquiring Bun gives Anthropic deep control over the entire “generate code, run code, reflect on output” process.
✔ A foundation for AI-specific extensions
Once Anthropic controls the runtime, they can introduce:
- Safer filesystem operations
- Sandboxed networking
- Custom developer-focused syscalls
- Better observability hooks
This turns Bun into an AI-native computation substrate, rather than simply a faster replacement for Node.
6. Computation During the Thinking Stage
Hidden reasoning traces, sometimes called “thinking tokens,” offer another interesting opportunity. Nothing prevents an LLM from generating a hypothesis and testing it with a short snippet of JavaScript during its internal reasoning process.
The model can:
- Form an idea
- Write a small piece of code
- Run it inside the runtime
- Observe the output
- Update its idea before responding to the user
This allows for:
- Internal self-debugging
- Precise numerical computation
- Small internal simulations
- Iterative refinement inside the chain of thought
The result is a hybrid form of reasoning where statistical inference and program execution support one another.
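As an illustration, a claim like “a sum of 7 is the most common roll of two dice” is something the model could check with a throwaway snippet during thinking rather than reason about token by token. A minimal sketch of that kind of internal check:

// Tiny internal simulation: which two-dice sum comes up most often?
const roll = () => 1 + Math.floor(Math.random() * 6);
const counts = new Array(13).fill(0);
for (let i = 0; i < 100_000; i++) counts[roll() + roll()]++;

const mostLikely = counts.indexOf(Math.max(...counts));
console.log({ mostLikely, share: counts[mostLikely] / 100_000 });
// Expected: mostLikely === 7, share ≈ 0.167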
7. When Code Snippets Are Better Than Tools, and When Tools Still Matter
A clear boundary is beginning to emerge.
Choose JavaScript snippets for:
- Transforming data you already have
- Pure computations
- Local reasoning
- Small brute-force or search tasks
- Formatting or parsing text
- Building glue logic around tool outputs
Choose MCP tools for:
- Interacting with external systems
- Operations that affect real data
- Long-running or heavyweight tasks
- Workflows that must be auditable and deterministic
- Systems used by multiple teams
Many agent designs spend tool calls on tasks that do not require them. A built-in runtime solves this by allowing pure computation to stay local.
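Here is a small sketch of how the two sides can combine: a hypothetical MCP tool reaches the external ticketing system, and a locally executed snippet handles the glue logic. The tool name, ticket shape, and callTool stand-in are all illustrative:

// Hypothetical stand-in for an MCP tool invocation; a real agent would
// forward this to its MCP client.
type Ticket = { assignee: string; ageDays: number; priority: "low" | "high" };

async function callTool(name: string, args: unknown): Promise<Ticket[]> {
  return []; // placeholder
}

// The tool talks to the external system...
const tickets = await callTool("ticketing.search", { status: "open" });

// ...and the model writes local glue logic instead of a second tool call.
const overduePerAssignee = Object.entries(
  Object.groupBy(
    tickets.filter(t => t.priority === "high" && t.ageDays > 7),
    t => t.assignee
  )
).map(([assignee, list]) => ({ assignee, overdue: (list ?? []).length }));

console.log(overduePerAssignee);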
8. The Larger Direction This Signals
If we assemble all of these pieces, a future architecture becomes easy to imagine.
- The model proposes a next action.
- When that action is pure computation, the model writes a snippet and the runtime executes it, returning exact results.
- The model uses those results to refine its plan.
- The model only reaches for MCP tools when external systems or side effects are required.
In this world, internal computation becomes cheap and reliable. It becomes part of the inference loop itself rather than a separate system.
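One way to picture that loop in code, with every helper deliberately left as a placeholder (the model client, the snippet runner sketched earlier, and the MCP dispatch are not real APIs):

// Hypothetical agent loop: pure computation stays local, external effects
// go through MCP tools. All declared helpers are placeholders.
type Action =
  | { kind: "run_snippet"; code: string }
  | { kind: "call_tool"; name: string; args: unknown }
  | { kind: "respond"; text: string };

declare function proposeNextAction(history: unknown[]): Promise<Action>;   // model call
declare function runSnippet(code: string): Promise<unknown>;               // local runtime
declare function callTool(name: string, args: unknown): Promise<unknown>;  // MCP dispatch

async function agentLoop(task: string): Promise<string> {
  const history: unknown[] = [task];
  while (true) {
    const action = await proposeNextAction(history);
    if (action.kind === "respond") return action.text;

    const observation =
      action.kind === "run_snippet"
        ? await runSnippet(action.code)             // exact, cheap, no network hop
        : await callTool(action.name, action.args); // real side effects, audited

    history.push(action, observation);              // feed results back to the model
  }
}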
Anthropic’s acquisition of Bun strongly suggests they see this future as well. It is a natural evolution of agentic AI systems, and it opens the door to much more capable reasoning.
Allowing an LLM to run its own code in a safe, fast, local environment is a foundational upgrade to what these models can do.
Final Thoughts
I have spent a lot of time working with LLM-driven agents, tool calling, and context engineering. Every year we get a step closer to models that not only describe solutions but also compute and verify them.
Adding a runtime like Bun directly into the inference engine is a major step. Models gain the ability to execute precise logic whenever they need it, even during internal thinking.
This is a significant shift, and I believe it is only the beginning.
The code used in this post is available on GitHub: https://github.com/irbull/llm-codegen-benchmark
Or just wait for the Fireship video to drop 😉