Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
By default, an agent calls tools sequentially: the model requests call 1, waits for the result, thinks, then requests call 2, and so on. For independent tools that waiting is wasted time, and running them in parallel cuts latency substantially. Modern Claude and GPT models can emit multiple tool_use blocks in a single response; you just need to execute them concurrently and return all the results as tool_result blocks in one message. On multi-tool queries this is often a nearly free 2-5x latency improvement.
When the model emits 2-4 tool_use blocks in one response, run them in parallel (asyncio, a thread pool, whatever your stack uses). Collect all the results and send them back as a list of tool_result blocks in one message. The model then has all the data and can produce the answer in one final call. Compare: 'tell me the weather in Paris and Berlin and the time in Tokyo' takes 1 LLM call + 3 parallel tool calls + 1 LLM call, roughly 3 seconds, versus 6+ seconds when every tool call gets its own round trip.
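The timing difference described above can be reproduced without any API calls. The following is a minimal sketch, assuming a hypothetical `fake_tool` that just sleeps to stand in for network I/O; the tool names and the 50 ms delay are illustrative, not measurements.

```python
import asyncio
import time

TOOL_DELAY_S = 0.05  # assumed stand-in for a slow network call


async def fake_tool(name: str) -> str:
    """Hypothetical tool: sleeps to simulate I/O, then returns a label."""
    await asyncio.sleep(TOOL_DELAY_S)
    return f"{name}: ok"


async def run_sequential(names):
    # One call at a time: total time is roughly the sum of the delays.
    return [await fake_tool(n) for n in names]


async def run_parallel(names):
    # All calls in flight at once: total time is roughly the slowest delay.
    # asyncio.gather returns results in the same order as its arguments.
    return list(await asyncio.gather(*(fake_tool(n) for n in names)))


def timed(coro_fn, names):
    """Run an async function to completion and return (results, seconds)."""
    start = time.perf_counter()
    results = asyncio.run(coro_fn(names))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    names = ["weather_paris", "weather_berlin", "time_tokyo"]
    _, seq_s = timed(run_sequential, names)
    _, par_s = timed(run_parallel, names)
    print(f"sequential: {seq_s:.2f}s  parallel: {par_s:.2f}s")
```

Because `asyncio.gather` preserves argument order, the results can later be zipped back onto the tool_use blocks that requested them.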
Set disable_parallel_tool_use=True (in the Anthropic API, a field inside tool_choice) for any flow where ordering matters (e.g. sequential writes). Most flows don't need it.
Use these three in order. Each builds on the one before.
What does parallel tool calling do and what's the latency win?
Walk me through how the model decides to emit multiple tool_use blocks vs sequential. Does it depend on the prompt, the model, or the task?
I have an agent that calls 3 read tools then 1 write tool. Help me design the orchestration so reads parallelize and the write is sequential, and the model can't accidentally parallelize the write.
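One possible shape for the orchestration in the last prompt is sketched below, as a starting point rather than a definitive answer. The tool names, the `WRITE_TOOLS` set, and `run_tool` are all hypothetical. The sketch guards at the executor level: even if the model emits a write alongside reads, the write never runs concurrently. It assumes the reads do not depend on the write, since it runs all reads first. `SERIAL_TOOL_CHOICE` shows the API-level guard, the `tool_choice` value that carries `disable_parallel_tool_use`.

```python
import asyncio

# Hypothetical split: any tool named here is treated as a write.
WRITE_TOOLS = {"update_record"}

# API-level guard: pass this as tool_choice on turns where you want the
# model to emit at most one tool_use block per response.
SERIAL_TOOL_CHOICE = {"type": "auto", "disable_parallel_tool_use": True}


async def run_tool(name, args, log):
    """Hypothetical tool runner: simulates I/O and records completion order."""
    await asyncio.sleep(0.01)
    log.append(name)
    return {"tool": name, "ok": True}


async def execute_calls(calls, log):
    """calls is a list of (tool_use_id, name, args); returns {tool_use_id: result}."""
    reads = [c for c in calls if c[1] not in WRITE_TOOLS]
    writes = [c for c in calls if c[1] in WRITE_TOOLS]
    results = {}

    # Reads are independent, so run them concurrently.
    read_results = await asyncio.gather(
        *(run_tool(name, args, log) for _, name, args in reads)
    )
    for (call_id, _, _), result in zip(reads, read_results):
        results[call_id] = result

    # Writes run one at a time, in the order the model emitted them.
    for call_id, name, args in writes:
        results[call_id] = await run_tool(name, args, log)
    return results
```

Keying results by tool_use_id means the tool_result blocks can still be built in the model's original order, whatever order the tools actually ran in.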
import asyncio
import json

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

# TOOLS (the tool schemas), fetch_weather and fetch_time are assumed to be
# defined elsewhere in main.py.


async def execute_tool(name, args):
    """Dispatch a single tool call by name."""
    if name == "get_weather":
        return await fetch_weather(args["city"])
    if name == "get_time":
        return await fetch_time(args["city"])
    return {"error": f"unknown tool {name}"}


async def run_agent_parallel(user_msg, max_steps=4):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        resp = await client.messages.create(
            model="claude-sonnet-4-6", max_tokens=800,
            tools=TOOLS, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        if resp.stop_reason != "tool_use":
            # No tools requested (end_turn, max_tokens, ...): return the text.
            return "".join(b.text for b in resp.content if b.type == "text")
        # GATHER all tool_use blocks
        tool_blocks = [b for b in resp.content if b.type == "tool_use"]
        # EXECUTE in parallel (gather keeps results in argument order)
        results = await asyncio.gather(
            *[execute_tool(b.name, b.input) for b in tool_blocks]
        )
        # RETURN all in one message
        messages.append({
            "role": "user",
            "content": [
                {"type": "tool_result", "tool_use_id": b.id, "content": json.dumps(r)}
                for b, r in zip(tool_blocks, results)
            ],
        })
    return None  # max_steps exhausted without a final answer

# Compare:
# - sequential: 5 tool calls × 800ms each = 4s tool latency
# - parallel: max of five 800ms calls ≈ 800ms tool latency
# - total query latency drops from ~7s to ~3s on multi-tool queries

python3 main.py
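One caveat with the loop above: a plain `asyncio.gather` propagates the first exception, so a single failing tool loses the whole batch of results. A hedged sketch of one way around this follows, using `return_exceptions=True` and the `is_error` field of a tool_result block. `flaky_tool` and the city names are hypothetical stand-ins for real tools.

```python
import asyncio
import json


async def flaky_tool(city):
    """Hypothetical tool that fails for one input."""
    if city == "Atlantis":
        raise ValueError("unknown city")
    await asyncio.sleep(0.01)
    return {"city": city, "temp_c": 20}


async def gather_tool_results(calls):
    """calls is a list of (tool_use_id, city); returns tool_result blocks."""
    # return_exceptions=True puts raised exceptions in the result list
    # instead of aborting the whole gather.
    outcomes = await asyncio.gather(
        *(flaky_tool(city) for _, city in calls),
        return_exceptions=True,
    )
    blocks = []
    for (tool_use_id, _), outcome in zip(calls, outcomes):
        block = {"type": "tool_result", "tool_use_id": tool_use_id}
        if isinstance(outcome, BaseException):
            # Report the failure to the model rather than crashing the turn.
            block["content"] = f"{type(outcome).__name__}: {outcome}"
            block["is_error"] = True
        else:
            block["content"] = json.dumps(outcome)
        blocks.append(block)
    return blocks
```

Every tool_use block still gets a matching tool_result this way, and the model can decide whether to retry the failed call or answer with what it has.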