Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Generation has to stop somewhere, and how it stops is a serving concern with cost and correctness implications. The model stops on an end-of-sequence token, a max-token limit, or a custom stop string, and each has gotchas. A too-low max_tokens truncates answers mid-sentence (reported as finish_reason: length in OpenAI-style APIs, or stop_reason: max_tokens in the Anthropic API the demo uses); a missing stop sequence lets the model ramble, wasting tokens and money; a stop string that also appears in legitimate output cuts answers short. Controlling stop conditions well is how you bound cost per request and avoid the truncation failures from the error taxonomy.
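To make the cost bound concrete: because output can never exceed max_tokens, the worst-case price of a request follows directly from the input size and the cap. The sketch below is only arithmetic; the function name and the per-million-token prices are hypothetical placeholders, not real pricing.

```python
def worst_case_cost(input_tokens: int, max_tokens: int,
                    usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Upper bound on one request's cost, assuming the model uses the full output budget."""
    return (input_tokens * usd_per_mtok_in + max_tokens * usd_per_mtok_out) / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
print(f"{worst_case_cost(1000, 500, 3.0, 15.0):.4f}")   # 0.0105
print(f"{worst_case_cost(1000, 4000, 3.0, 15.0):.4f}")  # 0.0630
```

Raising max_tokens from 500 to 4000 raises the ceiling sixfold in this example, even though most requests will stop well before the cap.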
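The demo exercises the max-token limit and a custom stop string, and prints the field that tells you which mechanism fired: stop_reason in the Anthropic SDK (finish_reason in OpenAI-style APIs). A natural end-of-sequence stop is reported as end_turn. That field is essential for telling truncation, where you raise max_tokens or continue, apart from clean completion.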
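One way to act on that field is a small classifier that maps it to a handling decision. This is a minimal sketch: the function name and the returned labels are made up for illustration, and the string values are the stop_reason values the Anthropic Messages API reports. Anything not listed falls through to "other" and is left for the caller to handle.

```python
def classify_stop(stop_reason: str) -> str:
    """Map a stop_reason string to a coarse handling decision."""
    if stop_reason in ("end_turn", "stop_sequence"):
        return "complete"   # model finished on its own or hit a configured stop string
    if stop_reason == "max_tokens":
        return "truncated"  # output was cut off: raise the limit, continue, or flag as partial
    return "other"          # any other value: leave it to the caller


print(classify_stop("end_turn"))       # complete
print(classify_stop("stop_sequence"))  # complete
print(classify_stop("max_tokens"))     # truncated
```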
Use these three in order. Each builds on the one before.
How does LLM generation decide to stop, and what are the three stop mechanisms?
Explain end-of-sequence tokens, max_tokens limits, and stop sequences, and how finish_reason tells me which fired and what to do about it.
Design robust stop-condition handling for a production endpoint: choosing max_tokens, safe stop sequences, detecting and handling truncation (continue vs. surface partial), and bounding per-request cost.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

r = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=20,  # deliberately tiny
    messages=[{"role": "user", "content": "List 50 country capitals."}],
)
print("stop_reason:", r.stop_reason)  # 'max_tokens' -> the answer was TRUNCATED
# Handle it: raise max_tokens, or continue generation, or tell the user it's partial.

r2 = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=500,
    stop_sequences=["\n\nEND"],  # custom stop string
    messages=[{"role": "user", "content": "Write a haiku then output \n\nEND"}],
)
print("stop_reason:", r2.stop_reason)  # 'stop_sequence'

Run it with: python3 main.py
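The "continue generation" option from the demo comment can be sketched as a bounded loop: keep calling while the stop reason is max_tokens, and give up after a fixed number of rounds so cost stays capped. Everything here is an assumption for illustration: generate is a hypothetical callable that takes the text so far and returns (new_text, stop_reason), and fake_generate stands in for a real API call so the sketch runs offline.

```python
from typing import Callable, Tuple

# Hypothetical interface: text generated so far -> (next chunk, stop_reason).
GenerateFn = Callable[[str], Tuple[str, str]]


def generate_with_continuation(generate: GenerateFn, max_rounds: int = 3) -> Tuple[str, bool]:
    """Call `generate` until it stops for a reason other than 'max_tokens',
    or until max_rounds calls have been made. Returns (text, truncated)."""
    text = ""
    for _ in range(max_rounds):
        chunk, stop_reason = generate(text)
        text += chunk
        if stop_reason != "max_tokens":
            return text, False  # clean stop
    return text, True  # still truncated after max_rounds: surface as partial


def fake_generate(so_far: str) -> Tuple[str, str]:
    """Offline stand-in for an API call: emits at most 10 characters per call."""
    full = "Paris, Tokyo, Cairo, Lima"
    chunk = full[len(so_far):len(so_far) + 10]
    done = len(so_far) + len(chunk) >= len(full)
    return chunk, ("end_turn" if done else "max_tokens")


print(generate_with_continuation(fake_generate))                # ('Paris, Tokyo, Cairo, Lima', False)
print(generate_with_continuation(fake_generate, max_rounds=2))  # ('Paris, Tokyo, Cairo,', True)
```

With a real endpoint, one approach is for generate to send the partial text back as context for the next call. Whether and how a given model supports that is worth checking against the provider's documentation. The max_rounds cap matters either way: without it, a prompt that never finishes would keep spending tokens.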