Open this lesson in your favourite AI. It'll walk you through the why, explain the demo, and quiz you on the try-it list.
Sampling throws away information: it collapses a whole distribution over the vocabulary into one chosen token. Logprobs give that information back — the log-probability the model assigned to the token it emitted, plus the top-k alternatives it considered. At serve time this unlocks three things builders keep reaching for: confidence signals (low logprobs flag hallucination-prone spans), guided or constrained decoding (score candidate continuations without re-generating), and cheap evals (measure perplexity or exact-answer probability instead of grading text). Knowing how to ask for logprobs, what they cost, and how top-k widens the payload turns a black-box API into an instrument you can read.
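To make the confidence-signal and cheap-eval uses concrete, here is a minimal sketch of how per-token logprobs could be turned into a perplexity score and a list of low-confidence tokens. The helper names (`perplexity`, `low_confidence_spans`), the 0.5 threshold, and the sample logprobs are all illustrative assumptions, not values from a real model or part of any API.

```python
import math

def perplexity(logprobs):
    """Perplexity is exp of the negative mean token logprob."""
    return math.exp(-sum(logprobs) / len(logprobs))

def low_confidence_spans(tokens, logprobs, threshold=0.5):
    """Return (token, probability) pairs whose probability is below threshold.

    The 0.5 default is an arbitrary illustrative cut-off; tune it per task.
    """
    flagged = []
    for tok, lp in zip(tokens, logprobs):
        p = math.exp(lp)  # logprob -> probability
        if p < threshold:
            flagged.append((tok, round(p, 3)))
    return flagged

# Made-up logprobs for illustration only, not real model output.
tokens = ["The", " capital", " is", " Lyon"]
logprobs = [-0.01, -0.05, -0.02, -1.90]

print("perplexity:", round(perplexity(logprobs), 3))
print("flagged:", low_confidence_spans(tokens, logprobs))
```

A single low-probability token raises perplexity noticeably on a short sequence, so in practice it may be more useful to look at flagged spans than at one aggregate number.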
The demo requests logprobs from an OpenAI-compatible endpoint (works against hosted APIs and against vLLM's server) and prints each token with its probability and the top alternatives the model weighed. Watch how confident tokens sit near probability 1.0 while genuinely uncertain spans spread mass across several candidates.
import math

from openai import OpenAI

client = OpenAI()  # or base_url="http://localhost:8000/v1" for a vLLM server

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=8,
    logprobs=True,    # return the chosen token's logprob
    top_logprobs=5,   # plus the 5 highest-probability alternatives
)

for tok in resp.choices[0].logprobs.content:
    p = math.exp(tok.logprob)  # logprob -> probability
    alts = [(a.token, round(math.exp(a.logprob), 3)) for a in tok.top_logprobs]
    print(f"{tok.token!r:12} p={p:.3f} alternatives={alts}")
# Low p or a flat 'alternatives' distribution = the model was unsure there.

python3 main.py

Use these three in order. Each builds on the one before.
What are token logprobs and top-k logprobs from an LLM, and what can I use them for at serve time?
Walk me through how the serving engine turns the final logit distribution into a logprob for the chosen token plus top_logprobs alternatives, and why requesting a larger top-k costs more.
I want to use logprobs for confidence scoring, guided decoding, and cheap evals. Explain how each use consumes logprobs, the accuracy/cost trade-offs of top-k, and the pitfalls of interpreting logprobs across temperature settings.
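As a preview of what the second and third prompts ask about, the sketch below shows the standard math for turning logits into logprobs: a numerically stable log-softmax, followed by picking the k highest entries. It is a toy illustration, not the code of any particular serving engine; the four-token vocabulary and the logit values are made up, and whether an engine reports logprobs before or after temperature scaling is something to check in its documentation.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Convert raw logits to log-probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max before exp to avoid overflow
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def top_k(logprobs, vocab, k):
    """Return the k (token, logprob) pairs with the highest logprob."""
    ranked = sorted(zip(vocab, logprobs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy vocabulary and made-up logits, for illustration only.
vocab = [" Paris", " Lyon", " the", " a"]
logits = [5.0, 2.0, 1.0, 0.5]

for t in (1.0, 2.0):
    lps = log_softmax(logits, temperature=t)
    best = top_k(lps, vocab, k=2)
    print(f"T={t}:", [(tok, round(math.exp(lp), 3)) for tok, lp in best])
```

The same logits give a flatter distribution at the higher temperature, which is why a fixed probability threshold does not transfer cleanly between temperature settings. The returned list also grows with k, which is the payload widening mentioned earlier.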
Working within free-tier limits. Free / low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Survive it: read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter + a max-retry cap) instead of hammering; cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling; cache identical requests so retries don't re-spend quota; downshift to a smaller/cheaper model for practice runs; use the provider Batch API for non-interactive jobs; or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
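The backoff advice above can be sketched as a small retry wrapper. This is one possible shape, under stated assumptions: `RateLimited` is a hypothetical stand-in for whatever exception your client raises on a 429, `call_with_backoff` is an illustrative helper name, and the `base`, `cap`, and `max_retries` defaults are arbitrary. The demo passes a stub in place of `time.sleep` so it runs instantly.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical stand-in for a provider's 429 error."""
    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # seconds from Retry-After, if the server sent one

def call_with_backoff(fn, max_retries=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn(); on RateLimited, wait and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_retries:
                raise  # retry budget spent: surface the error
            if err.retry_after is not None:
                delay = err.retry_after  # trust the server's hint when present
            else:
                # Exponential backoff with full jitter, capped at `cap` seconds.
                delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)

# Demo: a fake call that is rate-limited twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"

delays = []  # record the waits instead of actually sleeping
print(call_with_backoff(flaky, sleep=delays.append), "after", len(delays), "retries")
```

Jitter matters when several workers hit the limit at once: without it they all retry at the same instant and collide again.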
When the model call fails. Read the error and decide: fix the request, retry, or fall back.
400/422 (bad params, context-length exceeded), 401/403 (auth / no access to that model), and 404 (wrong model id) are fatal: fix the request and don't retry. 429, 500/502/503, Anthropic 529 (overloaded), and timeouts are transient: retry with backoff. Watch for non-HTTP failures too: finish_reason: "length" truncation (raise max_tokens or continue), safety refusals, malformed JSON / failed tool-call parsing (validate against a schema and repair-retry), and mid-stream disconnects. Always log the provider request id with the error so you can trace it later.
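The triage above could be encoded as a small decision function. This is a sketch only: the `decide` helper, its return labels, and the choice to send unlisted error codes to manual inspection are assumptions for illustration, and the status codes are simply the ones listed above.

```python
# Status codes taken from the triage list above.
FATAL = {400, 401, 403, 404, 422}       # fix the request, don't retry
TRANSIENT = {429, 500, 502, 503, 529}   # retry with backoff

def decide(status=None, timed_out=False, finish_reason=None):
    """Map a call outcome to an action: 'fix', 'retry', 'continue', 'inspect', or 'ok'."""
    if timed_out:
        return "retry"
    if status in FATAL:
        return "fix"
    if status in TRANSIENT:
        return "retry"
    if status is not None and status >= 400:
        return "inspect"  # assumption: unlisted error codes get a human look
    if finish_reason == "length":
        return "continue"  # truncated output: raise max_tokens or continue
    return "ok"

for outcome in (
    {"status": 404},
    {"status": 529},
    {"timed_out": True},
    {"status": 200, "finish_reason": "length"},
    {"status": 200, "finish_reason": "stop"},
):
    print(outcome, "->", decide(**outcome))
```

Keeping the mapping in one place makes it easier to log the decision alongside the provider request id.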