Logprobs at serve time

Difficulty: hard

Why this matters

Sampling throws away information: it collapses a whole distribution over the vocabulary into one chosen token. Logprobs give part of that information back: the log-probability the model assigned to each token it emitted, plus the top-k most likely tokens at that position. At serve time this unlocks three things builders keep reaching for: confidence signals (low logprobs can flag spans where the model was unsure, which is often, though not always, where hallucinations appear), guided or constrained decoding (score candidate continuations without regenerating them), and cheap evals (measure perplexity or the probability of an exact answer instead of grading text). Knowing how to ask for logprobs, what they cost, and how a larger top-k widens the payload turns a black-box API into an instrument you can read.
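To make "the distribution" concrete, here is a minimal sketch of how logits become logprobs and how the top-k entries are picked. The four-token vocabulary and the logit values are invented for illustration; a real serving engine does the same arithmetic on tensors over the full vocabulary, and its internals may differ from this toy version.

```python
import math

def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def top_k_logprobs(vocab, logits, k):
    """Return the k most likely (token, logprob) pairs, highest first."""
    logprobs = log_softmax(logits)
    ranked = sorted(zip(vocab, logprobs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and made-up logits, for illustration only.
vocab = [" Paris", " Lyon", " the", " a"]
logits = [6.0, 2.0, 1.0, 0.5]

top = top_k_logprobs(vocab, logits, k=2)
for token, lp in top:
    print(f"{token!r:10} logprob={lp:.3f}  p={math.exp(lp):.3f}")
```

Everything outside the top k is simply not returned, which is why logprobs recover only part of the distribution.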

Demo

The demo requests logprobs from an OpenAI-compatible endpoint (a hosted API or vLLM's server, provided the endpoint and model support logprobs) and prints each token with its probability and the top candidates at that position. Watch how confident tokens sit near probability 1.0 while genuinely uncertain spans spread mass across several candidates.

import math

from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="http://localhost:8000/v1") for a vLLM server

resp = client.chat.completions.create(
    model="gpt-4o-mini",      # for vLLM, use the name of the model being served
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=8,
    logprobs=True,            # return each chosen token's logprob
    top_logprobs=5,           # plus the 5 highest-probability tokens at each position
)

# logprobs can come back as None if the endpoint or model does not support them.
token_logprobs = resp.choices[0].logprobs
if token_logprobs is None or token_logprobs.content is None:
    raise SystemExit("No logprobs returned by this endpoint/model.")

for tok in token_logprobs.content:
    p = math.exp(tok.logprob)                     # logprob -> probability
    alts = [(a.token, round(math.exp(a.logprob), 3)) for a in tok.top_logprobs]
    print(f"{tok.token!r:12} p={p:.3f}  alternatives={alts}")

# Low p or a flat 'alternatives' distribution = the model was unsure there.

Run: python3 main.py (requires the openai package and, for the hosted API, an OPENAI_API_KEY in the environment)

Try it yourself

Ask a factual question and confirm the answer tokens have probability near 1.0, then ask an ambiguous one and watch the mass spread across several top_logprobs.

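Raise top_logprobs from 5 to 20 (the documented maximum for the OpenAI Chat Completions API) and observe the response payload grow: every generated token now carries up to 20 candidate entries, so the payload scales roughly with top-k times output length, and top-k has a real cost.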

Compute the average logprob over the completion (perplexity is the exponential of the negative average) and use it to rank two candidate answers without regenerating either.

Flag any token whose probability drops below 0.3 as a low-confidence span and inspect whether those spans correlate with errors.
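The scoring exercises above can be sketched with a few small helpers. This is a minimal sketch, not a calibrated confidence method: the function names are hypothetical, the two candidate answers use made-up probabilities, and the 0.3 threshold is just the one suggested in the exercise.

```python
import math

def mean_logprob(logprobs):
    """Average per-token logprob; higher (closer to 0) means more confident."""
    return sum(logprobs) / len(logprobs)

def perplexity(logprobs):
    """Perplexity is exp of the negative mean logprob; 1.0 is the minimum."""
    return math.exp(-mean_logprob(logprobs))

def low_confidence_tokens(tokens, logprobs, threshold=0.3):
    """Return (index, token, probability) for tokens with probability below threshold."""
    flagged = []
    for i, (token, lp) in enumerate(zip(tokens, logprobs)):
        p = math.exp(lp)
        if p < threshold:
            flagged.append((i, token, p))
    return flagged

# Made-up candidates for illustration; real values come from the API response.
answer_a = {"tokens": [" Paris", "."], "logprobs": [math.log(0.95), math.log(0.9)]}
answer_b = {"tokens": [" Lyon", "."], "logprobs": [math.log(0.2), math.log(0.8)]}

best = max([answer_a, answer_b], key=lambda a: mean_logprob(a["logprobs"]))
flags = low_confidence_tokens(answer_b["tokens"], answer_b["logprobs"])

print("best answer:", "".join(best["tokens"]))
print("perplexity A:", round(perplexity(answer_a["logprobs"]), 3))
print("perplexity B:", round(perplexity(answer_b["logprobs"]), 3))
print("low-confidence tokens in B:", flags)
```

With the demo's response, the inputs would be [t.token for t in resp.choices[0].logprobs.content] and [t.logprob for t in resp.choices[0].logprobs.content]. Averaging normalizes for length, but scores are only comparable for the same prompt and model.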

Prompt your AI

Use these three in order. Each builds on the one before.

1. Basics & terminology

What are token logprobs and top-k logprobs from an LLM, and what can I use them for at serve time?

2. Why it works (the mechanism)

Walk me through how the serving engine turns the final logit distribution into a logprob for the chosen token plus top_logprobs alternatives, and why requesting a larger top-k costs more.

3. Advanced — application & what's next

I want to use logprobs for confidence scoring, guided decoding, and cheap evals. Explain how each use consumes logprobs, the accuracy/cost trade-offs of top-k, and the pitfalls of interpreting logprobs across temperature settings.

References

Working within free-tier limits. Free / low-tier provider keys rate-limit aggressively, and eval or agent loops that fan out calls will hit 429 Too Many Requests fast. Survive it: read Retry-After and the x-ratelimit-* headers and back off (exponential backoff with jitter + a max-retry cap) instead of hammering; cap in-flight requests with a small concurrency limiter so you stay under the RPM/TPM ceiling; cache identical requests so retries don't re-spend quota; downshift to a smaller/cheaper model for practice runs; use the provider Batch API for non-interactive jobs; or sidestep hosted limits entirely by running a small model locally (Ollama / llama.cpp) or on a free Colab/Kaggle GPU while you learn.
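A minimal sketch of the backoff advice, assuming a hypothetical RateLimited exception that stands in for whatever your SDK raises on a 429. It honors a Retry-After value given in seconds, otherwise waits a random amount up to an exponentially growing ceiling ("full jitter"), and gives up after a fixed number of retries. The default base, cap, and retry count are arbitrary choices.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical stand-in for an HTTP 429 raised by a provider SDK."""
    def __init__(self, retry_after=None):
        super().__init__("429 Too Many Requests")
        self.retry_after = retry_after  # raw Retry-After header value, if any

def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))  # server told us how long to wait
        except ValueError:
            pass  # the HTTP-date form of Retry-After is not handled in this sketch
    ceiling = min(cap, base * 2 ** attempt)      # exponential growth, capped
    return random.uniform(0.0, ceiling)          # full jitter

def call_with_retries(fn, max_retries=5, sleep=time.sleep):
    """Call fn(); on RateLimited, back off and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the error
            sleep(backoff_delay(attempt, err.retry_after))
```

Passing sleep as a parameter keeps the helper testable without real waiting. A concurrency limiter and a request cache would sit around this wrapper rather than inside it.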

When the model call fails. Read the error and decide: fix the request, retry, or fall back. 400/422 (bad params, context-length exceeded), 401/403 (auth / no access to that model), 404 (wrong model id) are fatal — fix and don't retry. 429, 500/502/503, Anthropic 529 (overloaded), and timeouts are transient — retry with backoff. Watch for non-HTTP failures too: finish_reason: "length" truncation (raise max_tokens or continue), safety refusals, malformed JSON / failed tool-call parsing (validate against a schema and repair-retry), and mid-stream disconnects. Always log the provider request id with the error so you can trace it later.
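The fix/retry split above can be written down as a small lookup. This is a sketch under the assumption that your client exposes the HTTP status code and the finish_reason; the function names and the "fix" / "retry" / "unknown" labels are hypothetical.

```python
# Status codes grouped as in the text above.
FATAL_STATUSES = {400, 401, 403, 404, 422}      # fix the request, do not retry
TRANSIENT_STATUSES = {429, 500, 502, 503, 529}  # retry with backoff

def classify_failure(status=None, timed_out=False):
    """Return 'retry', 'fix', or 'unknown' for a failed model call."""
    if timed_out or status in TRANSIENT_STATUSES:
        return "retry"
    if status in FATAL_STATUSES:
        return "fix"
    return "unknown"  # not covered by the text: log it and decide by hand

def is_truncated(finish_reason):
    """True when the response was cut off by the max_tokens limit."""
    return finish_reason == "length"

print(classify_failure(status=404))      # wrong model id: fix
print(classify_failure(status=529))      # overloaded: retry
print(classify_failure(timed_out=True))  # timeout: retry
print(is_truncated("length"))
```

Safety refusals, malformed JSON, and mid-stream disconnects do not show up as status codes, so they need their own checks on the response body.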