Agent Inspector

Pattern 01

Confidence Indicators.

Agents act on probabilistic outputs, but most UIs present them as certainties. Users learn to either over-trust or ignore the system entirely.

Demo

What's the deadline for the Acme contract renewal?

The Acme contract renewal deadline is March 15, 2026. The original contract was signed on March 15, 2023 with a 3-year term.


Why probability needs a face.

Without a confidence signal, users fill the gap themselves, and they fill it badly. One wrong answer at high stakes is enough to poison the well. The user stops acting on any output, including the correct ones, because nothing on screen tells them which is which. The agent becomes a suggestion box, not a collaborator.

The UI problem is rarely raw accuracy. It is honest correspondence between what the model knows and what the interface claims. We treat confidence as the agent's own metacognitive report about how well its stated certainty lines up with outcomes it can defend. When that report is honest, people learn when to move and when to pause.

The goal is calibrated trust, not high trust.

A category is more honest than a number. A label like "high" is a coarse signal, but a coarse signal is what the underlying model can actually support. Recent evaluation work shows that numeric confidence scores drift with prompting, model version, and decoding settings in ways the user cannot see. Three categorical tiers (high, medium, low) survive those shifts and align more closely with how people actually decide. We preserve the underlying score on hover for users who want to inspect it. The headline stays semantic.
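As a minimal sketch of that split between headline and detail, the function below maps a raw score to one of the three tiers and keeps the score for the detail view. The thresholds (0.85 and 0.6), the assumed 0 to 1 score range, and the names `toTieredConfidence` and `TieredConfidence` are illustrative assumptions, not values taken from the pattern.

```typescript
type ConfidenceTier = "high" | "medium" | "low"

// Hypothetical cut-offs for illustration only. Calibrate them per model and task.
const HIGH_THRESHOLD = 0.85
const MEDIUM_THRESHOLD = 0.6

interface TieredConfidence {
  tier: ConfidenceTier // the semantic headline shown on the chip
  score: number // the raw value, kept for the detail view only
}

function toTieredConfidence(score: number): TieredConfidence {
  // Assumes scores live in the range 0 to 1. Non-finite input falls back to 0
  // so a broken score can never render as "high".
  const clamped = Number.isFinite(score) ? Math.min(1, Math.max(0, score)) : 0
  const tier: ConfidenceTier =
    clamped >= HIGH_THRESHOLD ? "high" : clamped >= MEDIUM_THRESHOLD ? "medium" : "low"
  return { tier, score: clamped }
}
```

Small movements in the score from a model or prompt change then alter the headline only when they cross a threshold, which is the stability the categorical display is after.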

Source attribution is what turns a category into evidence. "High confidence" alone is an assertion. "High confidence, grounded in three named systems" is something the user can verify. Confidence without attribution is opacity with a label. The display we ship treats attribution as the primary trust signal and the category as the secondary read.

Anatomy

What each element does.

Inline citations, the confidence chip, source disclosure, and message actions form one metadata system on each assistant turn.


  1. Inline citation

    In the message body, after each grounded claim

    Numbered [1] pills appear inside prose, anchored to the exact claim they support. Click to reveal the source title, type, and an exact quote. Citations render only when the pattern is on. Off mode shows the prose without them so the baseline stays honest about what the model produced before grounding.

  2. Confidence indicator

    Metadata strip, left edge

    A semantic chip in the metadata strip. Color encodes the calibrated confidence tier, green for high, amber for medium, red for low. Click the chip for the tier rationale plus how many sources corroborate the answer.

  3. Source disclosure

    Metadata strip, immediately right of the chip

    A muted button labeled "3 sources" (the count is dynamic) with a rotating chevron. Clicking it expands a row of pills, one per system the agent consulted. Each pill shows the system icon and the source title. A pill carries a status indicator only when its source is not independently confirmed, in which case it renders with dimmed text and a colored dot. Verified sources render at full opacity without an indicator.

  4. Message actions

    Metadata strip, right side

    Four icon-only controls on the right of the metadata strip: copy, regenerate, helpful, not helpful. Always visible because hidden feedback controls go unused. Each control surfaces its label on hover via tooltip.
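One way to model the four elements as a single metadata object per assistant turn is sketched below. Every type and function name here (`Citation`, `Source`, `TurnMetadata`, `sourceButtonLabel`, `pillsNeedingIndicator`, `visibleCitations`) is hypothetical and chosen for illustration. The rules they encode come from the descriptions above.

```typescript
type ConfidenceTier = "high" | "medium" | "low"

interface Citation {
  index: number // the number shown in the [n] pill
  sourceId: string // which source backs the claim
  quote: string // exact quote revealed on click
}

interface Source {
  id: string
  system: string // system of record, e.g. "Notion", "CRM", "Email"
  title: string
  verified: boolean // false means dimmed text plus a colored status dot
}

interface TurnMetadata {
  citations: Citation[]
  confidence: { tier: ConfidenceTier; reasoning: string }
  sources: Source[]
}

// Label for the source disclosure button. The count is dynamic.
function sourceButtonLabel(sources: Source[]): string {
  const n = sources.length
  return `${n} ${n === 1 ? "source" : "sources"}`
}

// Only sources that are not independently confirmed carry a status indicator.
function pillsNeedingIndicator(sources: Source[]): Source[] {
  return sources.filter((s) => !s.verified)
}

// Citations render only when the pattern is on. Off mode shows bare prose.
function visibleCitations(meta: TurnMetadata, patternOn: boolean): Citation[] {
  return patternOn ? meta.citations : []
}
```

Keeping all four elements on one object per turn mirrors the idea that they form a single metadata system, not four independent widgets.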

Edge Cases

What this pattern doesn't solve.

Four failure modes that exist outside this pattern's scope. Each requires its own treatment.

01

Failure

When confidence is wrong

A trusted-looking excerpt can back an incorrect answer. The chip tracks how well the answer matches what was retrieved. It does not test whether the retrieved content was itself wrong because of upstream storage or indexing errors at the corpus layer. Broader ingestion health lives outside this pattern.

02

Failure

When sources within a turn conflict

One source says March 15. Another says April 2. The agent returns a majority answer and the minority signal disappears from the per-message metadata. Surfacing intra-turn disagreement requires a different display layer that this pattern does not own.
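To make the lost signal concrete, here is a hedged sketch of a majority pick that keeps the dissenting sources instead of discarding them. It is not part of this pattern. The names `SourceClaim`, `ClaimSummary`, and `summarizeClaims` are hypothetical, and the sketch assumes each source's answer has already been normalized to a comparable string.

```typescript
interface SourceClaim {
  sourceId: string
  value: string // normalized answer, e.g. an ISO date string
}

interface ClaimSummary {
  majority: string | null // null when there are no claims at all
  dissent: SourceClaim[] // claims that disagree with the majority value
}

function summarizeClaims(claims: SourceClaim[]): ClaimSummary {
  // Count how many sources back each distinct value.
  const counts = new Map<string, number>()
  for (const claim of claims) {
    counts.set(claim.value, (counts.get(claim.value) ?? 0) + 1)
  }

  // Pick the most common value. On a tie, the value seen first wins.
  let majority: string | null = null
  let best = 0
  for (const claim of claims) {
    const count = counts.get(claim.value) ?? 0
    if (count > best) {
      majority = claim.value
      best = count
    }
  }

  // Keep every claim that disagrees so a display layer could surface it.
  const dissent = claims.filter((claim) => claim.value !== majority)
  return { majority, dissent }
}
```

A non-empty `dissent` array is the minority signal that the per-message metadata currently drops.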

03

Failure

When the user disagrees with a turn

The user knows a specific turn's answer is wrong, but the metadata strip still reads "high confidence" because the agent was confident in a wrong source. The thumbs-down control captures the disagreement signal but does not propagate it to the upstream sources. Source-level correction is out of scope for this pattern. It belongs to a feedback loop layer.

04

Failure

When confidence drops mid-task

A multi-step task starts on solid ground and degrades across turns. The thread shows the change tier-by-tier, but the agent does not surface the trend proactively or warn the user that the final result now rests on the later, uncertain steps as much as on the earlier, confident ones. Cross-turn confidence summarization is a separate pattern.
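For illustration only, a cross-turn summary could start from something as small as the sketch below, which scans the per-turn tiers in order and reports whether the thread ended lower than it began. The names `TIER_RANK`, `TrendSummary`, and `summarizeTrend` are hypothetical, and the separate pattern may define the trend quite differently.

```typescript
type ConfidenceTier = "high" | "medium" | "low"

// Ordering used to compare tiers numerically.
const TIER_RANK: Record<ConfidenceTier, number> = { low: 0, medium: 1, high: 2 }

interface TrendSummary {
  degraded: boolean // true when the last turn ranks below the first
  firstDropIndex: number | null // first turn whose tier is below the previous turn's
  lowest: ConfidenceTier | null // weakest tier seen anywhere in the thread
}

function summarizeTrend(tiers: ConfidenceTier[]): TrendSummary {
  let first: ConfidenceTier | null = null
  let prev: ConfidenceTier | null = null
  let lowest: ConfidenceTier | null = null
  let firstDropIndex: number | null = null
  let index = 0

  for (const tier of tiers) {
    if (first === null) first = tier
    // Record only the first point where confidence steps down.
    if (prev !== null && firstDropIndex === null && TIER_RANK[tier] < TIER_RANK[prev]) {
      firstDropIndex = index
    }
    if (lowest === null || TIER_RANK[tier] < TIER_RANK[lowest]) lowest = tier
    prev = tier
    index++
  }

  // After the loop, prev holds the last turn's tier (or null for an empty thread).
  const degraded = first !== null && prev !== null && TIER_RANK[prev] < TIER_RANK[first]
  return { degraded, firstDropIndex, lowest }
}
```

A proactive warning could key off `degraded` and point the user at the turn in `firstDropIndex`, though when and how to interrupt is a design question this sketch does not answer.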

Implementation

The ConfidenceChip primitive.

The chip is the smallest display unit. It accepts a confidence object: a tier, a reasoning string, and the model's internal score. The dot encodes the recommended action. The label encodes the tier in plain language. The internal score is held in reserve and surfaced only on hover, alongside the reasoning, so the numeric signal is available for inspection without competing with the categorical headline.

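// Assumption: the Popover primitives come from a shadcn/ui-style wrapper.
// Adjust the import path to match your project.
import { Popover, PopoverContent, PopoverTrigger } from "@/components/ui/popover"

type ConfidenceTier = "high" | "medium" | "low"

interface ConfidenceData {
  tier: ConfidenceTier
  reasoning: string
  // The model's internal score, shown only in the detail view. Optional so
  // callers without a score still render a valid chip.
  score?: number
}

// Dot color per tier: green for high, amber for medium, red for low.
const TIER_DOT: Record<ConfidenceTier, string> = {
  high: "#1F8B4C",
  medium: "#C8881C",
  low: "#C42929",
}

// Plain-language label shown next to the dot.
const TIER_LABEL: Record<ConfidenceTier, string> = {
  high: "High confidence",
  medium: "Medium confidence",
  low: "Low confidence",
}

export function ConfidenceChip({ confidence }: { confidence: ConfidenceData }) {
  return (
    <Popover>
      <PopoverTrigger className="inline-flex items-center gap-1.5 text-[12px]">
        {/* Decorative dot: the text label carries the meaning for assistive tech. */}
        <span
          aria-hidden
          className="size-1.5 rounded-full"
          style={{ backgroundColor: TIER_DOT[confidence.tier] }}
        />
        <span style={{ color: "var(--text-muted)" }}>
          {TIER_LABEL[confidence.tier]}
        </span>
      </PopoverTrigger>
      <PopoverContent side="top" className="max-w-[280px]">
        <p className="text-[13px]">{confidence.reasoning}</p>
        {/* The raw score stays out of the headline and appears only here. */}
        {confidence.score !== undefined && (
          <p className="text-[12px]" style={{ color: "var(--text-muted)" }}>
            Internal score: {confidence.score.toFixed(2)}
          </p>
        )}
      </PopoverContent>
    </Popover>
  )
}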

Trade-offs

Why this, and not that.

Five places where the obvious choice and the right choice diverged.

| Decision | Considered | Reasoning |
| --- | --- | --- |
| Categorical tier vs raw percentage | High/Med/Low chips, 5-star rating, raw probability, slider | Numeric confidence drifts with model version and prompting in ways the user cannot see. Categorical tiers survive that drift and align with how people actually decide: act, verify, or confirm. The raw numeric value stays in the chip detail panel for users who open it. |
| Inline metadata strip vs separate confidence panel | Sidebar widget, modal detail view, sticky footer bar | A separate panel decouples the signal from the claim it qualifies. The user has to look in two places. Inline keeps confidence attached to the answer it describes and removes one step from the verify path. |
| Named source pills vs verified badge | "Verified" badge only, source count alone, full document title with link | A badge tells the user the system is sure of itself. A named source tells the user where to verify. In enterprise agents, sources are systems of record (Notion, CRM, Email). Naming them lets the user open the right tool, not a generic verification page. |
| Reasoning on hover vs reasoning in row | Inline grounding sentence, tooltip-only, expandable detail | The reasoning matters for users who want to inspect the call, but it adds a full extra row of vertical space for users who do not. Keeping that detail behind the chip keeps the row compact while preserving the signal for anyone who opens it. |
| Three semantic tiers vs two | Two tiers (act / verify), full gradient, percentage only, traffic-light | Two tiers force "verify" and "do not act" into the same bucket. Three separates the case where the agent is confident but the user should sanity-check (medium) from the case where the agent itself is unsure (low). Three tiers map to three distinct user actions without adding cognitive load. |