Pattern 01
Agents act on probabilistic outputs, but most UIs present them as certainties. Users learn to either over-trust or ignore the system entirely.
Demo
The Acme contract renewal deadline is March 15, 2026. The original contract was signed on March 15, 2023 with a 3-year term.
Without a confidence signal, users fill the gap themselves, and they fill it badly. One wrong answer at high stakes is enough to poison the well. The user stops acting on any output, including the correct ones, because nothing on screen tells them which is which. The agent becomes a suggestion box, not a collaborator.
The UI problem is rarely raw accuracy. It is honest correspondence between what the model knows and what the interface claims. We treat confidence as the agent's own metacognitive report about how well its stated certainty lines up with outcomes it can defend. When that report is honest, people learn when to move and when to pause.
The goal is calibrated trust, not high trust.
A category is more honest than a number. A label like "high" is a coarse signal, but a coarse signal is what the underlying model can actually support. Recent evaluation work shows that numeric confidence scores drift with prompting, model version, and decoding settings in ways the user cannot see. Three categorical tiers (high, medium, low) survive those shifts and align more closely with how people actually decide. We preserve the underlying score in the chip's detail panel for users who want to inspect it. The headline stays semantic.
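One way to collapse a raw score into the three tiers is a pair of cut points. The sketch below is illustrative only: the threshold values, the `TierThresholds` shape, and the `toTier` helper are assumptions, not part of the shipped pattern, and real cut points would need to be tuned against observed outcomes.

```typescript
type ConfidenceTier = "high" | "medium" | "low"

// Hypothetical cut points. The pattern does not specify thresholds;
// they would need recalibrating whenever the model or prompt changes.
interface TierThresholds {
  high: number   // scores at or above this read as "high"
  medium: number // scores at or above this, and below `high`, read as "medium"
}

const DEFAULT_THRESHOLDS: TierThresholds = { high: 0.85, medium: 0.6 }

function toTier(score: number, t: TierThresholds = DEFAULT_THRESHOLDS): ConfidenceTier {
  // An unusable score falls back to the most cautious tier.
  if (!isFinite(score)) return "low"
  const clamped = Math.min(1, Math.max(0, score))
  if (clamped >= t.high) return "high"
  if (clamped >= t.medium) return "medium"
  return "low"
}

console.log(toTier(0.91)) // "high"
console.log(toTier(0.7))  // "medium"
```

Keeping the thresholds in one object means a recalibration changes a single value while the label the user sees stays stable.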
Source attribution is what turns a category into evidence. "High confidence" alone is an assertion. "High confidence, grounded in three named systems" is something the user can verify. Confidence without attribution is opacity with a label. The display we ship treats attribution as the primary trust signal and the category as the secondary read.
Anatomy
Inline citations, the confidence chip, source disclosure, and message actions form one metadata system on each assistant turn.
The Acme contract renewal deadline is March 15, 2026. The original contract was signed on March 15, 2023 with a 3-year term.
In the message body, after each grounded claim
Numbered [1] pills appear inside prose, anchored to the exact claim they support. Click to reveal the source title, type, and an exact quote. Citations render only when the pattern is on. Off mode shows the prose without them so the baseline stays honest about what the model produced before grounding.
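A minimal sketch of how the prose could be split into text and citation segments before rendering. It assumes the model emits markers in the literal form `[n]` and that sources arrive as an ordered list. The `Source` shape, the `segmentCitations` helper, and the sample source data are hypothetical.

```typescript
interface Source {
  title: string
  type: string
  quote: string
}

type Segment =
  | { kind: "text"; text: string }
  | { kind: "citation"; index: number; source: Source | undefined }

function segmentCitations(body: string, sources: Source[], patternOn: boolean): Segment[] {
  if (!patternOn) {
    // Off mode: drop the markers so the baseline shows plain prose.
    return [{ kind: "text", text: body.replace(/\s*\[\d+\]/g, "") }]
  }
  const marker = /\[(\d+)\]/g
  const segments: Segment[] = []
  let cursor = 0
  let match: RegExpExecArray | null
  while ((match = marker.exec(body)) !== null) {
    if (match.index > cursor) {
      segments.push({ kind: "text", text: body.slice(cursor, match.index) })
    }
    const index = Number(match[1])
    // Markers are 1-based; an unknown index leaves `source` undefined.
    segments.push({ kind: "citation", index, source: sources[index - 1] })
    cursor = match.index + match[0].length
  }
  if (cursor < body.length) segments.push({ kind: "text", text: body.slice(cursor) })
  return segments
}

// Made-up sample data for illustration only.
const demoSources: Source[] = [
  { title: "Renewal record", type: "CRM", quote: "Renewal deadline: March 15, 2026" },
  { title: "Signed contract", type: "Notion", quote: "Signed March 15, 2023, 3-year term" },
]
const demoSegments = segmentCitations(
  "The Acme contract renewal deadline is March 15, 2026 [1]. The original contract was signed on March 15, 2023 with a 3-year term [2].",
  demoSources,
  true,
)
console.log(demoSegments.length) // 5
```

Each citation segment carries its source, so the pill can reveal the title, type, and quote without a second lookup.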
Metadata strip, left edge
A semantic chip in the metadata strip. Color encodes the calibrated confidence tier, green for high, amber for medium, red for low. Click the chip for the tier rationale plus how many sources corroborate the answer.
Metadata strip, immediately right of the chip
A muted button labeled 3 sources (count is dynamic) with a rotating chevron. Click expands a row of pills, one per system the agent consulted. Each pill shows the system icon and the source title. Pills carry a status indicator only when the source is not independently confirmed, rendered with dimmed text and a colored dot. Verified sources render at full opacity without an indicator.
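The disclosure rules above reduce to a small view model. This is a sketch under assumptions: the `ConsultedSource` and `SourcePill` shapes and both helper names are hypothetical, and verification is modeled as a single boolean, which may be simpler than a real source status.

```typescript
interface ConsultedSource {
  system: string    // system of record, used to pick the icon
  title: string
  verified: boolean // true when the source is independently confirmed
}

interface SourcePill {
  icon: string
  title: string
  dimmed: boolean        // dimmed text for unconfirmed sources
  showStatusDot: boolean // colored dot only when unconfirmed
}

// Button label with a dynamic count, e.g. "3 sources".
function sourcesButtonLabel(count: number): string {
  return count === 1 ? "1 source" : count + " sources"
}

function toPill(source: ConsultedSource): SourcePill {
  const unconfirmed = !source.verified
  return {
    icon: source.system,
    title: source.title,
    dimmed: unconfirmed,
    showStatusDot: unconfirmed,
  }
}

console.log(sourcesButtonLabel(3)) // "3 sources"
```

Verified sources produce a pill with no indicator at all, so the exception is what draws the eye.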
Metadata strip, right side
Four icon-only controls on the right of the metadata strip: copy, regenerate, helpful, not helpful. Always visible because hidden feedback controls go unused. Each control surfaces its label on hover via tooltip.
Edge Cases
Four failure modes that exist outside this pattern's scope. Each requires its own treatment.
01
Failure
A trusted-looking excerpt backs an incorrect factual read. The chip tracks match quality for what was retrieved. It does not test whether upstream storage or indexing was wrong at the corpus layer. Broader ingestion health lives outside this pattern.
02
Failure
One source says March 15. Another says April 2. The agent returns a majority answer and the minority signal disappears from the per-message metadata. Surfacing intra-turn disagreement requires a different display layer that this pattern does not own.
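For illustration, the lost minority signal can be made explicit before the majority answer is chosen. The sketch below is not part of this pattern. The `SourceClaim` and `Agreement` shapes and the `summarizeAgreement` helper are hypothetical, and values are compared as exact strings, which a real system would need to normalize.

```typescript
interface SourceClaim {
  source: string
  value: string
}

interface Agreement {
  majority: string
  dissent: SourceClaim[] // claims that disagree with the majority value
}

function summarizeAgreement(claims: SourceClaim[]): Agreement | undefined {
  if (claims.length === 0) return undefined
  // Count how many sources report each value.
  const counts: Record<string, number> = {}
  for (const claim of claims) {
    counts[claim.value] = (counts[claim.value] || 0) + 1
  }
  // On a tie, the value reported first wins.
  let majority = claims[0].value
  for (const value of Object.keys(counts)) {
    if (counts[value] > counts[majority]) majority = value
  }
  return { majority, dissent: claims.filter((c) => c.value !== majority) }
}

const deadline = summarizeAgreement([
  { source: "CRM", value: "March 15" },
  { source: "Notion", value: "March 15" },
  { source: "Email", value: "April 2" },
])
console.log(deadline && deadline.dissent.length) // 1
```

A non-empty `dissent` list is the signal a separate display layer could pick up.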
03
Failure
The user knows a specific turn's answer is wrong but the metadata strip still reads "high confidence" because the agent was confident in a wrong source. The thumbs-down control captures the disagreement signal but does not propagate to the upstream sources. Source-level correction is out of scope for this pattern. It belongs to a feedback loop layer.
04
Failure
A multi-step task starts on solid ground and degrades across turns. The thread shows the change tier-by-tier, but the agent does not surface the trend proactively or warn the user that later steps may have been built on earlier uncertain ones. Cross-turn confidence summarization is a separate pattern.
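As a rough illustration of what a cross-turn check might look at, the sketch below finds the first turn where the tier drops. It is not part of this pattern. The ranking and the `firstDrop` helper are assumptions, and a real summarizer would likely weigh more than adjacent turns.

```typescript
type ConfidenceTier = "high" | "medium" | "low"

// Assumed ordering of tiers, lowest to highest.
const TIER_RANK: Record<ConfidenceTier, number> = { low: 0, medium: 1, high: 2 }

// Returns the index of the first turn whose tier is lower than
// the turn before it, or -1 if confidence never drops.
function firstDrop(tiers: ConfidenceTier[]): number {
  for (let i = 1; i < tiers.length; i++) {
    if (TIER_RANK[tiers[i]] < TIER_RANK[tiers[i - 1]]) return i
  }
  return -1
}

console.log(firstDrop(["high", "high", "medium", "low"])) // 2
```

Every turn from that index onward is a candidate for a warning that it rests on weaker ground.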
Implementation
The chip is the smallest display unit. It accepts a confidence object: a tier, a reasoning string, and the model's internal score. The dot's color and the plain-language label both encode the tier, and the tier implies the recommended action. The internal score is held in reserve and surfaced only when the user opens the chip, alongside the reasoning, so the numeric signal is available for inspection without competing with the categorical headline.
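// Popover primitives. The import path assumes a shadcn/ui-style project layout.
import { Popover, PopoverContent, PopoverTrigger } from "@/components/ui/popover"

type ConfidenceTier = "high" | "medium" | "low"

interface ConfidenceData {
  tier: ConfidenceTier
  reasoning: string
  // Model's internal score, shown only inside the popover.
  // Optional here; a 0..1 range is assumed.
  score?: number
}

// Dot color per tier: green, amber, red.
const TIER_DOT: Record<ConfidenceTier, string> = {
  high: "#1F8B4C",
  medium: "#C8881C",
  low: "#C42929",
}

// Plain-language label per tier.
const TIER_LABEL: Record<ConfidenceTier, string> = {
  high: "High confidence",
  medium: "Medium confidence",
  low: "Low confidence",
}

export function ConfidenceChip({ confidence }: { confidence: ConfidenceData }) {
  return (
    <Popover>
      <PopoverTrigger className="inline-flex items-center gap-1.5 text-[12px]">
        {/* Decorative dot; the text label carries the meaning. */}
        <span
          aria-hidden
          className="size-1.5 rounded-full"
          style={{ backgroundColor: TIER_DOT[confidence.tier] }}
        />
        <span style={{ color: "var(--text-muted)" }}>
          {TIER_LABEL[confidence.tier]}
        </span>
      </PopoverTrigger>
      <PopoverContent side="top" className="max-w-[280px]">
        <p className="text-[13px]">{confidence.reasoning}</p>
        {confidence.score !== undefined && (
          <p className="text-[12px]" style={{ color: "var(--text-muted)" }}>
            Model score: {confidence.score.toFixed(2)}
          </p>
        )}
      </PopoverContent>
    </Popover>
  )
}
Trade-offs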
Five places where the obvious choice and the right choice diverged.
| Decision | Considered | Reasoning |
|---|---|---|
| Categorical tier vs raw percentage | High/Med/Low chips, 5-star rating, raw probability, slider | Numeric confidence drifts with model version and prompting in ways the user cannot see. Categorical tiers survive that drift and align with how people actually decide: act, verify, or confirm. The raw numeric value stays in the chip detail panel for users who open it. |
| Inline metadata strip vs separate confidence panel | Sidebar widget, modal detail view, sticky footer bar | A separate panel decouples the signal from the claim it qualifies. The user has to look in two places. Inline keeps confidence attached to the answer it describes and removes one step from the verify path. |
| Named source pills vs verified badge | "Verified" badge only, source count alone, full document title with link | A badge tells the user the system is sure of itself. A named source tells the user where to verify. In enterprise agents, sources are systems of record (Notion, CRM, Email). Naming them lets the user open the right tool, not a generic verification page. |
| Reasoning on hover vs reasoning in row | Inline grounding sentence, tooltip-only, expandable detail | The reasoning matters for users who want to inspect the call, but it adds a full extra row of vertical space for users who do not. Keeping that detail behind the chip keeps the row compact while preserving the signal for anyone who opens it. |
| Three semantic tiers vs two | Two tiers (act / verify), full gradient, percentage only, traffic-light | Two tiers force "verify" and "do not act" into the same bucket. Three separates the case where the agent is confident but the user should sanity-check (medium) from the case where the agent itself is unsure (low). Three tiers map to three distinct user actions without adding cognitive load. |
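
The three-tier argument can be stated as a lookup from tier to user action. This is a sketch: the `UserAction` names follow the "act, verify, or confirm" wording in the table, and pairing them with tiers in that order is an assumption.

```typescript
type ConfidenceTier = "high" | "medium" | "low"
type UserAction = "act" | "verify" | "confirm"

// Assumed pairing: each tier maps to exactly one action.
const TIER_ACTION: Record<ConfidenceTier, UserAction> = {
  high: "act",      // agent is confident and grounded
  medium: "verify", // agent is confident, user should sanity-check
  low: "confirm",   // agent itself is unsure
}

console.log(TIER_ACTION["medium"]) // "verify"
```

Typing the table as `Record<ConfidenceTier, UserAction>` makes the compiler reject a tier with no action, which is the two-tier problem in miniature.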