The Self-Improving Prompt Loop: How Telemetry Closes the Gap Between Good and Great
Here's an uncomfortable truth about AI systems: nobody knows which prompts are working and which aren't.
You write a prompt. You deploy it. Users interact with it. Some responses are great, some are mediocre, some are actively harmful. But you have no systematic visibility into prompt performance because the telemetry doesn't exist. The prompt was a string. It went in. Something came out. The string was forgotten.
HUMΛN's prompt management system changes this fundamentally. Every prompt has identity. Every LLM call records which prompts were used. Feedback signals flow back. Performance data accumulates. And a dedicated agent — the Prompt Refinement Agent — monitors everything and proposes improvements.
This is the self-improving prompt loop. It's protocol-level. It's evidence-based. And it keeps humans in the loop for every decision that matters.
The Observability Gap
To understand why this matters, consider what's invisible in a typical AI system:
- Which prompts are used most? No idea. They're inline strings without identity.
- Which prompts cost the most? Unknown. Token consumption isn't tracked per prompt.
- Which prompts produce the best results? Unknowable. There's no feedback mechanism.
- Which model works best for which prompt? Nobody's measuring. Models are assigned statically.
- When a prompt degrades, how do you know? You don't. Until users complain.
This isn't a tooling problem. It's an architectural problem. Without prompt identity, there's nothing to observe. Without telemetry, there's nothing to improve. Without a feedback loop, improvement is manual, ad-hoc, and slow.
Protocol-Level Prompt Telemetry
In HUMΛN's HAIO protocol, prompt telemetry is a MARA (Multi-Agent Runtime Architecture) concern. It's not a Companion feature. It's not an add-on. Every agent that uses ctx.prompts automatically participates in the telemetry pipeline.
The system captures four types of data:
1. Prompt Call Metadata
When an agent loads and composes prompts, the system generates PromptCallMetadata — a structured record of exactly which prompts were used:
```typescript
interface PromptCallMetadata {
  promptUris: string[];       // Full URIs of all prompts used
  promptVersions: string[];   // Their versions
  layers?: PromptLayer[];     // Full provenance: scope, URI, version per layer
  compositionMethod: string;  // 'compose' | 'effective' | 'manual'
  variablesUsed?: string[];   // Which template variables were substituted (not values)
  estimatedTokens?: number;   // Pre-call estimate
  orgId: string;              // Org context
}
```
This metadata is threaded through the LLM call via ctx.llm.complete({ promptMetadata }). It travels into provenance alongside the model selection, token usage, latency, and cost. After the call, you can trace exactly which prompts contributed to any response.
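The threading can be sketched with stubbed stand-ins for the HAIO context. Everything below is illustrative — the `composePrompts` and `complete` functions here are minimal mocks, not the real runtime API — but the shape matches the metadata interface above:

```typescript
// Minimal stand-ins for the HAIO context; the real implementations live in the runtime.
interface PromptCallMetadata {
  promptUris: string[];
  promptVersions: string[];
  compositionMethod: string;
  orgId: string;
}

interface CompletionResult {
  text: string;
  provenanceId: string;
  promptMetadata: PromptCallMetadata; // echoed into provenance alongside model, tokens, cost
}

// Stub: composing prompts yields the prompt text plus the metadata record for the call.
function composePrompts(uris: string[], orgId: string) {
  const metadata: PromptCallMetadata = {
    promptUris: uris,
    promptVersions: uris.map(() => "1.0.0"), // stub: real versions come from the registry
    compositionMethod: "compose",
    orgId,
  };
  return { text: uris.map((u) => `[${u}]`).join("\n"), metadata };
}

// Stub: the LLM call threads the metadata into the result's provenance.
function complete(opts: { prompt: string; promptMetadata: PromptCallMetadata }): CompletionResult {
  return {
    text: "stub response",
    provenanceId: "prov-001",
    promptMetadata: opts.promptMetadata,
  };
}

const { text, metadata } = composePrompts(
  ["prompt://org/HUMAN_CORP/legal.contract-analysis@1.0.0"],
  "HUMAN_CORP",
);
const result = complete({ prompt: text, promptMetadata: metadata });
// result.promptMetadata.promptUris traces which prompts produced this response.
```

The key design point is that the metadata travels *with* the call rather than being logged separately, so provenance and prompt identity can never drift apart.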
2. Feedback Signals
Agents and users can report prompt effectiveness through structured feedback:
```typescript
await ctx.llm.recordPromptFeedback({
  provenanceId: result.provenanceId,
  signal: 'positive',  // or 'negative', 'rephrase', 'correction'
  source: 'agent',     // or 'user', 'system'
  detail: 'High-confidence analysis with structured output',
});
```
Four signal types capture different quality dimensions:
| Signal | Meaning | Example |
|---|---|---|
| `positive` | Response met or exceeded expectations | Agent self-reports high-confidence result |
| `negative` | Response was poor quality | User rates response unhelpful |
| `rephrase` | User had to rephrase to get a good result | Indicates unclear prompt instructions |
| `correction` | Output required manual correction | Indicates systematic prompt weakness |
Feedback signals are lightweight and can be generated automatically by agents (based on downstream quality metrics) or manually by users. The key insight: agents can evaluate their own output quality and feed that signal back to the prompt that produced it.
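One way to collapse a window of signals into a single number is a weighted average. The weights below are illustrative assumptions — the document doesn't specify HAIO's actual scoring formula — but they capture the idea that a rephrase is a milder penalty than a correction:

```typescript
// Hypothetical scoring: each signal type carries a weight reflecting its severity.
// These weights are illustrative assumptions, not HAIO's actual formula.
type Signal = "positive" | "negative" | "rephrase" | "correction";

const SIGNAL_WEIGHTS: Record<Signal, number> = {
  positive: 1.0,
  negative: -1.0,
  rephrase: -0.5,    // unclear instructions: mildly negative
  correction: -0.75, // systematic weakness: strongly negative
};

// Collapse a window of signals into a single score in [-1, 1].
function feedbackScore(signals: Signal[]): number {
  if (signals.length === 0) return 0;
  const total = signals.reduce((sum, s) => sum + SIGNAL_WEIGHTS[s], 0);
  return total / signals.length;
}

feedbackScore(["positive", "positive", "rephrase"]); // leans positive
feedbackScore(["negative", "correction"]);           // clearly negative
```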
3. Performance Snapshots
Telemetry data is aggregated into PromptPerformanceSnapshot — a comprehensive view of how each prompt is performing:
```typescript
interface PromptPerformanceSnapshot {
  promptKey: string;
  window: { start: string; end: string };
  stats: {
    totalCalls: number;
    avgTokens: number;
    avgCostUsd: number;
    avgDurationMs: number;
    positiveSignals: number;
    negativeSignals: number;
    rephrases: number;
    corrections: number;
  };
  modelBreakdown: Array<{
    modelId: string;
    callCount: number;
    avgLatencyMs: number;
    avgCostUsd: number;
    feedbackScore: number;
  }>;
  recommendedModel?: string;
}
```
This answers previously unanswerable questions: How many times was this prompt used last week? What did it cost? Which model performed best? Is the negative signal rate trending up?
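The aggregation itself is a straightforward fold over per-call records. This sketch assumes a simple `CallRecord` shape — the field names are illustrative, not the actual telemetry schema — and produces the `stats` block from the interface above:

```typescript
// Assumed shape of a per-call telemetry record; field names are illustrative.
interface CallRecord {
  promptKey: string;
  tokens: number;
  costUsd: number;
  durationMs: number;
  signal?: "positive" | "negative" | "rephrase" | "correction";
}

interface SnapshotStats {
  totalCalls: number;
  avgTokens: number;
  avgCostUsd: number;
  avgDurationMs: number;
  positiveSignals: number;
  negativeSignals: number;
  rephrases: number;
  corrections: number;
}

// Fold a window of call records into the snapshot's stats block.
function aggregate(records: CallRecord[]): SnapshotStats {
  const n = records.length || 1; // avoid division by zero for an empty window
  const count = (s: CallRecord["signal"]) => records.filter((r) => r.signal === s).length;
  return {
    totalCalls: records.length,
    avgTokens: records.reduce((a, r) => a + r.tokens, 0) / n,
    avgCostUsd: records.reduce((a, r) => a + r.costUsd, 0) / n,
    avgDurationMs: records.reduce((a, r) => a + r.durationMs, 0) / n,
    positiveSignals: count("positive"),
    negativeSignals: count("negative"),
    rephrases: count("rephrase"),
    corrections: count("correction"),
  };
}

const stats = aggregate([
  { promptKey: "legal.contract-analysis", tokens: 1200, costUsd: 0.006, durationMs: 1100, signal: "positive" },
  { promptKey: "legal.contract-analysis", tokens: 1300, costUsd: 0.007, durationMs: 1300, signal: "rephrase" },
]);
```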
4. Model Affinity
Here's where it gets interesting. Since HUMΛN's Model Registry dynamically selects models based on capability and cost, the same prompt might run on different models across different calls. Over time, the system accumulates data on which (prompt, model) pairs produce the best results.
```typescript
interface PromptModelAffinity {
  promptUri: string;
  modelId: string;
  sampleSize: number;
  stats: {
    avgLatencyMs: number;
    avgCostUsd: number;
    successRate: number;
    positiveSignalRate: number;
    negativeSignalRate: number;
  };
  affinityScore: number; // Quality-weighted, cost-adjusted score
}
```
This affinity data feeds back into model routing as a soft preference signal. When a prompt has strong affinity data showing that GPT-4o produces better results than Claude Sonnet for this specific use case, the Model Registry biases toward GPT-4o — without hard-pinning, so capability-first routing still applies.
The result is a system that gets better at matching prompts to models over time, entirely from empirical evidence.
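A "quality-weighted, cost-adjusted" score and a soft routing bias can be sketched as follows. The constants and the formula are assumptions for illustration — the real HAIO weighting isn't specified here — but the structure shows the key property: affinity only ranks models that have *already* passed capability-first filtering, so it never overrides it:

```typescript
interface AffinityStats {
  avgCostUsd: number;
  successRate: number;        // 0..1
  positiveSignalRate: number; // 0..1
  negativeSignalRate: number; // 0..1
}

// Illustrative formula: quality dominates, cost applies a capped penalty.
// These constants are assumptions, not HAIO's actual weighting.
function affinityScore(s: AffinityStats): number {
  const quality = 0.5 * s.successRate + 0.5 * (s.positiveSignalRate - s.negativeSignalRate);
  const costPenalty = Math.min(s.avgCostUsd * 10, 0.3); // cap so cost never dominates quality
  return Math.max(0, quality - costPenalty);
}

// Soft preference: pick the highest-affinity model, but only among
// candidates that already passed capability-first routing.
function biasRouting(capable: Array<{ modelId: string; affinity: number }>): string {
  return capable.reduce((best, m) => (m.affinity > best.affinity ? m : best)).modelId;
}

const pick = biasRouting([
  { modelId: "gpt-4o", affinity: 0.92 },
  { modelId: "claude-sonnet", affinity: 0.87 },
]);
// pick === "gpt-4o"
```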
The Prompt Refinement Agent
Data without action is just a dashboard. The Prompt Refinement Agent is what closes the loop between observation and improvement.
This is a scheduled operations agent — not a human, not a cron job, but a real agent running in the MARA runtime with its own identity, delegation scopes, and audit trail. It:
- Monitors telemetry across all prompts in its scope
- Identifies underperformers using configurable thresholds:
  - High negative signal rate (> threshold)
  - Low success rate compared to peer prompts
  - Cost outliers (same task type, 3x the tokens)
  - Model affinity mismatches (prompt consistently assigned to a suboptimal model)
- Generates improvement proposals (PromptChangeProposal):
  - Analyses telemetry patterns
  - Uses an LLM to draft improved prompt text
  - Calculates risk level based on prompt scope and usage volume
- Surfaces proposals for human review with full evidence
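The underperformer checks can be sketched as a small predicate over per-prompt health data. The threshold values and field names below are illustrative assumptions — in practice they would be configurable per scope:

```typescript
// Illustrative thresholds; in HAIO these would be configurable per scope.
interface Thresholds {
  maxNegativeRate: number;   // e.g. flag above 15% negative signals
  minSuccessRate: number;    // relative to peer prompts
  costOutlierFactor: number; // e.g. 3x the peer-average token count
}

interface PromptHealth {
  promptUri: string;
  negativeRate: number;
  successRate: number;
  avgTokens: number;
  peerAvgTokens: number;
}

// Return the reasons a prompt is flagged, or an empty list if healthy.
function flagUnderperformer(p: PromptHealth, t: Thresholds): string[] {
  const reasons: string[] = [];
  if (p.negativeRate > t.maxNegativeRate) reasons.push("high-negative-rate");
  if (p.successRate < t.minSuccessRate) reasons.push("low-success-rate");
  if (p.avgTokens > t.costOutlierFactor * p.peerAvgTokens) reasons.push("cost-outlier");
  return reasons;
}

const flags = flagUnderperformer(
  {
    promptUri: "prompt://org/HUMAN_CORP/legal.contract-analysis@1.0.0",
    negativeRate: 0.22, successRate: 0.9, avgTokens: 4000, peerAvgTokens: 1200,
  },
  { maxNegativeRate: 0.15, minSuccessRate: 0.8, costOutlierFactor: 3 },
);
// flags: ["high-negative-rate", "cost-outlier"]
```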
Governance: Human Approval Is Non-Negotiable
The refinement agent follows strict governance rules:
| Prompt Scope | Auto-Apply? | Approval Required |
|---|---|---|
| Core prompts | Never | Always requires human approval |
| Org prompts (autoTune: true, Low risk) | Yes, with audit log | No (but audited) |
| Org prompts (all others) | No | Human review required |
| User patches | Only with consent, Low risk | User consent |
Core prompts — the ones that define the system's foundational behaviour — never auto-apply. This is a Canon-level guarantee. AI can propose. Humans decide.
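The governance table above can be encoded as a small policy check. This is a sketch — the scope and risk enums and the function name are assumptions, not the runtime's actual API — but it makes the invariant explicit: no code path ever auto-applies a core prompt change:

```typescript
type Scope = "core" | "org" | "user";
type Risk = "low" | "medium" | "high";

interface ProposalContext {
  scope: Scope;
  risk: Risk;
  autoTune?: boolean;    // org prompts can opt in to auto-apply
  userConsent?: boolean; // user patches require explicit consent
}

// Encode the governance table: core never auto-applies; org prompts only
// with autoTune enabled and low risk; user patches only with consent and low risk.
function canAutoApply(p: ProposalContext): boolean {
  switch (p.scope) {
    case "core":
      return false; // Canon-level guarantee: always human-approved
    case "org":
      return p.autoTune === true && p.risk === "low";
    case "user":
      return p.userConsent === true && p.risk === "low";
  }
}

canAutoApply({ scope: "core", risk: "low" });                   // false, always
canAutoApply({ scope: "org", risk: "low", autoTune: true });    // true, with audit log
canAutoApply({ scope: "org", risk: "medium", autoTune: true }); // false, human review
```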
The CLI Workflow
Stewards interact with proposals through the CLI:
```bash
# Check prompt health
human prompts performance contract-analysis

# Output:
# prompt://org/HUMAN_CORP/legal.contract-analysis@1.0.0
# Calls: 847 (last 7 days)
# Avg tokens: 1,203 | Avg cost: $0.006 | Avg latency: 1.2s
# Positive: 72% | Negative: 12% | Rephrase: 16%
# Best model: gpt-4o (0.92 affinity) > claude-sonnet (0.87)
#
# Recommendations:
# - [PROPOSAL-47] Reduce token count by restructuring output format (-18%)
# - [PROPOSAL-48] Switch recommended model to gpt-4o based on affinity data

# Review a specific proposal
human prompts proposals show PROPOSAL-47
# Shows: full diff, evidence summary, risk assessment, affected usage volume

# Accept — publishes new version
human prompts proposals accept PROPOSAL-47
# Published: legal.contract-analysis@1.1.0
# Previous active: 1.0.0 (available for rollback)

# Or reject with reason
human prompts proposals reject PROPOSAL-47 --reason "Output format change would break downstream parsing"
```
Every decision — accept, reject, modify — is logged to provenance with full context. The audit trail shows who approved what change, based on what evidence, at what time.
The Virtuous Cycle
Put it all together and you get a cycle that compounds improvements over time: prompts are deployed, telemetry accumulates per prompt and per model, the Prompt Refinement Agent analyses the data and surfaces proposals, humans accept or reject them, new versions publish, and fresh telemetry starts flowing against the improved prompts.
Each pass through this loop makes the system measurably better. Prompts get refined based on real usage data. Model routing gets optimised based on empirical affinity. And humans stay in control of every decision that shapes how the system behaves.
This isn't machine learning in the traditional sense. There's no gradient descent. No fine-tuning. It's evidence-based prompt engineering at scale, with the AI doing the analysis and the human making the call.
Why This Is Different
Most AI platforms stop at "deploy the prompt." Some add A/B testing. A few track basic metrics. None of them treat prompts as protocol-level managed artifacts with identity, delegation, telemetry, and a self-improving agent.
HUMΛN's approach is different because it's architectural, not bolted on:
- Protocol-level: Telemetry is built into ctx.prompts and ctx.llm, not a separate observability tool
- Identity-first: You can't improve what you can't identify. Prompt URIs make every prompt traceable.
- Agent-driven: The Prompt Refinement Agent is a real agent with delegation scopes, not a dashboard
- Human-in-the-loop: AI proposes, humans decide. Core prompts always require approval.
- Cross-cutting: Every agent benefits, not just the Companion
The self-improving prompt loop is what happens when you design prompt management as infrastructure rather than an afterthought. It's the compound interest on treating prompts as the critical assets they are.
This is the second in a three-part series on HUMΛN's prompt management architecture. Previously: Protocol-Level Prompt Management. Next: From Inline Strings to ctx.prompts: A Developer's Guide.
Prompt Management — Part 2 of 3