The Self-Improving Prompt Loop: How Telemetry Closes the Gap Between Good and Great
Here's an uncomfortable truth about AI systems: nobody knows which prompts are working and which aren't.
You write a prompt. You deploy it. Users interact with it. Some responses are great, some are mediocre, some are actively harmful. But you have no systematic visibility into prompt performance because the telemetry doesn't exist. The prompt was a string. It went in. Something came out. The string was forgotten.
HUMΛN's prompt management system changes this fundamentally. Every prompt has identity. Every LLM call records which prompts were used. Feedback signals flow back. Performance data accumulates. And a dedicated agent — the Prompt Refinement Agent — monitors everything and proposes improvements.
This is the self-improving prompt loop. It's protocol-level. It's evidence-based. And it keeps humans in the loop for every decision that matters.
The Observability Gap
To understand why this matters, consider what's invisible in a typical AI system:
- Which prompts are used most? No idea. They're inline strings without identity.
- Which prompts cost the most? Unknown. Token consumption isn't tracked per prompt.
- Which prompts produce the best results? Unknowable. There's no feedback mechanism.
- Which model works best for which prompt? Nobody's measuring. Models are assigned statically.
- When a prompt degrades, how do you know? You don't. Until users complain.
This isn't a tooling problem. It's an architectural problem. Without prompt identity, there's nothing to observe. Without telemetry, there's nothing to improve. Without a feedback loop, improvement is manual, ad-hoc, and slow.
Protocol-Level Prompt Telemetry
In HUMΛN's HAIO protocol, prompt telemetry is a MARA (Multi-Agent Runtime Architecture) concern. It's not a Companion feature. It's not an add-on. Every agent that uses ctx.prompts automatically participates in the telemetry pipeline.
The system captures four types of data:
1. Prompt Call Metadata
When an agent loads and composes prompts, the system generates PromptCallMetadata — a structured record of exactly which prompts were used:
```typescript
interface PromptCallMetadata {
  promptUris: string[];       // Full URIs of all prompts used
  promptVersions: string[];   // Their versions
  layers?: PromptLayer[];     // Full provenance: scope, URI, version per layer
  compositionMethod: string;  // 'compose' | 'effective' | 'manual'
  variablesUsed?: string[];   // Which template variables were substituted (not values)
  estimatedTokens?: number;   // Pre-call estimate
  orgId: string;              // Org context
}
```
This metadata is threaded through the LLM call via ctx.llm.complete({ promptMetadata }). It travels into provenance alongside the model selection, token usage, latency, and cost. After the call, you can trace exactly which prompts contributed to any response.
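The threading can be sketched with stubbed stand-ins for the HAIO context. Everything below is illustrative — the `composePrompts` and `complete` functions here are minimal mocks, not the real runtime API — but the shape matches the metadata interface above:

```typescript
// Minimal stand-ins for the HAIO context; the real implementations live in the runtime.
interface PromptCallMetadata {
  promptUris: string[];
  promptVersions: string[];
  compositionMethod: string;
  orgId: string;
}

interface CompletionResult {
  text: string;
  provenanceId: string;
  promptMetadata: PromptCallMetadata; // echoed into provenance alongside model, tokens, cost
}

// Stub: composing prompts yields the prompt text plus the metadata record for the call.
function composePrompts(uris: string[], orgId: string) {
  const metadata: PromptCallMetadata = {
    promptUris: uris,
    promptVersions: uris.map(() => "1.0.0"), // stub: real versions come from the registry
    compositionMethod: "compose",
    orgId,
  };
  return { text: uris.map((u) => `[${u}]`).join("\n"), metadata };
}

// Stub: the LLM call threads the metadata into the result's provenance.
function complete(opts: { prompt: string; promptMetadata: PromptCallMetadata }): CompletionResult {
  return {
    text: "stub response",
    provenanceId: "prov-001",
    promptMetadata: opts.promptMetadata,
  };
}

const { text, metadata } = composePrompts(
  ["prompt://org/HUMAN_CORP/legal.contract-analysis@1.0.0"],
  "HUMAN_CORP",
);
const result = complete({ prompt: text, promptMetadata: metadata });
// result.promptMetadata.promptUris traces which prompts produced this response.
```

The key design point is that the metadata travels *with* the call rather than being logged separately, so provenance and prompt identity can never drift apart.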
2. Feedback Signals
Agents and users can report prompt effectiveness through structured feedback:
```typescript
await ctx.llm.recordPromptFeedback({
  provenanceId: result.provenanceId,
  signal: 'positive',  // or 'negative', 'rephrase', 'correction'
  source: 'agent',     // or 'user', 'system'
  detail: 'High-confidence analysis with structured output',
});
```
Four signal types capture different quality dimensions:
| Signal | Meaning | Example |
|---|---|---|
| `positive` | Response met or exceeded expectations | Agent self-reports high-confidence result |
| `negative` | Response was poor quality | User rates response unhelpful |
| `rephrase` | User had to rephrase to get a good result | Indicates unclear prompt instructions |
| `correction` | Output required manual correction | Indicates systematic prompt weakness |
Feedback signals are lightweight and can be generated automatically by agents (based on downstream quality metrics) or manually by users. The key insight: agents can evaluate their own output quality and feed that signal back to the prompt that produced it.
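One way to collapse a window of signals into a single number is a weighted average. The weights below are illustrative assumptions — the document doesn't specify HAIO's actual scoring formula — but they capture the idea that a rephrase is a milder penalty than a correction:

```typescript
// Hypothetical scoring: each signal type carries a weight reflecting its severity.
// These weights are illustrative assumptions, not HAIO's actual formula.
type Signal = "positive" | "negative" | "rephrase" | "correction";

const SIGNAL_WEIGHTS: Record<Signal, number> = {
  positive: 1.0,
  negative: -1.0,
  rephrase: -0.5,    // unclear instructions: mildly negative
  correction: -0.75, // systematic weakness: strongly negative
};

// Collapse a window of signals into a single score in [-1, 1].
function feedbackScore(signals: Signal[]): number {
  if (signals.length === 0) return 0;
  const total = signals.reduce((sum, s) => sum + SIGNAL_WEIGHTS[s], 0);
  return total / signals.length;
}

feedbackScore(["positive", "positive", "rephrase"]); // leans positive
feedbackScore(["negative", "correction"]);           // clearly negative
```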
3. Performance Snapshots
Telemetry data is aggregated into PromptPerformanceSnapshot — a comprehensive view of how each prompt is performing:
```typescript
interface PromptPerformanceSnapshot {
  promptKey: string;
  window: { start: string; end: string };
  stats: {
    totalCalls: number;
    avgTokens: number;
    avgCostUsd: number;
    avgDurationMs: number;
    positiveSignals: number;
    negativeSignals: number;
    rephrases: number;
    corrections: number;
  };
  modelBreakdown: Array<{
    modelId: string;
    callCount: number;
    avgLatencyMs: number;
    avgCostUsd: number;
    feedbackScore: number;
  }>;
  recommendedModel?: string;
}
```
This answers previously unanswerable questions: How many times was this prompt used last week? What did it cost? Which model performed best? Is the negative signal rate trending up?
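The aggregation itself is a straightforward fold over per-call records. This sketch assumes a simple `CallRecord` shape — the field names are illustrative, not the actual telemetry schema — and produces the `stats` block from the interface above:

```typescript
// Assumed shape of a per-call telemetry record; field names are illustrative.
interface CallRecord {
  promptKey: string;
  tokens: number;
  costUsd: number;
  durationMs: number;
  signal?: "positive" | "negative" | "rephrase" | "correction";
}

interface SnapshotStats {
  totalCalls: number;
  avgTokens: number;
  avgCostUsd: number;
  avgDurationMs: number;
  positiveSignals: number;
  negativeSignals: number;
  rephrases: number;
  corrections: number;
}

// Fold a window of call records into the snapshot's stats block.
function aggregate(records: CallRecord[]): SnapshotStats {
  const n = records.length || 1; // avoid division by zero for an empty window
  const count = (s: CallRecord["signal"]) => records.filter((r) => r.signal === s).length;
  return {
    totalCalls: records.length,
    avgTokens: records.reduce((a, r) => a + r.tokens, 0) / n,
    avgCostUsd: records.reduce((a, r) => a + r.costUsd, 0) / n,
    avgDurationMs: records.reduce((a, r) => a + r.durationMs, 0) / n,
    positiveSignals: count("positive"),
    negativeSignals: count("negative"),
    rephrases: count("rephrase"),
    corrections: count("correction"),
  };
}

const stats = aggregate([
  { promptKey: "legal.contract-analysis", tokens: 1200, costUsd: 0.006, durationMs: 1100, signal: "positive" },
  { promptKey: "legal.contract-analysis", tokens: 1300, costUsd: 0.007, durationMs: 1300, signal: "rephrase" },
]);
```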
4. Model Affinity
Here's where it gets interesting. Since HUMΛN's Model Registry dynamically selects models based on capability and cost, the same prompt might run on different models across different calls. Over time, the system accumulates data on which (prompt, model) pairs produce the best results.
```typescript
interface PromptModelAffinity {
  promptUri: string;
  modelId: string;
  sampleSize: number;
  stats: {
    avgLatencyMs: number;
    avgCostUsd: number;
    successRate: number;
    positiveSignalRate: number;
    negativeSignalRate: number;
  };
  affinityScore: number; // Quality-weighted, cost-adjusted score
}
```
This affinity data feeds back into model routing as a soft preference signal. When a prompt has strong affinity data showing that GPT-4o produces better results than Claude Sonnet for this specific use case, the Model Registry biases toward GPT-4o — without hard-pinning, so capability-first routing still applies.
The result is a system that gets better at matching prompts to models over time, entirely from empirical evidence.
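A "quality-weighted, cost-adjusted" score and a soft routing bias can be sketched as follows. The constants and the formula are assumptions for illustration — the real HAIO weighting isn't specified here — but the structure shows the key property: affinity only ranks models that have *already* passed capability-first filtering, so it never overrides it:

```typescript
interface AffinityStats {
  avgCostUsd: number;
  successRate: number;        // 0..1
  positiveSignalRate: number; // 0..1
  negativeSignalRate: number; // 0..1
}

// Illustrative formula: quality dominates, cost applies a capped penalty.
// These constants are assumptions, not HAIO's actual weighting.
function affinityScore(s: AffinityStats): number {
  const quality = 0.5 * s.successRate + 0.5 * (s.positiveSignalRate - s.negativeSignalRate);
  const costPenalty = Math.min(s.avgCostUsd * 10, 0.3); // cap so cost never dominates quality
  return Math.max(0, quality - costPenalty);
}

// Soft preference: pick the highest-affinity model, but only among
// candidates that already passed capability-first routing.
function biasRouting(capable: Array<{ modelId: string; affinity: number }>): string {
  return capable.reduce((best, m) => (m.affinity > best.affinity ? m : best)).modelId;
}

const pick = biasRouting([
  { modelId: "gpt-4o", affinity: 0.92 },
  { modelId: "claude-sonnet", affinity: 0.87 },
]);
// pick === "gpt-4o"
```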
The Prompt Refinement Agent
Data without action is just a dashboard. The Prompt Refinement Agent is what closes the loop between observation and improvement.
This is a scheduled operations agent — not a human, not a cron job, but a real agent running in the MARA runtime with its own identity, delegation scopes, and audit trail. It:
- Monitors telemetry across all prompts in its scope
- Identifies underperformers using configurable thresholds:
  - High negative signal rate (> threshold)
  - Low success rate compared to peer prompts
  - Cost outliers (same task type, 3x the tokens)
  - Model affinity mismatches (prompt consistently assigned to a suboptimal model)
- Generates improvement proposals (PromptChangeProposal):
  - Analyses telemetry patterns
  - Uses an LLM to draft improved prompt text
  - Calculates risk level based on prompt scope and usage volume
- Surfaces proposals for human review with full evidence
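The underperformer checks can be sketched as a small predicate over per-prompt health data. The threshold values and field names below are illustrative assumptions — in practice they would be configurable per scope:

```typescript
// Illustrative thresholds; in HAIO these would be configurable per scope.
interface Thresholds {
  maxNegativeRate: number;   // e.g. flag above 15% negative signals
  minSuccessRate: number;    // relative to peer prompts
  costOutlierFactor: number; // e.g. 3x the peer-average token count
}

interface PromptHealth {
  promptUri: string;
  negativeRate: number;
  successRate: number;
  avgTokens: number;
  peerAvgTokens: number;
}

// Return the reasons a prompt is flagged, or an empty list if healthy.
function flagUnderperformer(p: PromptHealth, t: Thresholds): string[] {
  const reasons: string[] = [];
  if (p.negativeRate > t.maxNegativeRate) reasons.push("high-negative-rate");
  if (p.successRate < t.minSuccessRate) reasons.push("low-success-rate");
  if (p.avgTokens > t.costOutlierFactor * p.peerAvgTokens) reasons.push("cost-outlier");
  return reasons;
}

const flags = flagUnderperformer(
  {
    promptUri: "prompt://org/HUMAN_CORP/legal.contract-analysis@1.0.0",
    negativeRate: 0.22, successRate: 0.9, avgTokens: 4000, peerAvgTokens: 1200,
  },
  { maxNegativeRate: 0.15, minSuccessRate: 0.8, costOutlierFactor: 3 },
);
// flags: ["high-negative-rate", "cost-outlier"]
```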
Governance: Human Approval Is Non-Negotiable
The refinement agent follows strict governance rules:
| Prompt Scope | Auto-Apply? | Approval Required |
|---|---|---|
| Core prompts | Never | Always requires human approval |
| Org prompts (autoTune: true, Low risk) | Yes, with audit log | No (but audited) |
| Org prompts (all others) | No | Human review required |
| User patches | Only with consent, Low risk | User consent |
Core prompts — the ones that define the system's foundational behaviour — never auto-apply. This is a Canon-level guarantee. AI can propose. Humans decide.
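The governance table above can be encoded as a small policy check. This is a sketch — the scope and risk enums and the function name are assumptions, not the runtime's actual API — but it makes the invariant explicit: no code path ever auto-applies a core prompt change:

```typescript
type Scope = "core" | "org" | "user";
type Risk = "low" | "medium" | "high";

interface ProposalContext {
  scope: Scope;
  risk: Risk;
  autoTune?: boolean;    // org prompts can opt in to auto-apply
  userConsent?: boolean; // user patches require explicit consent
}

// Encode the governance table: core never auto-applies; org prompts only
// with autoTune enabled and low risk; user patches only with consent and low risk.
function canAutoApply(p: ProposalContext): boolean {
  switch (p.scope) {
    case "core":
      return false; // Canon-level guarantee: always human-approved
    case "org":
      return p.autoTune === true && p.risk === "low";
    case "user":
      return p.userConsent === true && p.risk === "low";
  }
}

canAutoApply({ scope: "core", risk: "low" });                   // false, always
canAutoApply({ scope: "org", risk: "low", autoTune: true });    // true, with audit log
canAutoApply({ scope: "org", risk: "medium", autoTune: true }); // false, human review
```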
The CLI Workflow
Stewards interact with proposals through the CLI:
```bash
# Check prompt health
human prompts performance contract-analysis

# Output:
# prompt://org/HUMAN_CORP/legal.contract-analysis@1.0.0
# Calls: 847 (last 7 days)
# Avg tokens: 1,203 | Avg cost: $0.006 | Avg latency: 1.2s
# Positive: 72% | Negative: 12% | Rephrase: 16%
# Best model: gpt-4o (0.92 affinity) > claude-sonnet (0.87)
#
# Recommendations:
# - [PROPOSAL-47] Reduce token count by restructuring output format (-18%)
# - [PROPOSAL-48] Switch recommended model to gpt-4o based on affinity data

# Review a specific proposal
human prompts proposals show PROPOSAL-47
# Shows: full diff, evidence summary, risk assessment, affected usage volume

# Accept — publishes new version
human prompts proposals accept PROPOSAL-47
# Published: legal.contract-analysis@1.1.0
# Previous active: 1.0.0 (available for rollback)

# Or reject with reason
human prompts proposals reject PROPOSAL-47 --reason "Output format change would break downstream parsing"
```
Every decision — accept, reject, modify — is logged to provenance with full context. The audit trail shows who approved what change, based on what evidence, at what time.
The Virtuous Cycle
Put it all together and you get a cycle that compounds improvements over time: prompts are deployed, telemetry accumulates per prompt and per model, the Prompt Refinement Agent analyses the data and surfaces proposals, humans accept or reject them, new versions publish, and fresh telemetry starts flowing against the improved prompts.
Each pass through this loop makes the system measurably better. Prompts get refined based on real usage data. Model routing gets optimised based on empirical affinity. And humans stay in control of every decision that shapes how the system behaves.
This isn't machine learning in the traditional sense. There's no gradient descent. No fine-tuning. It's evidence-based prompt engineering at scale, with the AI doing the analysis and the human making the call.
Why This Is Different
Most AI platforms stop at "deploy the prompt." Some add A/B testing. A few track basic metrics. None of them treat prompts as protocol-level managed artifacts with identity, delegation, telemetry, and a self-improving agent.
HUMΛN's approach is different because it's architectural, not bolted on:
- Protocol-level: Telemetry is built into ctx.prompts and ctx.llm, not a separate observability tool
- Identity-first: You can't improve what you can't identify. Prompt URIs make every prompt traceable.
- Agent-driven: The Prompt Refinement Agent is a real agent with delegation scopes, not a dashboard
- Human-in-the-loop: AI proposes, humans decide. Core prompts always require approval.
- Cross-cutting: Every agent benefits, not just the Companion
The self-improving prompt loop is what happens when you design prompt management as infrastructure rather than an afterthought. It's the compound interest on treating prompts as the critical assets they are.
This is the second in a three-part series on HUMΛN's prompt management architecture. Previously: Protocol-Level Prompt Management. Next: From Inline Strings to ctx.prompts: A Developer's Guide.
Prompt Management — Part 2 of 3