108. DEPLOYMENT MODELS & HOSTING STRATEGY

From Hosted to Self-Hosted Without Rewriting: The Zero-Regret Architecture


"Start on HUMAN-hosted in 5 minutes. Move to your own infra in 5 days, without rewriting your app."

That's the barrier killer.

This document defines how HUMAN supports every deployment model, from 10-person teams to regulated enterprises, through clean architectural boundaries and zero-regret migration paths.


THE DESIGN PRINCIPLE: ZERO-REGRET HOSTING

We keep our core invariants:

  • Keys live on devices (Passport rooted in hardware, not our cloud)
  • Cloud is coordination + proofs, not raw user data
  • Storage is pluggable behind clean adapters

The mantra becomes:

"Start on HUMAN-hosted in 5 minutes. Move to your own infra in 5 days, without rewriting your app."

This solves the fundamental tension:

  • SMBs need: "Please don't make me think about infra. Just make it work."
  • Enterprises need: "Cool idea, but it has to run in our VPC / DB / SIEM / whatever."

One architecture. Multiple deployment profiles. Same APIs. Same semantics.


WHAT HUMAN HOSTS VS WHAT WE REFUSE TO HOST

Think in 3 layers:

πŸ” Layer 0 – Devices (non-negotiable)

Passport keys live on:

  • iPhones, Macs, laptops, Android devices, etc.

We never host:

  • Root identity keys
  • Raw biometric / behavioral signals

We may host:

  • Public keys
  • Signed assertions ("this key belongs to Org X, role Y"), but not the keys themselves

This is structural: HUMAN cannot hold your identity even if we wanted to.


🧠 Layer 1 – HUMAN Control Plane (this is the "managed hosting" we do want)

This is the stuff we can run as HUMAN Cloud or you can self-host:

What Lives Here

  • Identity federation: Mapping "IdP user 123" → "Passport subject ABC"
  • Capability Graph: Roles, permissions, what a given human/agent is allowed to approve
  • HumanOS policy engine: "When an AI tries to do X, require human Y or escalate to group Z"
  • Attestation / ledger interface: Where "AI did X under these conditions with these humans" gets recorded

This is metadata about trust, not the org's documents/emails/PHI.
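
For example, the escalation rule quoted above ("when an AI tries to do X, require human Y or escalate to group Z") might be captured as policy metadata roughly like this; the field names are assumptions for illustration, not the shipped policy schema:

// Illustrative policy object: trust metadata only, no customer data.
const refundPolicy = {
  id: 'policy-high-value-refunds',
  trigger: { action: 'issue_refund', condition: 'amount_usd > 500' },  // "AI tries to do X"
  require: { approverRole: 'finance_manager' },                        // "human Y"
  escalation: { afterMinutes: 30, toGroup: 'finance_leads' },          // "group Z"
  attestation: { record: true },  // outcome anchored via the attestation / ledger interface
};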

Deployment Options

We:

  • Offer this as managed for SMB / fast starts (HUMAN Cloud)
  • Offer hybrid / self-hosted for enterprises that want it in their VPC

📦 Layer 2 – Data & Systems (we stay aggressively out of here)

Where the data lives:

  • Salesforce, Google Workspace, O365, Epic, internal DBs, S3 buckets, etc.

We don't become their database. We connect to these systems via:

  • OAuth / service accounts / private links

We only store:

  • IDs, hashes, and pointers needed for provenance and policy decisions

Result:

✅ Yes to hosting the trust fabric
❌ No to becoming their data warehouse
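
For illustration, the shape of what HUMAN does persist about a Layer 2 system, as a minimal sketch with assumed field names (not a shipped schema):

// Pointers and hashes only: enough for provenance and policy decisions, never the content itself.
interface ProvenancePointer {
  sourceSystem: string;   // e.g. 'salesforce', 'google_workspace', 's3'
  externalId: string;     // opaque record ID inside the customer's system
  contentHash: string;    // SHA-256 of the referenced content, for integrity checks
  uri: string;            // pointer back into the customer's system
  // Deliberately absent: document bodies, email content, PHI.
}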


THE THREE DEPLOYMENT PROFILES

We make this a first-class concept:

Profile 1: Hosted (SMB Default)

What it means:

Everything in HUMAN Cloud, except:

  • Keys on devices
  • Primary data in their SaaS tools

Who it's for:

  • 10–200 person companies
  • Teams without dedicated infra people
  • Organizations prioritizing speed over control

Setup time: 5 minutes

Monthly cost: Plan-based ($X–$XX per user + governed events)

Migration path: Can move to Hybrid or Self-hosted later with export/import tooling


Profile 2: Hybrid (Enterprise Common)

What it means:

HUMAN control plane in our cloud:

  • Policy engine
  • Capability Graph
  • Federation of identities

Data plane in their infra:

  • They run the ledger node(s)
  • They host any caches / sensitive stores

Connect via:

  • Outbound-only secure tunnels
  • Or VPC peering, depending on taste

Who it's for:

  • Mid-market to enterprise (200–10,000 employees)
  • Organizations that have infra teams but want operational simplicity
  • Companies with data residency requirements but flexible on control plane

Setup time: 1–2 days

Monthly cost: Platform fee + usage + optional support

Migration path: Can move to full Self-hosted when compliance requires it


Profile 3: Self-Hosted (Regulated / Gov)

What it means:

We give them:

  • Helm charts / Terraform / Ansible
  • Reference architecture
  • AI-powered installation automation (see "AI-Powered Installation Automation" section)
  • Compliance templates (HIPAA, FedRAMP, PCI-DSS, GDPR)

They run (self-hosted in their infrastructure):

  • HumanOS orchestration engine

    • Policy engine (escalation rules, safety boundaries)
    • Routing logic (capability-first task assignment)
    • Approval queue service
  • Capability Graph (org-scoped view)

    • Internal employee and agent capabilities
    • Skill tracking and growth
    • Capability attestations (org-namespaced)
  • Workforce Cloud Runtime (internal routing layer)

    • Agent-to-agent task orchestration
    • Agent-to-human escalation (employees or customers)
    • Capability-based task assignment within organization
    • Internal workflow execution engine
  • MARA Runtime (agent execution environment)

    • Agent pods and workloads
    • Agent registry service
    • Execution monitoring
  • Ledger nodes (local audit trail)

    • Provenance logs
    • Attestation storage
    • Immutable audit trail
    • Optional federation with HUMAN's public ledger
  • All storage (databases, object storage, caches)

    • PostgreSQL (policies, workflows, history)
    • Vector store (agent memory, capability embeddings)
    • Object storage (MinIO, S3)
    • Redis (caching, sessions)
  • Monitoring & observability stack

    • Prometheus, Grafana
    • Loki (log aggregation)
    • Tempo (distributed tracing)

They access optional HUMAN-hosted services via API:

  • Workforce Cloud Global Marketplace (optional, pay-per-task)

    • For routing to external trained humans beyond internal staff
    • 24/7 coverage, surge capacity, specialized expertise
    • Academy-trained workers globally
    • Pricing: Usage-based ($50 per escalation + $75 per human-hour)
    • Use cases: Overflow beyond org's staff, specialized skills, 24/7 ops
  • Academy Training Platform (free for individuals, volume pricing for enterprises)

    • Web-based access for employee training
    • Always free for displaced workers (Zero Barriers principle)
    • Enterprise bulk programs: $500/employee/year (volume discounts available)
    • Requirements: Internet connectivity (no self-hosted option by design)
    • Integration: SSO, custom learning paths, capability sync via API
    • See: KB 24 (Academy) for full deployment model
  • Global Capability Federation (optional, subscription)

    • Cross-org credential verification
    • Interoperability with other HUMAN deployments
    • Verify capabilities from external organizations
    • Pricing: $10K/year
  • Public Ledger Anchoring (included in platform license)

    • Global attestation root of trust
    • Distributed ledger for cross-org verification
    • Customer can run federated ledger nodes
    • Anchoring to HUMAN's public ledger for global validity

Deployment modes:

  • Air-Gapped (Full Isolation):

    • Runs WCR, MARA, HumanOS, Capability Graph fully offline
    • No access to Global Marketplace or Academy
    • Local-only ledger (no global verification)
    • Org-only capability tracking
    • Requires internal image registries, Helm mirrors

    Canonical note (Passport growth still happens): Even in full isolation, governed work events still update the local Capability Graph and create local attestations. The Passport evolves in-place via updated CapabilityGraphRoot (pointer to the head of the personal graph in the on-prem vault) and new LedgerRefs (attestation anchors on the on-prem ledger).

    Example data placement (air-gapped):

    personalGraphs:
      storage: on_prem_vault
    evidence:
      storage: on_prem_vault
    attestations:
      storage: on_prem_ledger
      federation:
        enabled: false
    
  • Hybrid (Internal + Global Services):

    • Self-hosted control plane + optional HUMAN Cloud services
    • Internal routing for org's tasks
    • API access to Global Marketplace for overflow
    • Employee access to Academy for training
    • Most common for regulated enterprises

Who it's for:

  • Regulated industries (healthcare, finance, government)
  • Organizations with strict data sovereignty requirements
  • Companies with mature platform engineering teams
  • Air-gapped environments (defense, intelligence)
  • Multi-national corporations with data residency compliance

Setup time:

  • With AI installation assistant: 5-15 minutes (automated)
  • With intelligent CLI: 1-2 hours (semi-automated)
  • Manual Helm/Terraform: 1-2 weeks (depends on infra complexity)

Monthly cost:

  • Platform license: $30K-$150K+/year (based on scale, see KB 34)
  • Support contract: Included (24/7 for Enterprise Elite)
  • Optional services: Usage-based (Workforce Cloud, Academy bulk, Federation)
  • Infrastructure costs: Customer responsibility (compute, storage, networking, AI tokens)

Migration path: This is the end state; no further migration needed


SELF-HOSTED SECURITY BOUNDARIES

Overview

Self-hosted deployments provide maximum control and data sovereignty, but they do not grant identity minting authority or bypass cryptographic safeguards.

This section explicitly defines what self-hosted infrastructure CAN and CANNOT do, and why infrastructure compromise doesn't threaten human sovereignty.

Key Principle: Trust derives from cryptography, not operational control.


What Self-Hosted Infrastructure CAN Do

✅ Identity Verification

  • Verify Passport signatures cryptographically
  • Validate DID resolution and key ownership
  • Check delegation chains for authenticity
  • Verify attestation signatures

Why Safe: Verification requires only public keys. No private key access needed.
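
A minimal sketch of that property, assuming Ed25519 signatures and Node's built-in crypto module (not the HUMAN SDK): verifying a signed assertion needs only the signer's public key.

import { createPublicKey, verify } from 'node:crypto';

// Returns true if `signature` over `assertionBytes` matches the given public key.
// For Ed25519, Node's verify() takes `null` as the algorithm argument.
function verifyAssertion(assertionBytes: Buffer, signature: Buffer, publicKeyPem: string): boolean {
  return verify(null, assertionBytes, createPublicKey(publicKeyPem), signature);
}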


✅ Org-Scoped Attestations

  • Issue attestations within organizational namespace
  • Attest to employment, roles, permissions within the org
  • Sign attestations with org's private key (held in org HSM)

Why Safe: Org attestations are namespaced. They don't affect other organizations or create global identity.


✅ Policy Engine & Agent Runtime Hosting

  • Run HumanOS policy engine
  • Host agent execution environments
  • Enforce escalation rules and safety boundaries
  • Route tasks based on capability requirements

Why Safe: Policy enforcement is read-only verification. Cannot override cryptographic constraints.


✅ Org and Agent Key Custody

  • Hold Org Passport keys in organizational HSMs
  • Custody Agent Passport keys under policy constraints
  • Manage agent delegation certificates

Why Safe: These are delegated identities, not sovereign identities. They derive authority from humans, not from infrastructure.


What Self-Hosted Infrastructure CANNOT Do

These constraints are cryptographically enforced, not policy-based. Violating them renders the deployment non-compliant with the HUMAN Protocol.

❌ Server-Side Human Passport Creation

Forbidden:

  • Minting Human Passports on servers
  • Generating human identity keys in infrastructure
  • Creating "admin" identities that impersonate humans

Why Forbidden:

  • Human Passports MUST be created on-device (Secure Enclave, TEE, hardware key)
  • Private keys MUST NEVER leave the device
  • Only devices can prove human presence (biometric, passkey)

Technical Enforcement:

  • Device attestation required for Human Passport minting
  • Ledger rejects Human Passports without device signature
  • Other deployments reject server-minted identities

Result: Self-hosted infrastructure physically cannot create human identities that other systems will accept.
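
A hedged sketch of that acceptance rule from the verifier's side; the types are illustrative, but the check mirrors the enforcement points above: no device attestation, no Human Passport.

import { createPublicKey, verify } from 'node:crypto';

interface PassportRegistration {
  did: string;
  publicKey: string;
  deviceAttestation?: {
    payload: Buffer;             // what the device signed (e.g. the new public key)
    signature: Buffer;           // produced inside the Secure Enclave / TEE / hardware key
    devicePublicKeyPem: string;  // rooted in the device attestation chain
  };
}

// Ledger-side check: reject registrations that lack a verifiable device attestation.
function acceptHumanPassport(reg: PassportRegistration): boolean {
  const att = reg.deviceAttestation;
  if (!att) return false; // server-minted identity: no device proof, no acceptance
  return verify(null, att.payload, createPublicKey(att.devicePublicKeyPem), att.signature);
}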


❌ Admin-Minted "Human" Identities

Forbidden:

  • Admins creating "human" accounts for convenience
  • Shared credentials representing multiple humans
  • Service accounts masquerading as humans

Why Forbidden:

  • Violates identity sovereignty (humans own their identity)
  • Breaks provenance (can't distinguish human from admin action)
  • Creates liability (who is responsible for actions?)

Technical Enforcement:

  • Human Passports require device-rooted keys
  • Capability Graph rejects capability updates from non-device sources
  • Attestations require human signature, not admin signature

Result: Admin convenience cannot override identity architecture.


❌ Shared or Pooled Human Signing Keys

Forbidden:

  • Multiple humans sharing one private key
  • "Team" identities with shared credentials
  • Delegating human signing authority to infrastructure

Why Forbidden:

  • Destroys accountability (who signed this?)
  • Breaks provenance chain (no attribution)
  • Enables impersonation (anyone with key = "you")

Technical Enforcement:

  • Each human has unique DID and keypair
  • Private keys never exported from device
  • Signature verification checks specific DID

Result: Infrastructure cannot hold human private keys, even if admins request it.


❌ Identity Recovery Performed by Infrastructure

Forbidden:

  • Admins "recovering" human identity without human authorization
  • Resetting human private keys from servers
  • Backdoor recovery mechanisms

Why Forbidden:

  • Recovery without human = identity theft
  • Breaks trust (infrastructure can impersonate)
  • Creates legal liability (unauthorized access)

Technical Enforcement:

  • Recovery requires guardian quorum (other humans, not servers)
  • Recovery process uses threshold cryptography (no single point of failure)
  • Ledger logs all recovery attempts

Result: Only humans (via guardian network) can recover human identity. Infrastructure cannot override.


❌ Silent Identity Creation or Modification

Forbidden:

  • Creating identities without human approval
  • Modifying identity records without signed consent
  • "Backdating" identity changes

Why Forbidden:

  • Violates consent (humans must approve)
  • Breaks provenance (no audit trail)
  • Enables fraud (who made this change?)

Technical Enforcement:

  • All identity changes require signature from identity owner
  • Ledger anchors record creation and modification timestamps
  • Unsigned changes rejected by protocol

Result: Infrastructure cannot modify identity, even with "good intentions."


Breach Blast Radius Analysis

Understanding what an attacker gains by compromising different deployment types:

Hosted Profile Breach (HUMAN Cloud Compromise)

What Attacker Gains:

  • Disruption of service (DoS)
  • Metadata about API usage (traffic patterns)
  • Ability to issue fake attestations (rejected by verification)

What Attacker CANNOT Gain:

  • Human private keys (never stored server-side)
  • Ability to mint Human Passports (device-only)
  • Ability to impersonate humans (no private keys)
  • Ledger modification (distributed, immutable)

Customer Impact:

  • Hosted customers: Service interruption (failover to backup region)
  • Self-hosted customers: Zero impact (independent deployments)

Mitigation:

  • Multi-region active-active (automatic failover)
  • Keys on devices (zero server-side exposure)
  • Ledger distribution (no single point of truth)

Hybrid Profile Breach (Data Plane Compromise)

What Attacker Gains:

  • Access to customer's ledger nodes (can disrupt sync)
  • Access to org attestations (can view org-specific data)
  • Potential ability to issue fake org attestations (namespaced)

What Attacker CANNOT Gain:

  • Human private keys (on devices)
  • Ability to mint Human Passports (device-only)
  • Access to other orgs' data (namespace isolation)
  • Ability to override human decisions (cryptographically enforced)

Customer Impact:

  • Affected customer: Must revoke org key and re-issue attestations
  • Other customers: Zero impact (namespace isolation)
  • Humans: Zero impact (keys on devices)

Mitigation:

  • Org key revocation via ledger broadcast
  • Namespace isolation prevents cross-org contamination
  • Audit trail reveals all actions during compromise window

Self-Hosted Profile Breach (Full Infrastructure Compromise)

What Attacker Gains:

  • Full access to org's deployment (database, services, keys)
  • Ability to issue fake org-scoped attestations
  • Ability to disrupt org's operations
  • Metadata about org's agent usage

What Attacker CANNOT Gain:

  • Human private keys (on devices, not in infrastructure)
  • Ability to mint Human Passports (device-only)
  • Ability to impersonate humans in other orgs (namespace isolation)
  • Ability to modify capability records for humans (requires human signature)
  • Access to distributed ledger state (replicated across network)

Customer Impact:

  • Affected org: Must revoke org key, rebuild infrastructure
  • Other orgs: Zero impact (namespace isolation)
  • Humans: Identity intact (keys on devices)

Mitigation:

  • Human keys never in infrastructure (zero exposure)
  • Org key revocation invalidates all attestations
  • Other orgs' verification rejects compromised attestations
  • Humans can revoke consent and move to new org deployment

Critical Insight: Even complete infrastructure compromise doesn't grant attacker human identity authority.


Open Source Safety Guarantees

Q: Can self-hosted customers modify HUMAN code to bypass these restrictions?

A: No. Protocol compliance is mathematically enforced, not code-enforced.

Why Code Modification Doesn't Grant Authority

  1. Cryptographic Verification is Protocol-Level

    • Even modified code must verify Ed25519 signatures
    • Invalid signatures are rejected by other nodes
    • Forked implementations cannot interoperate without compliance
  2. Network Effects Enforce Standards

    • Distributed ledger rejects non-compliant attestations
    • Other deployments ignore invalid signatures
    • Humans choose which implementations to trust
  3. Device Keys are the Source of Truth

    • Human identity keys live on devices, not in code
    • Infrastructure verifies signatures, doesn't create them
    • Modified infrastructure cannot access device keys
  4. Interoperability Requires Compliance

    • Non-compliant forks cannot participate in ledger
    • Attestations from non-compliant deployments are rejected
    • Enterprise customers lose certification

Example Attack (Why It Fails):

// Malicious self-hosted deployment tries to mint a human identity
// (illustrative pseudocode: `db` and `generateKeyPair` are hypothetical helpers)
async function evilAdminMintHuman() {
  const fakePassport = {
    did: 'did:human:evil-admin-123',
    publicKey: generateKeyPair().publicKey,
    // ... other fields
  };
  
  await db.passports.insert(fakePassport);
  
  // ❌ This fails because:
  // 1. No device attestation (requires Secure Enclave signature)
  // 2. DID not registered on distributed ledger
  // 3. Cannot sign with fake private key (device holds real key)
  // 4. Other deployments reject attestations from this DID
  // 5. Humans won't trust this "passport" (no provenance)
  
  return fakePassport; // Locally stored, but useless
}

Result: Modified code can create database records, but not valid identities.


Comparison: Self-Hosted HUMAN vs Self-Hosted Traditional Identity

| Dimension | Traditional IdP (Okta, Auth0) | HUMAN Self-Hosted |
|---|---|---|
| Identity Creation | Admin creates users | Only devices create humans |
| Key Storage | Server-side (HSM) | Device-only (Secure Enclave) |
| Admin Override | Admins can reset passwords | Admins cannot access human keys |
| Impersonation Risk | Admin can impersonate users | Cryptographically impossible |
| Compromise Blast Radius | All users (admin has master keys) | Org only (humans unaffected) |
| Recovery | Admin initiates | Guardian quorum (other humans) |
| Portability | Vendor lock-in | Globally portable DID |

Why This Matters:

Traditional self-hosted identity gives administrators god-mode access. HUMAN self-hosted gives administrators operational control without identity authority.

This is the architectural innovation that makes self-hosted deployments safe at scale.


Compliance Statement

For regulated industries:

Self-hosted HUMAN deployments comply with:

  • HIPAA (patient identity sovereignty)
  • GDPR (data subject rights, right to portability)
  • eIDAS (qualified electronic signatures)
  • SOC 2 (cryptographic key management)
  • Zero Trust Architecture (continuous verification, no implicit trust)

Certification:

Self-hosted deployments that violate identity minting rules:

  • Lose HUMAN Protocol certification
  • Cannot interoperate with HUMAN Cloud or other compliant deployments
  • Lose vendor support and updates
  • Risk regulatory non-compliance

Audit Trail:

All self-hosted deployments must:

  • Log all identity verification events
  • Maintain provenance for all attestations
  • Participate in distributed ledger (or run private ledger node)
  • Submit to periodic compliance audits (for certification)

Why This Architecture Wins

For Enterprises:

  • Self-hosting gives control without creating liability
  • Infrastructure compromise doesn't expose human identity
  • Clear blast radius (org only, not global)
  • Regulatory compliance by design

For Humans:

  • Identity sovereignty maintained even in self-hosted deployments
  • Can leave org without losing identity
  • Cannot be impersonated by admins
  • Portable across all deployment types

For HUMAN:

  • Open source doesn't compromise security
  • Self-hosted doesn't fragment protocol
  • Network effects reinforce standards
  • Trust derives from cryptography, not vendor control

Result: Self-hosted deployments are safe, compliant, and strategically valuable, not a security liability.


AI-POWERED INSTALLATION AUTOMATION

The Problem with Traditional Self-Hosting:

  • 1-2 weeks to deploy
  • Requires Kubernetes expertise
  • 1266 lines of manual YAML
  • High error rate, high support burden
  • Blocks SMBs from self-hosting

HUMAN's Solution: Installation as a Conversation

Self-hosted HUMAN installs in 5-15 minutes through three automated paths (plus a manual path for advanced cases):

Installation Path 1: Companion Installer (Conversational)

Natural language installation for technical decision-makers:

User: "I want to self-host HUMAN in our AWS VPC for 50 agents with HIPAA compliance"

Companion Installer:

  • Detects environment (AWS EKS, RDS available, VPC config)
  • Asks 5 clarifying questions (HA requirements, air-gap, integrations)
  • Generates optimal configuration (capacity planning, compliance hardening)
  • Executes automated installation with human approval at critical steps
  • Validates deployment health
  • Provides dashboard access + admin credentials

Time: 8-15 minutes
Human involvement: Approve 3-5 critical decisions
Expertise required: Understand business requirements (not YAML)

Installation Path 2: Intelligent CLI

For engineers who prefer CLI:

$ npx @human/installer init

πŸ” Detecting environment...
   βœ… AWS EKS cluster detected (us-east-1)
   βœ… kubectl configured (v1.28)
   βœ… PostgreSQL RDS available

📋 Configuration wizard (5 questions):
   ? Agent capacity: 50 agents
   ? High availability: Yes (multi-AZ)
   ? Compliance: HIPAA
   ? Air-gapped: No

πŸ—οΈ Installing HUMAN...
   [Progress bars for each component]

✨ Complete (12m 34s)

Time: 10-20 minutes
Human involvement: Answer 5 questions
Expertise required: Basic cloud/k8s familiarity

Installation Path 3: Cloud Marketplace

One-click deployment for enterprises:

  • AWS Marketplace → Click "Launch" → HUMAN deployed in 10 minutes
  • GCP Marketplace → Same experience
  • Azure Marketplace → Same experience

Time: 5-10 minutes (fully automated)
Human involvement: Click "Subscribe"
Expertise required: None

Installation Path 4: Manual (Advanced)

For maximum customization or air-gapped environments with no installer access:

  • Follow detailed implementation spec: setup/agent_deployment_selfhosted_spec.md
  • Manual Helm/kubectl commands
  • Full control over every configuration detail

Time: 1-2 weeks
Human involvement: Full manual configuration
Expertise required: Deep Kubernetes/infrastructure knowledge


How AI-Powered Installation Works

Environment Detection:

  • Cloud provider (AWS, GCP, Azure, bare metal)
  • Kubernetes version and capabilities
  • Existing infrastructure (databases, storage, monitoring)
  • Network configuration (VPC, subnets, security groups)
  • Compliance posture (encryption, audit logs)

Configuration Generation:

  • Capacity planning (CPU, memory, storage based on agent count)
  • Compliance templates (HIPAA, FedRAMP, PCI hardening)
  • High availability (multi-AZ, failover, backup)
  • Cost optimization (minimum viable resources)
  • Security best practices (CIS benchmarks, zero-trust)

Automated Installation:

  • Pre-flight validation (capacity, permissions, connectivity)
  • Kubernetes resource creation (namespaces, deployments, services)
  • Database schema migration
  • Secrets management (encryption at rest)
  • Network policies (zero-trust networking)
  • Monitoring stack deployment (Prometheus, Grafana)

Post-Install Validation:

  • All pods healthy
  • Database connectivity
  • API responsiveness
  • Agent registration works
  • Storage accessible
  • Monitoring operational

Human-in-the-Loop:

  • Approve critical decisions (database connection, secrets creation)
  • Review generated configuration before apply
  • Escalation on errors (with remediation suggestions)
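
A hedged sketch of two of the pre-flight checks named above (cluster connectivity, RBAC permissions); the check list and function name are illustrative, not the installer's actual implementation.

import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const exec = promisify(execFile);

interface CheckResult { name: string; ok: boolean }

// Pre-flight: can we reach the cluster, and do we have rights to create namespaces?
async function preflight(): Promise<CheckResult[]> {
  return [
    {
      name: 'cluster reachable',
      ok: await exec('kubectl', ['cluster-info']).then(() => true, () => false),
    },
    {
      name: 'can create namespaces',
      ok: await exec('kubectl', ['auth', 'can-i', 'create', 'namespaces'])
        .then(({ stdout }) => stdout.trim() === 'yes', () => false),
    },
  ];
}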

Why This Matters

Before AI-powered installation:

  • Self-hosting = enterprise-only (requires dedicated ops team)
  • SMBs blocked from data sovereignty
  • High support burden for HUMAN
  • Slow adoption, high friction

After AI-powered installation:

  • Self-hosting = accessible to SMBs
  • 5-15 minute setup (vs 1-2 weeks)
  • 95% success rate (vs ~60% manual)

  • Low support burden
  • Fast adoption, low friction

This is Living HAIO: AI agents installing and configuring AI agent infrastructure.

Status: Vision documented (this PRD). Implementation: Q1 2026.
See: KB 50 (Human Agent Design) for agent architecture.


SELF-HOSTED ENTERPRISE REQUIREMENTS

Licensing & Enforcement

License Types:

| License Type | Annual Price | Agent Limit | Support Level | Use Case |
|---|---|---|---|---|
| Development | $0 | 5 | Community | Testing, staging environments |
| Production (Node-Locked) | $30,000 | 200 | Standard | Single datacenter deployment |
| Production (Floating) | $50,000 | 200 | Standard | Multi-datacenter with failover |
| Enterprise (Unlimited) | $100,000+ | Unlimited | Enterprise + TAM | Global deployments, MSPs |

Enforcement Mechanism:

  • License key validated on control plane startup
  • Cryptographic signature verification
  • Phone-home validation (once per 24hr, optional for air-gapped)
  • Grace period: 30 days after expiry (with warnings)
  • Air-gapped: Offline license validation via signed JWT

License Renewal:

  • Automated renewal reminders (90, 60, 30, 7 days)
  • Zero-downtime renewal (hot-swap license keys)
  • Volume discounts for multi-year contracts

Support & Service Level Agreements

Support Tiers:

| Severity | Response Time | Resolution Target | Channels | Included In |
|---|---|---|---|---|
| P0 (System Down) | <1 hour | <4 hours | Phone, Slack, Email | Enterprise |
| P1 (Critical Impact) | <4 hours | <24 hours | Slack, Email | Standard+ |
| P2 (Moderate Impact) | <8 hours | <3 days | Email, Portal | Standard+ |
| P3 (Low Impact) | <24 hours | <7 days | Portal | All (incl. Community) |

Support Access Requirements:

  • Standard: Business hours (9-5 local time), email + portal
  • Enterprise: 24/7, dedicated Slack channel, phone, TAM assigned
  • Community: Forums, GitHub issues, community Slack (best-effort)

Enterprise Support Add-Ons:

  • Technical Account Manager (TAM): +$10k/year
  • Professional Services: $250/hour
  • Onsite Training: $2k/person (2-day workshop)
  • Compliance Certification Support: $15k/year (HIPAA, FedRAMP guidance)

Total Cost of Ownership (TCO) Analysis

TCO Comparison: Hosted vs Self-Hosted (50 agents, 3 years)

| Cost Component | HUMAN-Hosted | Self-Hosted |
|---|---|---|
| Software License | $0 (usage-based) | $30k/yr × 3 = $90k |
| Infrastructure | Included | $3.2k/mo × 36 = $115k |
| Operational Labor | Included | 0.5 FTE × 3yr = $180k |
| Support | Included | $0 (Standard included) |
| Upgrades | Automated | Included |
| Total (3yr) | ~$180k | ~$385k |

Break-Even Analysis:

  • Self-hosted TCO higher for <100 agents
  • Break-even at ~150-200 agents (3-year horizon)
  • Self-hosted wins for >200 agents OR data sovereignty required

When Self-Hosted Makes Sense:

  • Regulated industries (HIPAA, FedRAMP, PCI)
  • Air-gapped environments (defense, classified)
  • Data sovereignty requirements (EU, China, government)
  • Very high scale (>200 agents)
  • Existing infrastructure (sunk costs in datacenter)

When Hosted Makes Sense:

  • Small deployments (<50 agents)
  • Fast time-to-value (no infrastructure burden)
  • Variable workloads (pay-as-you-go)
  • No ops team available

Reference Architectures

Small Enterprise (5-20 agents):

  • Kubernetes: 3 nodes, 4vCPU, 8GB each
  • Database: PostgreSQL (8vCPU, 32GB, Multi-AZ)
  • Storage: 500GB SSD
  • Estimated cost: $1,050/month infrastructure + $30k/yr license

Medium Enterprise (20-100 agents):

  • Kubernetes: 10 nodes, 8vCPU, 16GB each
  • Database: PostgreSQL (16vCPU, 64GB, Multi-AZ + replicas)
  • Storage: 2TB SSD
  • Estimated cost: $3,200/month infrastructure + $50k/yr license

Large Enterprise (100-500 agents):

  • Kubernetes: 30 nodes, 16vCPU, 32GB each
  • Database: PostgreSQL (32vCPU, 128GB, Multi-AZ + read replicas)
  • Storage: 10TB SSD
  • Multi-region deployment (primary + DR)
  • Estimated cost: $12k/month infrastructure + $100k/yr license

Global Deployment (500+ agents, multi-region):

  • Kubernetes: 100+ nodes across 3+ regions
  • Database: Distributed PostgreSQL (CitusDB or similar)
  • Storage: 50TB+ distributed
  • Multi-cloud (AWS + Azure for resilience)
  • Estimated cost: $50k+/month infrastructure + custom licensing

COMPLIANCE READINESS FOR SELF-HOSTED

HIPAA Compliance

HUMAN provides:

  • Encryption at rest (database, storage, secrets)
  • Encryption in transit (TLS 1.3)
  • Audit logging (all access, all actions)
  • Access controls (RBAC, MFA)
  • Business Associate Agreement (BAA) template

Customer responsible for:

  • Administrative safeguards (policies, training)
  • Physical safeguards (datacenter security)
  • Technical safeguards (network security, backups)

HIPAA-Specific Configuration:

compliance:
  hipaa:
    enabled: true
    auditLogging:
      retention: 6years  # HIPAA requirement
      immutable: true
    encryption:
      algorithm: AES-256-GCM
      keyRotation: 90days
    accessControls:
      mfaRequired: true
      sessionTimeout: 15min

HIPAA Checklist: See compliance document docs/compliance/self-hosted-checklists.md


FedRAMP Compliance (Moderate Baseline)

HUMAN provides:

  • Automated compliance configuration templates
  • Control implementation documentation
  • Continuous monitoring dashboards
  • Incident response runbooks

Customer responsible for:

  • Full FedRAMP authorization package
  • Third-party assessment organization (3PAO) audit
  • Continuous monitoring (ConMon) program

FedRAMP Support:

  • HUMAN can provide FedRAMP compliance support: $15k/year
  • Includes: Control mapping, documentation templates, audit support

Note: Full FedRAMP authorization is a 12-18 month process. HUMAN provides technical controls; customer owns authorization.


PCI-DSS Compliance

Applicable if: Processing, storing, or transmitting cardholder data

HUMAN provides:

  • Network segmentation (Kubernetes network policies)
  • Encrypted storage and transmission
  • Access control and logging
  • Vulnerability management guidance

Customer responsible for:

  • PCI-DSS compliance validation (QSA or SAQ)
  • Cardholder data environment (CDE) segmentation
  • Regular penetration testing

GDPR Compliance

HUMAN provides:

  • Data portability (export APIs)
  • Right to erasure (deletion APIs)
  • Data processing agreements (DPA)
  • Privacy-by-design architecture

Customer responsible for:

  • Lawful basis for processing
  • Data subject consent management
  • Data protection impact assessments (DPIA)
  • GDPR compliance program

AIR-GAPPED OPERATIONS (EXTENDED)

Update Distribution Methods

For environments with no external connectivity:

Method 1: USB Transfer

  • Download update bundle from HUMAN portal (authenticated)
  • Transfer via USB to air-gapped environment
  • Verify cryptographic signature
  • Apply via installer CLI

Method 2: Secure FTP

  • HUMAN pushes updates to customer-controlled SFTP
  • Customer pulls to air-gapped environment
  • Signature verification required

Method 3: Courier (High-Security)

  • Physical media shipment for classified environments
  • Tamper-evident packaging
  • Chain-of-custody documentation

Update Bundle Contents

human-v1.2.0-airgapped.tar.gz (signed)
├── helm-charts/           # Versioned Helm charts
├── container-images/      # All Docker images (no registry pulls)
├── database-migrations/   # SQL migration scripts
├── installer-cli/         # Offline installer binary
├── license-validator/     # Offline license validation
├── checksums.txt          # SHA256 of all files
└── signature.sig          # GPG signature for verification

Local LLM Integration

For air-gapped environments requiring AI capabilities:

Supported Local LLM Providers:

  • Ollama (easiest setup)
  • vLLM (high performance)
  • LocalAI (model-agnostic)

Configuration:

llm:
  provider: ollama
  endpoint: http://ollama.internal:11434
  model: llama2:70b
  airgapped: true
  fallback: none  # No external API calls

Model Distribution:

  • Models included in air-gapped bundle OR
  • Customer downloads separately and transfers

Air-Gapped Certificate Management

Challenge: No external Certificate Authority (CA) access

Solution: Internal CA

tls:
  ca: internal
  certPath: /etc/human/certs/
  keyPath: /etc/human/keys/
  renewalStrategy: manual  # No ACME in air-gapped

Process:

  1. Generate internal CA (one-time)
  2. Issue certificates for HUMAN components
  3. Distribute CA cert to all clients
  4. Manual renewal before expiry (alerts at 30/60/90 days)

Offline License Validation

Standard licensing: Phone-home validation (24hr interval)

Air-gapped licensing: Signed JWT with long expiry

license:
  type: airgapped
  key: <signed-jwt-with-6month-expiry>
  validation: offline
  renewal: manual  # Requires new JWT from HUMAN
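
A minimal sketch of the offline check, assuming an EdDSA-signed license JWT and the jose library; the issuer value and claim names are assumptions, not the shipped license format.

import { importSPKI, jwtVerify } from 'jose';

// Validates the license entirely offline: signature and expiry against HUMAN's public key.
async function validateAirgappedLicense(licenseJwt: string, humanLicensePubKeyPem: string) {
  const key = await importSPKI(humanLicensePubKeyPem, 'EdDSA');
  const { payload } = await jwtVerify(licenseJwt, key, {
    issuer: 'human-licensing',   // assumed issuer claim
    clockTolerance: '30 days',   // mirrors the 30-day grace period after expiry
  });
  return { deploymentId: payload.sub, expiresAt: payload.exp }; // no network call required
}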

Renewal process:

  1. Generate renewal request (includes deployment ID)
  2. Transfer request to connected environment
  3. Submit to HUMAN portal
  4. Receive new signed JWT
  5. Transfer back to air-gapped environment
  6. Apply new license (zero-downtime)

Fallback & Degraded Mode

If critical services unavailable in air-gapped:

  • LLM unavailable → Route to human-only workflow
  • Monitoring unavailable → Local logging only
  • License validation unavailable → Grace period (30 days warning)

Principle: The system remains operational and degrades gracefully
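
A minimal sketch of that routing decision, with illustrative names only:

type Route = 'ai_with_human_oversight' | 'human_only';

// If the local LLM is unreachable, keep the system running by sending work
// straight to the human approval queue instead of failing the task.
function routeTask(llmAvailable: boolean): Route {
  return llmAvailable ? 'ai_with_human_oversight' : 'human_only';
}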


PERFORMANCE BENCHMARKS & CAPACITY PLANNING

Capacity Planning Formulas

Kubernetes Nodes:

nodes_required = ceil(agent_count / 10)  # 10 agents per node (8vCPU, 16GB)
+ 3  # Control plane nodes (HA)
+ 2  # Monitoring nodes

Database:

db_cpu = max(8, agent_count / 25)  # 1 vCPU per 25 agents
db_memory_gb = max(32, agent_count * 0.5)  # 500MB per agent
db_storage_gb = max(100, agent_count * 2)  # 2GB per agent (logs, history)

Redis:

redis_memory_gb = agent_count * 0.1  # 100MB per agent (session cache)

Network Bandwidth:

bandwidth_mbps = agent_count * 5  # 5 Mbps per active agent

Example: 200 Agent Deployment

Kubernetes:

  • Nodes: ceil(200/10) + 3 + 2 = 25 nodes (8vCPU, 16GB each)
  • Total: 200 vCPU, 400GB RAM

Database:

  • CPU: max(8, 200/25) = 8 vCPU
  • Memory: max(32, 200*0.5) = 100GB
  • Storage: max(100, 200*2) = 400GB

Redis:

  • Memory: 200*0.1 = 20GB (clustered)

Network:

  • Bandwidth: 200*5 = 1 Gbps

Estimated Infrastructure Cost:

  • ~$8k/month (AWS pricing)
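
As a sketch, the formulas above can be folded into one helper (illustrative, not shipped tooling); running it with an agent count of 200 reproduces the worked example.

// Capacity planning per the formulas above.
function planCapacity(agentCount: number) {
  const workerNodes = Math.ceil(agentCount / 10);      // 10 agents per 8vCPU/16GB node
  return {
    nodes: workerNodes + 3 + 2,                        // + control plane (HA) + monitoring
    dbCpu: Math.max(8, agentCount / 25),               // 1 vCPU per 25 agents
    dbMemoryGb: Math.max(32, agentCount * 0.5),        // 500MB per agent
    dbStorageGb: Math.max(100, agentCount * 2),        // 2GB per agent
    redisMemoryGb: agentCount * 0.1,                   // 100MB per agent
    bandwidthMbps: agentCount * 5,                     // 5 Mbps per active agent
  };
}

// planCapacity(200) => 25 nodes, 8 dbCpu, 100GB DB RAM, 400GB DB storage, 20GB Redis, 1,000 Mbps (~1 Gbps).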

Performance Targets

| Metric | Target | Measurement |
|---|---|---|
| API Latency (p50) | <100ms | Time from request to response |
| API Latency (p99) | <500ms | 99th percentile |
| Agent Registration | <5s | Time to register new agent |
| Task Assignment | <2s | Time from task creation to assignment |
| Database Query (p95) | <50ms | 95th percentile query time |
| Failover Time | <30s | Primary node failure to recovery |
| Throughput | 10k req/s | Sustained request rate (per region) |

Load Testing Recommendations

Before production launch:

# Install k6 load testing tool
$ helm install k6-operator k6/k6-operator

# Run load test (simulates 100 agents)
$ k6 run --vus 100 --duration 30m load-test.js

Checks:
  ✅ API latency p95 < 200ms
  ✅ Error rate < 0.1%
  ✅ Database connections stable
  ✅ Memory usage < 80%
  ✅ CPU usage < 70%

Performance Tuning

Database Tuning (PostgreSQL):

# postgresql.conf: increase connection pool
max_connections = 500

# postgresql.conf: optimize for a read-heavy workload
shared_buffers = 8GB
effective_cache_size = 24GB

Kubernetes Tuning:

# Horizontal Pod Autoscaling for the API deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: human-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: human-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Redis Tuning:

# Redis cluster mode for >10GB data
redis:
  cluster:
    enabled: true
    nodes: 6  # 3 masters + 3 replicas
    maxmemory: 20gb
    maxmemory-policy: allkeys-lru

Monitoring Key Metrics

Infrastructure:

  • CPU utilization (target: <70%)
  • Memory utilization (target: <80%)
  • Disk IOPS (target: <80% capacity)
  • Network throughput

Application:

  • API request rate
  • API error rate (target: <0.1%)
  • API latency (p50, p95, p99)
  • Database query time
  • Cache hit rate (target: >90%)

Business:

  • Active agents
  • Tasks completed per hour
  • Agent utilization rate
  • Escalation rate

MULTI-TENANCY IN SELF-HOSTED DEPLOYMENTS

Use Cases

Managed Service Providers (MSPs):

  • MSP operates single HUMAN deployment
  • Serves multiple client organizations
  • Full isolation between clients

System Integrators (SIs):

  • SI deploys HUMAN for multiple divisions/subsidiaries
  • Shared infrastructure, isolated data

Holding Companies:

  • Parent company runs HUMAN
  • Subsidiaries use as tenants
  • Centralized billing, distributed usage

Architecture: Namespace Isolation

graph TB
    subgraph HUMAN_Control_Plane [HUMAN Control Plane]
        TenantRouter[Tenant Router]
    end
    
    subgraph Tenant_A [Tenant A: Acme Corp]
        NS_A[Namespace: tenant-acme]
        Agents_A[Agents 1-50]
        DB_A[Database Schema: acme]
    end
    
    subgraph Tenant_B [Tenant B: GlobalCo]
        NS_B[Namespace: tenant-globalco]
        Agents_B[Agents 1-100]
        DB_B[Database Schema: globalco]
    end
    
    TenantRouter --> NS_A
    TenantRouter --> NS_B
    NS_A --> Agents_A
    NS_A --> DB_A
    NS_B --> Agents_B
    NS_B --> DB_B

Isolation Guarantees

Network Isolation:

  • Kubernetes NetworkPolicy (deny-all by default)
  • Traffic between tenants blocked
  • Ingress only via tenant-specific endpoints

Compute Isolation:

  • Separate namespaces per tenant
  • ResourceQuotas enforced (CPU, memory, pods)
  • No shared pods between tenants

Data Isolation:

  • Separate database schemas per tenant
  • Row-level security (RLS) for shared tables
  • Encryption keys unique per tenant

Access Isolation:

  • Separate RBAC policies per tenant
  • Tenant admins cannot access other tenants
  • MSP admin has cross-tenant visibility (audit only)

Kubernetes Configuration

Namespace per Tenant:

apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
  labels:
    tenant-id: acme
    msp-managed: "true"

ResourceQuota:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "50"      # 50 vCPU
    requests.memory: 100Gi  # 100GB RAM
    pods: "100"             # Max 100 pods
    persistentvolumeclaims: "10"

NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: tenant-acme
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          tenant-id: acme  # Only same tenant

Licensing for Multi-Tenant MSPs

MSP License:

  • Unlimited tenants
  • Agent count = sum across all tenants
  • Pricing: $100k/yr base + $200/agent/yr

Example:

  • MSP serves 10 clients
  • Total agents: 500
  • Cost: $100k + (500 × $200) = $200k/yr

Alternative: Per-Tenant Licensing

  • Each tenant purchases own license
  • MSP provides infrastructure only
  • HUMAN bills tenants directly

Security Considerations

MSP Responsibilities:

  • Network segmentation enforcement
  • Resource quota management
  • Monitoring and alerting (per-tenant dashboards)
  • Backup and disaster recovery (tenant data isolated)

Tenant Responsibilities:

  • Application-level access control (who can use agents)
  • Compliance with regulations (HIPAA, etc.)
  • Agent configuration and management

HUMAN's Role:

  • Provide secure multi-tenant architecture
  • License enforcement (per tenant)
  • Support MSP and tenants (tiered support model)

ENTERPRISE INTEGRATION PATTERNS

Identity Federation

Supported Protocols:

| Protocol | Use Case | Complexity | Recommended For |
|---|---|---|---|
| SAML 2.0 | Enterprise SSO | Medium | Large enterprises, government |
| OAuth2/OIDC | Modern apps | Low | Tech companies, SaaS |
| LDAP/AD | Legacy systems | High | Traditional enterprises |

SAML 2.0 Configuration:

auth:
  provider: saml
  saml:
    entryPoint: https://idp.acme.com/sso
    issuer: https://human.acme.internal
    cert: /etc/human/saml/idp-cert.pem
    identifierFormat: urn:oasis:names:tc:SAML:1.1:nameid-format:emailAddress
    attributeMapping:
      email: http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress
      firstName: http://schemas.xmlsoap.org/ws/2005/05/identity/claims/givenname
      lastName: http://schemas.xmlsoap.org/ws/2005/05/identity/claims/surname

LDAP Configuration:

auth:
  provider: ldap
  ldap:
    url: ldaps://ldap.acme.com:636
    bindDN: cn=human-service,ou=services,dc=acme,dc=com
    bindPassword: <secret>
    searchBase: ou=users,dc=acme,dc=com
    searchFilter: (uid={{username}})
    groupSearchBase: ou=groups,dc=acme,dc=com
    groupMemberAttribute: memberOf

Corporate Proxy Support

For enterprises with mandatory proxy:

network:
  proxy:
    http: http://proxy.acme.com:8080
    https: http://proxy.acme.com:8080
    noProxy:
      - localhost
      - 127.0.0.1
      - .acme.internal
      - .svc.cluster.local
  caCerts:
    - /etc/ssl/certs/acme-root-ca.crt

VPN & Private Connectivity

AWS Direct Connect:

  • Private connection to HUMAN-hosted (hybrid deployment)
  • Latency: <10ms
  • Bandwidth: 1-100 Gbps

Azure ExpressRoute:

  • Private peering to HUMAN control plane
  • Redundant connections across regions

GCP Cloud Interconnect:

  • Dedicated interconnect for high throughput

Site-to-Site VPN:

  • IPsec tunnels for smaller deployments
  • Encrypted traffic over internet

Custom Certificate Authority

For enterprises with internal CA:

tls:
  ca: custom
  customCA:
    rootCert: /etc/human/ca/root.crt
    intermediateCerts:
      - /etc/human/ca/intermediate1.crt
      - /etc/human/ca/intermediate2.crt
  certManager:
    enabled: true
    issuer: acme-internal-ca

SIEM Integration

Supported SIEM Platforms:

Splunk:

logging:
  siem:
    provider: splunk
    endpoint: https://splunk.acme.com:8088
    token: <hec-token>
    index: human_logs
    sourcetype: human:json

Microsoft Sentinel:

logging:
  siem:
    provider: sentinel
    workspaceId: <workspace-id>
    sharedKey: <shared-key>
    logType: HumanAgentLogs

IBM QRadar:

logging:
  siem:
    provider: qradar
    endpoint: https://qradar.acme.com
    syslogPort: 514
    protocol: tcp

DLP Integration

For enterprises with Data Loss Prevention:

security:
  dlp:
    enabled: true
    provider: symantec  # or forcepoint, mcafee
    endpoint: https://dlp.acme.com/api
    scanOutbound: true
    blockOnViolation: true
    alertOnSuspicious: true

UPGRADE STRATEGY & BREAKING CHANGES

Release Cadence

| Release Type | Frequency | Version Change | Contents |
|---|---|---|---|
| Major | Annual | 1.0 → 2.0 | Breaking changes, new features |
| Minor | Quarterly | 1.1 → 1.2 | New features, no breaking changes |
| Patch | Monthly | 1.1.1 → 1.1.2 | Bug fixes, security patches |

Semantic Versioning

Format: MAJOR.MINOR.PATCH (e.g., 1.2.3)

  • MAJOR: Breaking API changes, requires migration
  • MINOR: New features, backward compatible
  • PATCH: Bug fixes, security patches

Breaking Change Policy

Announcement: 90 days before release
Migration Guide: Published with announcement
Support: Old version supported for 12 months after new major release

Example Timeline:

  • Day 0: Announce v2.0 (breaking changes)
  • Day 90: Release v2.0
  • Day 90-Day 455: Support both v1.x and v2.x
  • Day 455: End support for v1.x

Upgrade Process (Zero-Downtime)

Blue-Green Deployment:

# 1. Deploy new version (green) alongside old (blue)
$ helm install human-v2 human/control-plane \
  --namespace human-green \
  --set version=2.0.0

# 2. Validate green environment
$ human-installer validate --namespace human-green
  ✅ All health checks passed

# 3. Switch traffic to green (gradual)
$ kubectl patch ingress human --type merge \
  -p '{"spec":{"rules":[{"host":"human.acme.internal","http":{"paths":[{"path":"/","pathType":"Prefix","backend":{"service":{"name":"api-gateway-green","port":{"number":8080}}}}]}}]}}'

# 4. Monitor for 24 hours
# 5. Decommission blue environment
$ helm uninstall human-v1 --namespace human

Compatibility Matrix

| Control Plane | Agent SDK | Database Schema | Supported |
|---|---|---|---|
| 1.2.x | 1.2.x | v1.2 | ✅ Yes |
| 1.2.x | 1.1.x | v1.2 | ✅ Yes (N-1 support) |
| 1.2.x | 1.0.x | v1.2 | ❌ No (upgrade agents) |
| 2.0.x | 1.2.x | v2.0 | ❌ No (breaking change) |

Policy: Control plane supports agent SDK from previous minor version (N-1).
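
A sketch of that N-1 rule as a version check (illustrative helper, matching the matrix above):

// Same major version required; the SDK minor may lag the control plane by at most one.
function sdkSupported(controlPlane: string, agentSdk: string): boolean {
  const [cpMajor, cpMinor] = controlPlane.split('.').map(Number);
  const [sdkMajor, sdkMinor] = agentSdk.split('.').map(Number);
  return sdkMajor === cpMajor && sdkMinor >= cpMinor - 1 && sdkMinor <= cpMinor;
}

sdkSupported('1.2.0', '1.1.5'); // true:  N-1 support
sdkSupported('1.2.0', '1.0.9'); // false: upgrade agents
sdkSupported('2.0.0', '1.2.0'); // false: breaking change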


Database Migration Safety

Automated Migrations:

  • All migrations tested against copy of production data
  • Rollback plan for every migration
  • Execution time estimated and validated

Example Migration:

-- Migration: v1.2.0 → v1.3.0
-- Estimated time: 15 minutes (10M rows)
-- Rollback: Available (drop the newly added status_v2 column)

BEGIN;

-- Add new column
ALTER TABLE agents ADD COLUMN status_v2 VARCHAR(50);

-- Migrate data (batched)
UPDATE agents SET status_v2 = status WHERE status_v2 IS NULL;

-- Once validated, drop old column (future migration)
-- ALTER TABLE agents DROP COLUMN status;

COMMIT;

Automated Upgrade Testing

Pre-release validation:

# Run upgrade test suite
$ human-test-upgrade --from 1.2.0 --to 1.3.0

Tests:
  ✅ Database migration (15m 32s)
  ✅ API compatibility (all endpoints)
  ✅ Agent SDK compatibility (1.2.x → 1.3.x)
  ✅ Zero-downtime switchover
  ✅ Rollback procedure
  ✅ Performance benchmarks (no regression)

Result: Safe to upgrade

Upgrade Checklist

Pre-Upgrade:

  • Review release notes and migration guide
  • Backup all data (database, configs, secrets)
  • Test upgrade in staging environment
  • Schedule maintenance window (or plan zero-downtime)
  • Notify users of potential disruption

During Upgrade:

  • Deploy new version (blue-green)
  • Run database migrations
  • Validate new environment health
  • Switch traffic gradually (10% → 50% → 100%)
  • Monitor error rates and latency

Post-Upgrade:

  • Validate all critical workflows
  • Check monitoring dashboards
  • Verify agent registration
  • Confirm database performance
  • Decommission old environment (after 24hr)

DAY 2 OPERATIONS FOR SELF-HOSTED

Operational Responsibilities

| Task | Frequency | Owner | Automation |
|---|---|---|---|
| Database backups | Daily | Customer Ops | Automated (Velero) |
| Security patches | Weekly | Customer Ops | Semi-automated (Helm) |
| Certificate renewal | 30 days before expiry | Customer Ops | Automated (cert-manager) |
| Capacity review | Monthly | Customer Ops | Dashboard-driven |
| Performance tuning | Quarterly | Customer Ops + HUMAN TAM | Guided |
| Disaster recovery drill | Quarterly | Customer Ops | Scripted |
| Compliance audit | Annual | Customer Compliance | HUMAN support available |

Automated Backup Strategy

Velero Configuration:

# Backup schedule
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: human-daily-backup
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  template:
    includedNamespaces:
    - human
    - human-runtime
    storageLocation: aws-s3
    volumeSnapshotLocations:
    - aws-ebs
    ttl: 720h  # 30 days retention

Database Backup:

# PostgreSQL automated backup (via cron)
0 2 * * * pg_dump -h postgres.acme.internal -U human human | \
  gzip > /backups/human-$(date +\%Y\%m\%d).sql.gz

# Retention: 30 days local, 1 year S3

Disaster Recovery Procedures

Recovery Time Objective (RTO): 1 hour
Recovery Point Objective (RPO): 24 hours

DR Scenario 1: Database Failure

# 1. Promote read replica to primary
$ aws rds promote-read-replica --db-instance-identifier human-db-replica-1

# 2. Update connection string
$ kubectl patch configmap human-config \
  -p '{"data":{"DB_HOST":"human-db-replica-1.xyz.rds.amazonaws.com"}}'

# 3. Restart affected pods
$ kubectl rollout restart deployment --all -n human

# RTO: ~15 minutes

DR Scenario 2: Complete Region Failure

# 1. Failover to DR region
$ kubectl config use-context human-dr-us-west-2

# 2. Restore from backup
$ velero restore create --from-backup human-daily-backup-20250101

# 3. Update DNS (Route53 or equivalent)
$ aws route53 change-resource-record-sets --hosted-zone-id Z123 \
  --change-batch file://failover-dns.json

# RTO: ~1 hour

DR Drill Procedure

Quarterly drill (4 hours):

  1. Hour 1: Simulate region failure

    • Take primary region offline (controlled)
    • Measure detection time (<5 min target)
  2. Hour 2: Execute failover

    • Promote DR region
    • Restore from backup
    • Validate data integrity
  3. Hour 3: Validate DR environment

    • Run health checks
    • Test agent registration
    • Verify API functionality
  4. Hour 4: Failback to primary

    • Sync data from DR to primary
    • Switchback to primary region
    • Debrief and document improvements

Monitoring & Alerting

Critical Alerts (Page on-call):

| Alert | Threshold | Response Time |
|---|---|---|
| Control plane down | >50% pods unhealthy | <5 min |
| Database connection failure | >10% error rate | <5 min |
| Disk space critical | >90% full | <15 min |
| Certificate expiring soon | <7 days to expiry | <24 hr |
| License expiring | <30 days to expiry | <24 hr |

Warning Alerts (Review next business day):

| Alert | Threshold | Response Time |
|---|---|---|
| High CPU usage | >80% for 30min | <4 hr |
| High memory usage | >85% for 30min | <4 hr |
| Slow API responses | p95 >500ms | <4 hr |
| Failed backups | 2 consecutive failures | <12 hr |

Performance Tuning (Monthly Review)

Checklist:

  • Review database slow query log (optimize queries >100ms)
  • Check cache hit rate (target: >90%)
  • Analyze resource utilization (CPU, memory, disk)
  • Review pod auto-scaling behavior (scale-up/down frequency)
  • Check for pod restarts (investigate if >5/day)
  • Review API error logs (investigate 4xx/5xx patterns)

Common Operations

Scale Up (Add Capacity):

# Add 10 more nodes
$ eksctl scale nodegroup --cluster=human-prod --nodes=35 --name=human-workers

# Adjust HPA max replicas
$ kubectl patch hpa human-api --patch '{"spec":{"maxReplicas":30}}'

Add Region (Multi-Region):

# Deploy to new region
$ human-installer install --region eu-west-1 --profile multi-region

# Configure cross-region replication
$ human-installer configure-replication \
  --primary us-east-1 \
  --replica eu-west-1 \
  --mode async

Rotate Secrets:

# Rotate database password (zero-downtime)
$ human-installer rotate-secret --name DB_PASSWORD --zero-downtime

Steps:
  1. Generate new password
  2. Add to database (secondary user)
  3. Update application to use new password
  4. Remove old password from database
  5. Validate no errors

IMPLEMENTATION DETAILS

For engineers building or deploying HUMAN agents, detailed implementation specs are available:

Setup Specifications

Each deployment profile has a complete implementation spec with infrastructure configs, monitoring setup, and deployment procedures:

  • Hosted Profile: setup/agent_deployment_hosted_spec.md

    • Zero-config deployment flow
    • What HUMAN manages (infrastructure, monitoring, security)
    • API access and authentication
    • Cost structure and visibility
  • Hybrid Profile: setup/agent_deployment_hybrid_spec.md

    • Control plane in HUMAN Cloud, execution in customer VPC
    • Secure tunnel configuration (mTLS, no inbound firewall rules)
    • Monitoring options (push to HUMAN Cloud OR self-hosted)
    • Data residency guarantees
  • Self-Hosted Profile: See implementation spec setup/agent_deployment_selfhosted_spec.md

    • Complete infrastructure requirements
    • Helm charts and Terraform modules
    • Database setup and network topology
    • Air-gapped deployment support

Monitoring Configurations

Comprehensive, copy-paste configs for all profiles:

  • setup/monitoring_configurations.md
    • Prometheus scraping configs (self-hosted)
    • Grafana dashboard JSONs (fleet overview, cost analytics, audit trail)
    • Alert rules (agent down, high error rate, budget alerts)
    • Distributed tracing setup (Tempo integration)
    • Log aggregation (Loki configuration)

Also see: KB 103 (Monitoring & Observability) for architectural overview and best practices.

Control Plane Architecture

  • setup/mara_humanos_control_plane_v0.2.md
    • Control plane deployment by profile
    • Routing, policy engine, approval queue
    • Async job system and workflow DAG construction
    • Cross-profile consistency guarantees

Agent SDK Patterns

  • setup/human_agent_sdk_patterns_v0.2.md
    • human.call() primitive (works identically across all profiles)
    • Delegation and risk classification
    • Context propagation and attestation generation
    • Profile-aware SDK configuration

Quick Reference

| Need | Document |
|---|---|
| Deploy to Hosted (zero-config) | setup/agent_deployment_hosted_spec.md |
| Deploy to Hybrid (data sovereignty) | setup/agent_deployment_hybrid_spec.md |
| Deploy Self-Hosted (full control) | setup/agent_deployment_selfhosted_spec.md |
| Configure Prometheus/Grafana | setup/monitoring_configurations.md |
| Understand control plane | setup/mara_humanos_control_plane_v0.2.md |
| Build agents | KB 105 (Agent SDK Architecture), KB 130 (Design Patterns) |

MIGRATION PATHS: BORING AND REVERSIBLE

We bake migration paths in from day one:

From Hosted → Hybrid

Trigger: "We need attestations in our data lake" or "Compliance wants ledger in our region"

Process:

  1. Export Capability Graph state
  2. Export active policies
  3. Stand up ledger nodes in their VPC
  4. Configure HUMAN control plane to point to their ledger
  5. Test with shadow traffic
  6. Cut over

Downtime: Minutes (not hours)

Code changes required: Zero (just config)


From Hybrid → Self-Hosted

Trigger: "Audit says we need full operational control" or "We're going multi-cloud"

Process:

  1. Deploy HumanOS services via Helm/Terraform
  2. Deploy Capability Graph nodes
  3. Configure storage adapters (their RDS, S3, etc.)
  4. Migrate control plane state via export/import
  5. Point apps to new HUMAN_BASE_URL
  6. Decommission hosted control plane

Downtime: Hours (planned maintenance window)

Code changes required: URL + credential changes only


From Hosted → Self-Hosted (Skip Hybrid)

Trigger: "We're a 50-person healthcare startup, just got our first enterprise customer, need HIPAA self-hosted"

Process:

Same as Hosted → Hybrid → Self-hosted, but done in one shot with migration automation

Downtime: 1 day (planned)

Support: We provide migration engineer + runbooks


STORAGE ADAPTER ARCHITECTURE

Everything that persists state in HUMAN goes through a narrow interface:

The Three Stores

1. GraphStore

Stores: Capability Graph nodes and edges

Interface:

interface GraphStore {
  addNode(node: CapabilityNode): Promise<void>;
  addEdge(edge: CapabilityEdge): Promise<void>;
  queryCapabilities(query: CapabilityQuery): Promise<Capability[]>;
  updateCapability(id: string, update: CapabilityUpdate): Promise<void>;
}

Adapters:

  • HumanCloudGraphStore (our multi-tenant infra)
  • PostgresGraphStore (customer RDS/Aurora)
  • Neo4jGraphStore (native graph DB)
  • TigerGraphStore (high-performance alternative)

2. PolicyStore

Stores: HumanOS policies, rules, escalation configs

Interface:

interface PolicyStore {
  storePolicy(policy: Policy): Promise<void>;
  getPolicy(id: string): Promise<Policy>;
  evaluatePolicy(context: PolicyContext): Promise<PolicyDecision>;
  listPolicies(filter: PolicyFilter): Promise<Policy[]>;
}

Adapters:

  • HumanCloudPolicyStore
  • PostgresPolicyStore
  • S3PolicyStore (for large orgs with many policies)

3. LedgerStore

Stores: Attestations, provenance records, audit logs

Interface:

interface LedgerStore {
  anchor(attestation: Attestation): Promise<AnchorReceipt>;
  verify(id: string): Promise<VerificationResult>;
  query(filter: AttestationFilter): Promise<Attestation[]>;
  export(range: TimeRange): Promise<AuditExport>;
}

Adapters:

  • HumanCloudLedgerStore (hosted distributed ledger)
  • LocalLedgerStore (dev/test)
  • PrivateLedgerStore (customer-operated nodes)
  • SnowflakeLedgerStore (enterprise data lake integration)
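
With all three stores behind adapters, a deployment profile reduces to which implementations get wired in at startup. A minimal sketch of that selection logic follows; constructor arguments and the LEDGER_ENDPOINT variable are assumptions, while the adapter names match the lists above.

import { Pool } from 'pg';

type DeploymentProfile = 'hosted' | 'hybrid' | 'self-hosted';

// Illustrative wiring only: the real bootstrap code may differ.
function buildStores(profile: DeploymentProfile, pgPool?: Pool) {
  switch (profile) {
    case 'hosted':
      return {
        graph: new HumanCloudGraphStore(),
        policy: new HumanCloudPolicyStore(),
        ledger: new HumanCloudLedgerStore(),
      };
    case 'hybrid':
      // Control plane stays hosted; attestations anchor to customer-run nodes
      return {
        graph: new HumanCloudGraphStore(),
        policy: new HumanCloudPolicyStore(),
        ledger: new PrivateLedgerStore({ endpoint: process.env.LEDGER_ENDPOINT! }),
      };
    case 'self-hosted':
      // Everything runs on customer infrastructure
      return {
        graph: new PostgresGraphStore(pgPool!),
        policy: new PostgresPolicyStore(pgPool!),
        ledger: new PrivateLedgerStore({ endpoint: process.env.LEDGER_ENDPOINT! }),
      };
  }
}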

ONBOARDING FLOWS BY PROFILE

Hosted Onboarding (SMB)

Step 1: Sign up with Google / O365

  • Auto-create HUMAN workspace tied to domain

Step 2: Install Companion

  • Browser extension + desktop app
  • Generates Passport keys locally on device

Step 3: Pick a starter pack

  • "AI customer support with human escalation"
  • "AI sales assistant with approval gates"
  • "AI recruiting assistant with human screen"

Step 4: Connect existing tools

  • OAuth to Gmail, Slack, CRM, etc.
  • We store pointers, not content (see the sketch below)

From their POV: No talk of VPC, DBs, S3 buckets. It just… works.
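
To illustrate what "pointers, not content" means in practice, here is a hedged sketch of the kind of record HUMAN might keep after a Gmail or Slack connection: an external ID plus a content hash, never the body itself. The field names and pointer shape are illustrative.

import { createHash } from 'node:crypto';

// Illustrative only: HUMAN persists a pointer + hash for provenance,
// never the message or document content itself.
type DocumentPointer = {
  system: 'gmail' | 'slack' | 'crm';
  externalId: string;   // provider's own ID, resolvable via the customer's OAuth grant
  contentHash: string;  // lets attestations prove "this exact content" later
};

function toPointer(system: DocumentPointer['system'], externalId: string, content: string): DocumentPointer {
  return {
    system,
    externalId,
    // `content` is hashed and discarded, not stored
    contentHash: createHash('sha256').update(content).digest('hex'),
  };
}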


Hybrid Onboarding (Enterprise)

Step 1: Start with Hosted for pilot

  • Prove value with real workflows
  • Security evaluates during pilot

Step 2: Deploy data plane components

  • We provide Terraform modules
  • They deploy ledger + caches in VPC
  • Establish secure tunnel to HUMAN Cloud

Step 3: Migrate attestations

  • Historical data exports to their ledger
  • New attestations route to their infra

Step 4: Connect enterprise systems

  • SSO integration (Okta, Azure AD)
  • Private connectors to internal apps
  • VPC peering for sensitive workloads

From their POV: Same app experience, but the control plane stays in our cloud while attestations and the data plane live in their infra.


Self-Hosted Onboarding (Regulated)

Step 1: Architecture review

  • HUMAN solutions architect + their platform team
  • Define: regions, storage, networking, compliance requirements

Step 2: Deploy via IaC

  • Helm charts for Kubernetes
  • Terraform for AWS/GCP/Azure
  • Ansible for on-prem

Step 3: Configure storage adapters

  • Point to their RDS, S3, Neo4j, Snowflake, etc.
  • Set retention policies, backup strategies

Step 4: Load test and validate

  • Run simulated governance load
  • Validate attestation integrity (see the sketch below)
  • Test failover scenarios
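
For the attestation-integrity check in step 4, a minimal sketch using the LedgerStore interface defined earlier. The sampling approach, the `valid` flag on VerificationResult, and the failure handling are assumptions for illustration.

// Illustrative validation pass: re-verify a sample of migrated attestations
// against the customer's LedgerStore before cut-over.
async function validateMigratedAttestations(ledger: LedgerStore, sampleIds: string[]): Promise<boolean> {
  let ok = true;
  for (const id of sampleIds) {
    const result = await ledger.verify(id);
    if (!result.valid) { // assumes VerificationResult exposes a `valid` flag
      console.error(`Attestation ${id} failed verification after migration`);
      ok = false;
    }
  }
  return ok;
}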

Step 5: Cut over production apps

  • Update HUMAN_BASE_URL in app configs
  • Monitor dashboards for anomalies

From their POV: Full control, full visibility, HUMAN becomes infrastructure they operate.


WHAT STAYS THE SAME ACROSS ALL PROFILES

No matter which deployment profile, these don't change:

1. API Surface

Same REST/GraphQL/gRPC endpoints:

  • /v1/passport/*
  • /v1/capabilities/*
  • /v1/humanos/*
  • /v1/attestations/*

2. SDKs

Same client libraries:

import { HumanClient } from '@human/sdk';

const client = new HumanClient({
  baseUrl: process.env.HUMAN_BASE_URL // <-- only thing that changes
});
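
And the governed-call pattern itself is unchanged: the human.call() primitive referenced in the Agent SDK patterns above runs identically against any profile. The call shape below is illustrative; the exact fields, whether the primitive hangs off the client, and the `approved` flag on the response are assumptions, not the published SDK signature.

// Illustrative only: same governed call regardless of where HUMAN runs.
async function issueRefundWithApproval(amount: number) {
  const approval = await client.human.call({
    action: 'refund.issue',
    amount,
    requestedBy: 'agent:billing-assistant',
  });

  if (approval.approved) {
    // Proceed; the attestation was anchored by whichever LedgerStore the profile uses.
  }
}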

3. Semantics

Same policy language, same attestation format, same capability model

4. Developer Experience

Same docs, same examples, same onboarding tutorials

Result: Moving between profiles is a URL change, not a rewrite.


PRICING IMPLICATIONS BY PROFILE

Updated: 2025-12-19

The Principle: Self-Hosting Changes Margin Mix, Not Core Engines

Whether HumanOS runs fully on HUMAN Cloud, hybrid, or fully self-hosted, we still charge for governed infrastructure, workforce access, and network effects.

What changes: Who pays for infra and our margin per customer
What doesn't change: Whether we get paid

The sovereign cockpit model means orgs pay for:

  1. The Platform (HumanOS license, Policy Engine, Reasoning Service)
  2. The Standards (certification, attestation formats, compliance)
  3. The Network (optional: workforce services, marketplace, cross-org governance)

They DON'T pay for "permission to make decisions."

See 34_revenue_engines_and_tam.md for complete pricing tiers and revenue model.


Hosted (HUMAN Cloud)

What customer pays us:

  • Platform license (based on tier: agents + instance capacity)
    • Free: $0/month (3 agents, 10 instances)
    • Starter: $49/month (10 agents, 50 instances)
    • Professional: $199/month (50 agents, 200 instances)
    • Business: $799/month (200 agents, 800 instances)
    • Enterprise: $2,500+/month (custom)
  • Infrastructure included (we run the compute)
  • Optional: HUMAN-managed reasoning (we front AI token costs)
  • Optional: Workforce services (when available in Phase 2)

What we pay (our COGS):

  • Compute per instance-hour (infrastructure costs)
  • AI tokens (if HUMAN-managed reasoning)
  • Support overhead

⚠️ Pricing Validation Note:

Hosted Infrastructure Costs: Instance-hour allowances and overage pricing require validation against production AWS/GCP costs. The tier structure and features are validated. Self-hosted pricing (below) is fully validated.

Economics:

  • Highest touch (we run everything)
  • Target margin: 60-70% after scale
  • Revenue: Platform license + infrastructure bundled

Customer profile:

  • Small businesses (5-100 people)
  • No IT/DevOps team
  • Want "it just works"
  • Comfortable with HUMAN-hosted

Example: 15-person law firm at $199/month Professional tier

  • Gets 50 agents, 200 concurrent instances
  • HUMAN handles all infrastructure
  • Firm focuses on using agents, not running them

Hybrid (HUMAN Cloud + Customer Infrastructure)

What customer pays us:

  • Platform license (same tiers as Hosted)
  • Partial infrastructure (we host some, they host sensitive workloads)
  • BYO keys (typically for on-prem reasoning)
  • Optional: Workforce services

What we pay:

  • Compute for HUMAN-hosted portion only
  • Zero costs for their self-hosted portion

What customer pays (their costs):

  • Their own infrastructure (VPC, compute for on-prem agents)
  • Their own AI token costs (for on-prem reasoning)

Economics:

  • Mixed margins (lower than pure hosted, higher than pure self-hosted)
  • Revenue: Platform license + partial infrastructure
  • Lower compute costs for us (they run sensitive stuff)

Customer profile:

  • Mid-size orgs (100-500 people)
  • Some IT capability
  • Mix of sensitive and non-sensitive workloads
  • Want flexibility (cloud for convenience, on-prem for compliance)

Example: 50-person hospital at $799/month Business tier

  • Runs PHI-touching agents on-prem (clinical notes, patient data)
  • Runs non-PHI agents on HUMAN Cloud (scheduling, billing)
  • Gets HIPAA compliance built-in
  • Hybrid = best of both worlds

Self-Hosted (Customer Infrastructure)

What customer pays us:

  • Platform license only (based on agents/users/scale)
    • No infrastructure fees (they run it)
    • No per-instance charges (they pay their own compute)
  • Support & certification (annual contract)
    • Premium support included
    • Quarterly business reviews
    • Certification services
  • Optional: Workforce services (when available)
  • Optional: Marketplace (we take rev share on installed agents)

What we pay:

  • Minimal control plane infrastructure (metadata only)
  • Support team costs

What customer pays (their costs):

  • All infrastructure (VPC, Kubernetes, databases, compute)
  • All AI token costs (their BYO keys)
  • Their own DevOps/SRE team

Economics:

  • Lowest touch for us (they run it)
  • Pure software licensing margins (80%+)
  • Revenue: License + support + optional services
  • Highest ACVs (enterprises pay more for control)

Customer profile:

  • Large enterprises (500+ people)
  • Mature platform engineering team
  • Regulated industries (finance, healthcare, government)
  • Want full control and data sovereignty

Example: 500-person bank at $30k/year Enterprise license

  • Runs everything on their AWS
  • Uses their own LLM cluster
  • HUMAN provides: software license, certification, support
  • Bank's total cost: $30k license + ~$40k their infra = $70k/year
  • Bank's value: Replaced $500k BPO contract + $2M fraud savings β‰ˆ 36x ROI on total cost

Pricing Summary Table

Deployment Platform License Infrastructure Support Our Margin Customer ACV
Hosted $49-799/mo tiers Included Email/Phone 60-70% $588-9.6k/year
Hybrid Same tiers Partial (we host some) Business 50-60% $1k-15k/year
Self-Hosted $30k+/year Customer pays Enterprise + TAM (technical account manager) 80%+ $30k-100k+/year

Key insight: Self-hosted has highest margin (pure software) but requires enterprise sales motion. Hosted has lower margin but scales via self-serve.


Cost Flows by Deployment Mode

Hosted:

Customer pays: $799/month (Business tier)
β”œβ”€ To HUMAN: $799/month
   β”œβ”€ Platform license: $799
   β”œβ”€ Infrastructure: Included
   └─ AI tokens: Included (up to allowance)

HUMAN pays:
β”œβ”€ Compute: ~$200/month (infrastructure for their agents)
β”œβ”€ AI tokens: ~$150/month (reasoning calls)
β”œβ”€ Support: ~$50/month (allocated)
└─ Margin: ~$400/month (50%)

Hybrid:

Customer pays: $799/month + their AWS costs
β”œβ”€ To HUMAN: $799/month
   β”œβ”€ Platform license: $799
   β”œβ”€ Infrastructure: Partial (non-sensitive agents)
   └─ BYO keys for on-prem reasoning

β”œβ”€ To AWS (their bill): ~$300/month
   β”œβ”€ VPC for on-prem agents
   β”œβ”€ Compute for sensitive workloads
   └─ Their LLM endpoints

HUMAN pays:
β”œβ”€ Compute: ~$100/month (only non-sensitive portion)
β”œβ”€ Support: ~$50/month
└─ Margin: ~$650/month (81%)

Customer total cost: $1,099/month

Self-Hosted:

Customer pays: $30k/year license + their infrastructure
β”œβ”€ To HUMAN: $30k/year ($2,500/month)
   β”œβ”€ Platform license: $30k
   β”œβ”€ Support & certification: Included
   └─ Infrastructure: $0 (they run it)

β”œβ”€ To their cloud provider: ~$40k/year
   β”œβ”€ Kubernetes cluster
   β”œβ”€ Databases
   β”œβ”€ Compute for agents
   └─ Their LLM cluster

HUMAN pays:
β”œβ”€ Support team: ~$300/month (allocated)
β”œβ”€ Minimal infrastructure: ~$50/month (control plane metadata)
└─ Margin: ~$2,150/month (86%)

Customer total cost: $70k/year
Customer value delivered: $2.8M/year (savings + revenue)
ROI: 40x

Why This Model Works

For Small Businesses (Hosted):

  • Zero infrastructure burden
  • Predictable monthly cost
  • Scale up as they grow
  • Can migrate to hybrid/self-hosted later if needed

For Mid-Market (Hybrid):

  • Best of both worlds
  • Keep sensitive data on-prem
  • Use cloud for convenience
  • Optimize costs (don't pay us for compute they can run cheaper)

For Enterprises (Self-Hosted):

  • Full control and sovereignty
  • Data never leaves their infrastructure
  • Regulatory compliance built-in
  • Still get platform innovation (we ship updates)

For HUMAN:

  • Hosted = lower margin, higher volume (SMB focus)
  • Self-hosted = higher margin, lower volume (enterprise focus)
  • Both are profitable at scale
  • Revenue model survives regardless of deployment choice

Revenue Impact: Deployment Mix Over Time

Year 1 (Platform Launch):

  • 80% Hosted (SMBs discovering product)
  • 15% Hybrid (early mid-market)
  • 5% Self-Hosted (pilot enterprises)

Year 2 (Enterprise Adoption):

  • 60% Hosted (SMB growth continues)
  • 25% Hybrid (mid-market standard)
  • 15% Self-Hosted (enterprise momentum)

Year 3 (Enterprise Dominance):

  • 40% Hosted (most customers by count, but lower ACVs)
  • 30% Hybrid (sweet spot for many)
  • 30% Self-Hosted (fewest customers, highest ACVs and revenue share)

Revenue distribution shifts toward self-hosted even though hosted still dominates by customer count:

  • Hosted customers: Many, but $49-799/month each
  • Self-hosted customers: Few, but $30k-100k/year each

By Year 3:

  • 10,000 hosted customers Γ— $200/month avg = $24M ARR
  • 500 hybrid customers Γ— $1,000/month avg = $6M ARR
  • 200 self-hosted customers Γ— $50k/year avg = $10M ARR
  • Total Platform Revenue: $40M ARR

Self-hosted is 2% of customers but 25% of platform revenue (and highest margin).


DECISION CRITERIA: WHICH PROFILE SHOULD A CUSTOMER CHOOSE?

Factor Hosted Hybrid Self-Hosted
Team Size <200 200–5,000 >1,000 or regulated
Infra Team None / small Exists Mature platform eng
Data Sensitivity Low–Medium Medium–High Highest
Compliance General Industry-specific Regulated (HIPAA, FedRAMP)
Speed to Value Minutes Days Weeks
OpEx Preference High (pay us) Mixed Low (run it themselves)
CapEx Willingness None Some High
Vendor Lock-in Concern Low Medium High

AVOIDING HOSTING AS A BARRIER

The traditional problem:

  • Big enterprises: "Cool idea, but it has to run in our VPC"
  • SMB: "Please don't make me think about any of that"

Our solution:

  • SMB: "It just works, you never see infra"
  • Enterprise: "Same APIs, runs in your VPC when you're ready"

The messaging becomes:

For SMB:
"Start here, you never have to touch infra."

For Enterprise:
"Start here, prove value, then shift into your VPC with the same code."


STORAGE AS NON-ISSUE

When an enterprise says: "We only use Snowflake / RDS / Azure SQL / Splunk"

We say: "Cool β€” here are the adapters, here's a reference deployment, your apps don't change."

The adapter pattern means:

  • HUMAN Cloud: optimized multi-tenant storage
  • Customer-hosted: we support their preferred vendors
  • Migration: export from ours, import to theirs, done

Result: Storage preference becomes a config option, not a deal-breaker.


WHY THIS ARCHITECTURE WORKS

1. Clean Boundaries

  • Devices own keys (Layer 0)
  • HUMAN owns coordination (Layer 1)
  • Customers own data (Layer 2)

These layers never blur.

2. Pluggable Storage

  • Everything behind narrow interfaces
  • Swap PostgreSQL for Neo4j? Config change.
  • Add Snowflake export? New adapter.

3. Same Semantics Everywhere

  • Hosted, Hybrid, Self-hosted: same protocol
  • No "enterprise edition" with different behavior
  • Migration is boring (the best kind of boring)

4. Revenue Flexibility

  • SMB: SaaS economics (high margin)
  • Enterprise: Mixed (medium margin, high ACV)
  • Self-hosted: Services (lower margin, highest ACV)

Every segment has a profitable path.


HUMAN'S OWN PRODUCTION INFRASTRUCTURE (4-NINES ARCHITECTURE)

This section describes how HUMAN operates its own Hosted profile infrastructure to achieve 99.99% availability.

Multi-Region Active-Active Architecture

HUMAN targets 99.99% (4 nines) availability = 4.3 minutes downtime/month.

To achieve this, HUMAN operates multi-region active-active (not active-passive):

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Global DNS (Route53)                               β”‚
β”‚          Latency-based routing + health checks (10s interval)         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚                               β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”              β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  US-East-1     │◄─────────────►│  US-West-2        β”‚
    β”‚  (Active)      β”‚   Replication β”‚  (Active)         β”‚
    β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€   <1s lag     β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
    β”‚ β€’ EKS: 10 pods β”‚               β”‚ β€’ EKS: 10 pods    β”‚
    β”‚ β€’ Load: 50%    β”‚               β”‚ β€’ Load: 50%       β”‚
    β”‚ β€’ RDS: Primary β”‚               β”‚ β€’ RDS: Replica    β”‚
    β”‚ β€’ Redis: Pri   β”‚               β”‚ β€’ Redis: Replica  β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
           β”‚                                  β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚  Global State         β”‚
         β”‚ β€’ DynamoDB (global)   β”‚
         β”‚ β€’ S3 (multi-region)   β”‚
         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Characteristics:

  • Both regions serve live traffic (50% each)
  • Either region can handle 100% load (capacity buffer)
  • Automated failover <30 seconds if one region fails
  • No single points of failure (distributed across 3+ AZs per region)
  • Data replicated in real-time (<1s lag)

Cost Impact:

  • Single region: ~$3,500/month
  • Multi-region active-active: ~$7,500/month
  • Additional cost: $4,000/month for 4-nines availability
  • ROI: Prevents customer SLA breaches and reputation damage

Regional Failover Automation

Terraform Configuration:

# terraform/main.tf (root configuration; instantiates ./modules/region per region)

module "us_east_1" {
  source = "./modules/region"
  
  region = "us-east-1"
  environment = "production"
  is_primary = true  # RDS primary (write)
  
  eks_node_count = 10
  rds_instance_class = "db.r6g.2xlarge"
  rds_multi_az = true
  
  replicate_to = ["us-west-2"]
}

module "us_west_2" {
  source = "./modules/region"
  
  region = "us-west-2"
  environment = "production"
  is_primary = false  # RDS read replica (can be promoted)
  
  eks_node_count = 10
  rds_instance_class = "db.r6g.2xlarge"
  rds_multi_az = true
  
  replicate_from = "us-east-1"
}

# Route53 health checks
resource "aws_route53_health_check" "us_east_1" {
  fqdn = "api.us-east-1.human.ai"
  port = 443
  type = "HTTPS"
  resource_path = "/health"
  request_interval = 10  # Check every 10 seconds
  failure_threshold = 2  # Fail after 2 consecutive failures (20s)
  
  tags = {
    Name = "US-East-1 Health Check"
  }
}

resource "aws_route53_health_check" "us_west_2" {
  fqdn = "api.us-west-2.human.ai"
  port = 443
  type = "HTTPS"
  resource_path = "/health"
  request_interval = 10
  failure_threshold = 2
  
  tags = {
    Name = "US-West-2 Health Check"
  }
}

# Global DNS with latency-based routing
resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.human_ai.id
  name = "api.human.ai"
  type = "A"
  
  set_identifier = "us-east-1"
  latency_routing_policy {
    region = "us-east-1"
  }
  health_check_id = aws_route53_health_check.us_east_1.id
  
  alias {
    name = module.us_east_1.load_balancer_dns
    zone_id = module.us_east_1.load_balancer_zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "api_west" {
  zone_id = aws_route53_zone.human_ai.id
  name = "api.human.ai"
  type = "A"
  
  set_identifier = "us-west-2"
  latency_routing_policy {
    region = "us-west-2"
  }
  health_check_id = aws_route53_health_check.us_west_2.id
  
  alias {
    name = module.us_west_2.load_balancer_dns
    zone_id = module.us_west_2.load_balancer_zone_id
    evaluate_target_health = true
  }
}

Automated RDS Failover Lambda:

// lambda/regional-failover.ts
// Assumes promise-based wrappers around the AWS SDK (rds, route53) plus
// logger, pagerduty, and provenance helpers and a ZONE_ID constant are
// initialized elsewhere in this Lambda package.

export async function handleRegionalFailover(event: CloudWatchEvent) {
  const unhealthyRegion = event.detail.region;
  
  logger.critical({ unhealthyRegion }, 'Regional failover triggered');
  
  if (unhealthyRegion === 'us-east-1') {
    // Promote the us-west-2 cluster to primary. With Aurora Global Database
    // this is a detach-and-promote (RemoveFromGlobalCluster) or a managed
    // FailoverGlobalCluster; shown here as a single cluster-level call.
    await rds.promoteReadReplicaDBCluster({
      DBClusterIdentifier: 'human-production-us-west-2',
    });
    
    logger.info('RDS replica promoted to primary');
    
    // Shift DNS weights (100% to us-west-2). Assumes weighted record sets;
    // with latency routing + health checks, Route53 also stops resolving to
    // the unhealthy endpoint automatically. Alias/TTL fields omitted for brevity.
    await route53.changeResourceRecordSets({
      HostedZoneId: ZONE_ID,
      ChangeBatch: {
        Changes: [
          {
            Action: 'UPSERT',
            ResourceRecordSet: {
              Name: 'api.human.ai',
              Type: 'A',
              SetIdentifier: 'us-east-1',
              Weight: 0,  // Stop sending to us-east-1
            },
          },
          {
            Action: 'UPSERT',
            ResourceRecordSet: {
              Name: 'api.human.ai',
              Type: 'A',
              SetIdentifier: 'us-west-2',
              Weight: 100,  // Send 100% to us-west-2
            },
          },
        ],
      },
    });
    
    logger.info('Route53 updated to route to us-west-2');
  }
  
  // Page on-call
  await pagerduty.trigger({
    severity: 'critical',
    summary: `AUTOMATED REGIONAL FAILOVER: ${unhealthyRegion} β†’ healthy region`,
    details: {
      unhealthyRegion,
      estimatedDowntime: '20-30 seconds',
      rdsPromoted: true,
      dnsUpdated: true,
    },
  });
  
  // Log to provenance
  await provenance.log({
    actor: 'automation:regional-failover',
    action: 'promote_secondary_region',
    from: unhealthyRegion,
    automated: true,
  });
}

Database Multi-Region Strategy

Aurora Global Database:

# Primary cluster (us-east-1)
resource "aws_rds_cluster" "primary" {
  cluster_identifier = "human-production-primary"
  engine = "aurora-postgresql"
  engine_version = "15.3"
  engine_mode = "provisioned"
  
  master_username = "human_admin"
  master_password = data.aws_secretsmanager_secret_version.db_password.secret_string
  
  # Multi-AZ within region
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  
  # Global database for cross-region replication
  global_cluster_identifier = "human-production-global"
  
  # Automated backups
  backup_retention_period = 30
  preferred_backup_window = "03:00-04:00"
  
  # Connection pooling
  db_cluster_parameter_group_name = aws_rds_cluster_parameter_group.human_pg.name
  
  # Export PostgreSQL logs to CloudWatch
  enabled_cloudwatch_logs_exports = ["postgresql"]
}

# Secondary cluster (us-west-2) - read replica
resource "aws_rds_cluster" "secondary" {
  provider = aws.us_west_2
  
  cluster_identifier = "human-production-secondary"
  engine = "aurora-postgresql"
  engine_version = "15.3"
  
  # Joins the global cluster (below) and replicates from the primary automatically
  
  # Can be promoted to primary on failover
  global_cluster_identifier = "human-production-global"
  
  # Multi-AZ
  availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]
}

# Connection pooling with pgbouncer
resource "aws_ecs_service" "pgbouncer" {
  name = "pgbouncer"
  cluster = aws_ecs_cluster.human_production.id
  task_definition = aws_ecs_task_definition.pgbouncer.arn
  desired_count = 3
  
  load_balancer {
    target_group_arn = aws_lb_target_group.pgbouncer.arn
    container_name = "pgbouncer"
    container_port = 6432
  }
}

Replication Lag Monitoring:

# Prometheus alert for replication lag
- alert: RDSReplicationLagHigh
  expr: |
    aws_rds_replica_lag_seconds{cluster="human-production"} > 5
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "RDS replication lag >5 seconds"
    description: "Replication lag &#123;&#123; $value &#125;&#125;s may impact failover RTO"
    action: "Investigate replication performance"

Zero-Downtime Deployment with Terraform

Kubernetes Blue/Green via Terraform:

# Blue deployment (current production)
resource "kubernetes_deployment" "companion_api_blue" {
  metadata {
    name = "companion-api-blue"
    labels = {
      app = "companion-api"
      deployment = "blue"
    }
  }
  
  spec {
    replicas = 10
    
    selector {
      match_labels = {
        app = "companion-api"
        deployment = "blue"
      }
    }
    
    template {
      metadata {
        labels = {
          app = "companion-api"
          deployment = "blue"
          version = var.current_version
        }
      }
      
      spec {
        container {
          name = "companion-api"
          image = "human/companion-api:${var.current_version}"
          
          resources {
            requests {
              cpu = "500m"
              memory = "512Mi"
            }
            limits {
              cpu = "1000m"
              memory = "1Gi"
            }
          }
        }
      }
    }
  }
}

# Green deployment (new version, starts at 0 replicas)
resource "kubernetes_deployment" "companion_api_green" {
  metadata {
    name = "companion-api-green"
    labels = {
      app = "companion-api"
      deployment = "green"
    }
  }
  
  spec {
    replicas = var.deploy_active ? 10 : 0  # Controlled by deploy script
    
    selector {
      match_labels = {
        app = "companion-api"
        deployment = "green"
      }
    }
    
    template {
      metadata {
        labels = {
          app = "companion-api"
          deployment = "green"
          version = var.new_version
        }
      }
      
      spec {
        container {
          name = "companion-api"
          image = "human/companion-api:${var.new_version}"
          
          resources {
            requests {
              cpu = "500m"
              memory = "512Mi"
            }
            limits {
              cpu = "1000m"
              memory = "1Gi"
            }
          }
        }
      }
    }
  }
}

# Service points to blue or green
resource "kubernetes_service" "companion_api" {
  metadata {
    name = "companion-api"
  }
  
  spec {
    selector = {
      app = "companion-api"
      deployment = var.active_deployment  # "blue" or "green"
    }
    
    port {
      port = 80
      target_port = 3000
    }
    
    type = "ClusterIP"
  }
}

Deploy Script with Automated Rollback:

#!/bin/bash
# scripts/deploy-production.sh

set -e

NEW_VERSION=$1

# 1. Update green deployment to new version
terraform apply \
  -var="new_version=${NEW_VERSION}" \
  -var="deploy_active=true" \
  -target=kubernetes_deployment.companion_api_green

# 2. Wait for green pods ready
kubectl wait --for=condition=available \
  deployment/companion-api-green \
  --timeout=5m

# 3. Smoke tests (assumes a companion-api-green Service or port-forward reachable from the deploy host)
curl -f http://companion-api-green:3000/health || exit 1

# 4. Switch traffic (instant)
terraform apply \
  -var="active_deployment=green"

# 5. Monitor for 5 minutes
sleep 300

ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query" \
  --data-urlencode 'query=sum(rate(http_requests_total{version="'${NEW_VERSION}'",status=~"5.."}[5m])) / sum(rate(http_requests_total{version="'${NEW_VERSION}'"}[5m]))' \
  | jq -r '.data.result[0].value[1] // "0"')  # default to 0 if no samples yet

if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
  echo "❌ Rollback: error rate ${ERROR_RATE} > 1%"
  
  # Instant rollback
  terraform apply -var="active_deployment=blue"
  
  exit 1
fi

# 6. Success - scale down blue
terraform apply -var="deploy_active=false" \
  -target=kubernetes_deployment.companion_api_blue

echo "βœ… Deploy complete"

Infrastructure State Management

Remote State Backend:

# terraform/backend.tf

terraform {
  backend "s3" {
    bucket = "human-terraform-state"
    key = "production/terraform.tfstate"
    region = "us-east-1"
    
    # State locking
    dynamodb_table = "terraform-state-lock"
    encrypt = true
    
    # Versioning enabled on S3 bucket
  }
}

# State locking table
resource "aws_dynamodb_table" "terraform_state_lock" {
  name = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key = "LockID"
  
  attribute {
    name = "LockID"
    type = "S"
  }
  
  tags = {
    Name = "Terraform State Lock"
    Environment = "production"
  }
}

Deployment Rollout Timeline

Month Milestone Status
Month 1 Deploy US-East + US-West simultaneously Pending approval (+$4k/month cost)
Month 1 Terraform IaC for all infrastructure Pending
Month 1 Blue/green deployment automation Pending
Month 2 Regional failover tested monthly Pending
Month 3 Chaos engineering: kill region monthly Pending
Month 6 Add EU-West (3-region active-active) Planning

See Also:

  • kb/102_performance_engineering_guide.md - 4-nines architecture overview
  • kb/103_monitoring_and_observability_setup.md - Multi-region observability
  • kb/129_ai_driven_operations_strategy.md - AI-driven deployment automation

CROSS-REFERENCES

  • See: 26_hybrid_stack_architecture.md - Conceptual architecture and design philosophy
  • See: 49_devops_and_infrastructure_model.md - Operational infrastructure and multi-cloud strategy
  • See: 11_engineering_blueprint.md - System layers and component architecture
  • See: 107_developer_adoption_playbook.md - How deployment flexibility supports developer GTM
  • See: 109_pricing_mechanics_and_billing.md - How deployment profiles affect pricing
  • See: 43_haio_developer_architecture.md - API architecture that works across all profiles

Metadata

Created: November 26, 2025
Version: 1.0
Strategic Purpose: Enable every customer segment with zero-regret hosting
Audience: Technical decision-makers, solutions architects, platform teams
Related Docs: 26, 49, 11, 107, 109, 43

Line Count: ~590 lines
Status: βœ… Complete - Deployment Models and Hosting Strategy