Alert in, remediation out. With dedup, approval gates, and full audit.
CyGuru is the alert-driven self-heal pipeline. Two detection inputs (Alertmanager webhook + 60s SelfMonitorLoop probes), one dedup layer (60-minute window), one approval gate (default OFF for destructive actions), one executor. Handlers ship for pod crashloops, MongoDB rolling restarts, node memory pressure cordoning, and stale OVN chassis recovery. Every action is audited; every action is reversible.
Crashloop >3 restarts in 5min → CyGuru proposes pod-restart with audit context. Admin approves; executor runs; success metric exported to Prometheus.
Mongo replica becomes unreachable → CyGuru proposes safe rolling restart. Avoids 3am pages for the SRE on-call.
Node >90% memory → cordon + drain non-critical pods. New schedules go elsewhere. Operator notified, not paged.
OVN chassis registered by UUID instead of hostname → known production bug → CyGuru auto-fixes via node-agent restart.
Implement RemediationHandler interface. CyGuru handles dedup, approval, audit, metric emission. You write the fix logic.
Different remediations route to different approvers: PHI-touching remediations go to the compliance officer, network remediations go to security, and so on.
Prevents alert storms from generating remediation storms.
Default policy: a human approves before execution. Trust is earned over time through a narrow LOW_RISK allowlist.
proposed / deduped / approved / executed counters; duration histograms; auto-dashboards in Grafana.
Alertmanager webhook (push) + 60s SelfMonitorLoop (pull). Different remediation triggers, same pipeline.
Every executor action has a rollback plan or is non-destructive. Audit log captures both.
Implement Java RemediationHandler interface; CyGuru loads it. No fork required.
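How the 60-minute dedup window, the approve-before-execute default, the narrow allowlist, the counters and the audit trail fit together, as an added sketch that is not CyGuru's code. The `Handler` interface, `Fix` record, `GateSketch` class, the action and target strings, and the choice of which action sits on the allowlist are all assumptions made for illustration.

```java
import java.time.*;
import java.util.*;

// Added illustration (Java 16+); every type and method name is hypothetical, not CyGuru's API.
public class GateSketch {
    enum Outcome { DEDUPED, PENDING_APPROVAL, EXECUTED }
    // Assumed handler shape: an action name, the target it acts on, and the fix itself.
    interface Handler { String action(); String target(); void execute(); }
    record Fix(String action, String target) implements Handler {
        public void execute() { System.out.println("run " + action + " on " + target); }
    }
    static final Duration WINDOW = Duration.ofMinutes(60);   // dedup window stated on the page
    private final Set<String> lowRisk;                       // actions allowed to run unattended
    private final Map<String, Instant> lastSeen = new HashMap<>();
    private final Map<String, Handler> pending = new HashMap<>();
    final Map<Outcome, Integer> counters = new EnumMap<>(Outcome.class);
    final List<String> audit = new ArrayList<>();
    GateSketch(Set<String> lowRisk) { this.lowRisk = lowRisk; }
    Outcome propose(Handler h, Instant now) {
        String key = h.action() + "|" + h.target();
        Instant prev = lastSeen.get(key);
        if (prev != null && Duration.between(prev, now).compareTo(WINDOW) < 0)
            return log(Outcome.DEDUPED, key, "pipeline");    // repeat inside the window: no new action
        lastSeen.put(key, now);
        if (lowRisk.contains(h.action())) { h.execute(); return log(Outcome.EXECUTED, key, "allowlist"); }
        pending.put(key, h);                                 // fail closed: nothing runs until approved
        return log(Outcome.PENDING_APPROVAL, key, "pipeline");
    }
    Outcome approve(String key, String approver) {
        Handler h = pending.remove(key);
        if (h == null) throw new IllegalStateException("nothing pending for " + key);
        h.execute();
        return log(Outcome.EXECUTED, key, approver);         // audit names who approved
    }
    private Outcome log(Outcome o, String key, String actor) {
        counters.merge(o, 1, Integer::sum);                  // proposed/deduped/executed style counters
        audit.add(o + " " + key + " by=" + actor);
        return o;
    }
    public static void main(String[] args) {
        GateSketch g = new GateSketch(Set.of("ovn-chassis-fix"));
        Instant t0 = Instant.parse("2024-01-01T03:00:00Z");  // constructed sample timeline
        Handler restart = new Fix("pod-restart", "payments/api-7f9c");
        g.propose(restart, t0);                              // crashloop alert arrives
        g.propose(restart, t0.plusSeconds(600));             // same alert 10 minutes later
        g.approve("pod-restart|payments/api-7f9c", "admin"); // human releases the pending action
        g.propose(new Fix("ovn-chassis-fix", "node-3"), t0); // allowlisted action
        System.out.println(g.counters + "\n" + g.audit);
    }
}
```

The dedup key is recorded before the approval decision, so a storm of identical alerts cannot queue duplicate approvals while the first one is still pending.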
30-day proof of concept on two idle servers. We bring the SE. You bring the use case.