CyGuru

CyGuru is the alert-driven self-heal pipeline. Two detection inputs (Alertmanager webhook + 60s SelfMonitorLoop probes), one dedup layer (60-minute window), one approval gate (default OFF for destructive), one executor. Handlers ship for pod crashloops, MongoDB rolling restarts, node memory pressure cordoning, and stale OVN chassis recovery. Every action audited; every action reversible.

Gallery

See Cyguru in action.

Alert in, remediation out

Alertmanager webhook plus 60-second SelfMonitorLoop. Two detection inputs, one dedup layer, one approval gate, one executor.

60-minute dedup window

Alert storms cannot generate remediation storms.

Full Prometheus metrics

Proposed / deduped / approved / executed counters. Duration histograms. Auto-dashboards in Grafana.

Pluggable handlers

Implement RemediationHandler interface. CyGuru loads dynamically. No fork required.

Use Cases

Where CyGuru wins.

Pod crashloop auto-recovery

Crashloop >3 restarts in 5min → CyGuru proposes pod-restart with audit context. Admin approves; executor runs; success metric exported to Prometheus.

MongoDB rolling restart on outage

Mongo replica becomes unreachable → CyGuru proposes safe rolling restart. Avoids 3am pages for the SRE on-call.

Node memory pressure auto-cordon

Node >90% memory → cordon + drain non-critical pods. New schedules go elsewhere. Operator notified, not paged.

OVN stale chassis cleanup

OVN chassis registered by UUID instead of hostname → known production bug → CyGuru auto-fixes via node-agent restart.

Custom handlers for your stack

Implement RemediationHandler interface. CyGuru handles dedup, approval, audit, metric emission. You write the fix logic.

Compliance-driven approval workflows

Different remediations route to different approvers. PHI-touching = compliance officer. Network = security. Etc.

Key Capabilities

What's inside.

🔁

60-min dedup window

Prevents alert storms from generating remediation storms.

✋

Approval-first default

Default policy: human approves before execution. Trust earned over time, narrow LOW_RISK allowlist.

📈

Full Prometheus metrics

proposed / deduped / approved / executed counters; duration histograms; auto-dashboards in Grafana.

🤖

Two detection inputs

Alertmanager webhook (push) + 60s SelfMonitorLoop (pull). Different remediation triggers, same pipeline.

📜

Reversible by design

Every executor action has a rollback plan or is non-destructive. Audit log captures both.

🔧

Pluggable handlers

Implement Java RemediationHandler interface; CyGuru loads it. No fork required.

vs competition: Replaces PagerDuty toil + custom remediation scripts. Not a replacement for capacity planning or root-cause analysis.

Full battle card →

See Cyguru in action.

Alert in, remediation out

60-minute dedup window

Full Prometheus metrics

Pluggable handlers

Where CyGuru wins.

Pod crashloop auto-recovery

MongoDB rolling restart on outage

Node memory pressure auto-cordon

OVN stale chassis cleanup

Custom handlers for your stack

Compliance-driven approval workflows

What's inside.

60-min dedup window

Approval-first default

Full Prometheus metrics

Two detection inputs

Reversible by design

Pluggable handlers

Ready to see CyGuru in action?