Skip to main content

Cloudflare's Toxic Combinations: A Practical Compound-Signal Checklist for Incident Prevention

· 6 min read
Victor Jimenez
Software Engineer & AI Agent Builder

Cloudflare's "toxic combinations" lesson is simple: incidents often come from individually normal events that become dangerous only when correlated in a short time window. The useful operational takeaway is not just "be careful with change." It is to encode correlation logic that promotes stacked low-signal anomalies before they become user-visible incidents.

I turned their postmortem insight into an enforceable playbook.

The Pattern

"Incidents often come from individually normal events that become dangerous only when correlated in a short time window."

— Cloudflare, The Curious Case of Toxic Combinations

Context

This is where single-metric alerting fails. Each signal below is individually normal and would not trigger an alert on its own. The danger is in the combination. The fix is a playbook that defines which low signals should be paired, correlation windows for each pair, and escalation thresholds tied to blast radius.

The Anti-Pattern

  1. A change is valid in isolation.
  2. Another change is also valid in isolation.
  3. Existing controls evaluate each signal separately.
  4. No control evaluates the combination in real time.
  5. A low-probability overlap becomes a high-impact outage.

Alert-Correlation Playbook

Combo IDLow-signal ALow-signal BWindowEscalate WhenSeverity
TC-012x deploys to same service in 30 minp95 latency up 15% for 10 min30 minError budget burn >2%/hourSEV-3
TC-02WAF managed-rule update403 rate up 1.5x on authenticated paths15 min>=2 regions or >=5% signed-in trafficSEV-2
TC-03Feature flag enabled for >=10% trafficDB lock wait p95 >300ms for 5 min20 minCheckout/login in impact setSEV-2
TC-04Secrets rotation completedAuth token validation failures >0.7%20 minSustained 10 min after rotationSEV-2
TC-05Autoscaler event >=20%Upstream 5xx rises above 0.5%15 minQueue lag growth >25%SEV-2
TC-06Cache purge or key-schema changeOrigin egress up 40%20 minCDN hit ratio drops >=10 pointsSEV-3
TC-07Rate-limit policy changeSupport error reports >=5 in 15 min15 minSame route/tenant in both setsSEV-3
TC-08DNS/proxy config changeRegional timeout >1.2%30 minPayment/auth path impactedSEV-1

Correlation Rules to Implement First

Start with deterministic rules before ML anomaly scoring:

  1. Group by service + env + region + deploy_sha in rolling windows.
  2. Require at least one control-plane signal (deploy/config/policy) and one data-plane signal (latency/errors/timeouts).
  3. Suppress duplicate pages for 15 minutes after acknowledgment, but keep event count rising in timeline.
  4. Auto-attach runbook links by combo ID (TC-01...TC-08) in page payload.
  5. Auto-promote to next severity tier if condition persists for 2 windows.

Pre-Deploy Checklist for Agent Workflows

#CheckBlock If "No"
1Change coupling: did this touch auth, routing, flags, secrets, schema, or policy at the same time?Advisory
2Blast radius: if these fail together, is impact local, regional, or global?Advisory
3Concurrency: other in-flight deploys in same 30-60 min window?Advisory
4Control + data plane overlap: modified both control logic and request path?Block
5Rollback certainty: can we roll back every component independently in <5 min?Block
6Guardrail coverage: tests assert interaction path, not just component paths?Advisory
7Canary realism: canary traffic includes high-risk edge cases?Advisory
8Signal correlation alert: alerts fire when two low-severity signals co-occur?Block
9Kill-switch readiness: verified emergency flag to disable new interaction path?Block
10Ownership clarity: single incident commander for this combined risk surface?Advisory
Reality Check

If any answer is "no" for items 4, 5, 8, or 9, block autonomous merge/deploy and require human approval. This is where most agent-driven deployments fail — they evaluate each change in isolation without considering the compound risk surface.

Integration-specific security checks
  • Verify every third-party integration has scoped tokens and per-environment credentials
  • Require explicit allowlists for outbound hosts in agent actions and CI runners
  • Deny silent fallback behavior when integration auth fails; fail fast and alert
  • Confirm audit logs link each automated action to actor, workflow run, and change set
  • Validate revocation path: rotating integration keys must complete without downtime

Agent + CI Implementation

StepAction
1Add toxic_combo_id evaluation in CI/CD metadata and runtime alert processor
2Compute compound_risk_score from combo count, critical-path weight, and persistence
3Fail closed when compound_risk_score >= 70 and rollback certainty is not verified
4Require two-key approval for any deploy touching control-plane + auth/routing paths
5Emit toxic_combination_candidate events and review weekly, including near misses

What I Learned

  • Cloudflare's "toxic combinations" is the most useful incident pattern I have seen for agent and CI workflows.
  • Single-signal alerting misses real incidents. Compound signal detection is the fix.
  • The pre-deploy checklist turns postmortem insight into enforceable automation.
  • Start with deterministic correlation rules. ML anomaly scoring can come later.

References