Cloudflare's Toxic Combinations: A Practical Compound-Signal Checklist for Incident Prevention

· 7 min read
Victor Jimenez
Software Engineer & AI Agent Builder

Your deploy was fine. Your WAF rule update was also fine. Both hitting the same service within fifteen minutes at 2 a.m.? That is where the outage lives, and your single-metric dashboards will smile green the entire time. Cloudflare wrote an entire postmortem about this blind spot — stacked low-signal anomalies that every alert evaluates in isolation and nobody evaluates together — so I turned it into an enforceable playbook before the next on-call learns the lesson the hard way.

How Toxic Combinations Work

"Incidents often come from individually normal events that become dangerous only when correlated in a short time window."

— Cloudflare, The Curious Case of Toxic Combinations

Context

This is where single-metric alerting fails. Each signal below is individually normal and would not trigger an alert on its own. The danger is in the combination. The fix is a playbook that defines which low signals should be paired, correlation windows for each pair, and escalation thresholds tied to blast radius.

Why Per-Signal Alerting Misses These

  1. A change is valid in isolation.
  2. Another change is also valid in isolation.
  3. Existing controls evaluate each signal separately.
  4. No control evaluates the combination in real time.
  5. A low-probability overlap becomes a high-impact outage.

Alert-Correlation Playbook

| Combo ID | Low-signal A | Low-signal B | Window | Escalate when | Severity |
|---|---|---|---|---|---|
| TC-01 | 2x deploys to the same service in 30 min | p95 latency up 15% for 10 min | 30 min | Error-budget burn >2%/hour | SEV-3 |
| TC-02 | WAF managed-rule update | 403 rate up 1.5x on authenticated paths | 15 min | >=2 regions or >=5% signed-in traffic | SEV-2 |
| TC-03 | Feature flag enabled for >=10% of traffic | DB lock-wait p95 >300 ms for 5 min | 20 min | Checkout/login in impact set | SEV-2 |
| TC-04 | Secrets rotation completed | Auth token validation failures >0.7% | 20 min | Sustained 10 min after rotation | SEV-2 |
| TC-05 | Autoscaler event >=20% | Upstream 5xx rate above 0.5% | 15 min | Queue lag growth >25% | SEV-2 |
| TC-06 | Cache purge or key-schema change | Origin egress up 40% | 20 min | CDN hit ratio drops >=10 points | SEV-3 |
| TC-07 | Rate-limit policy change | Support error reports >=5 in 15 min | 15 min | Same route/tenant in both sets | SEV-3 |
| TC-08 | DNS/proxy config change | Regional timeout rate >1.2% | 30 min | Payment/auth path impacted | SEV-1 |
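
Encoding the playbook rows as data makes them loadable by both the CI pipeline and the runtime alert processor, and lets pages auto-attach the right runbook by combo ID. This is a minimal sketch; the `ToxicCombo` type, field names, and the three sample rows shown are illustrative, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToxicCombo:
    combo_id: str
    signal_a: str
    signal_b: str
    window_minutes: int
    escalate_when: str
    severity: str

# A few rows from the playbook table above (abbreviated for the example).
PLAYBOOK = [
    ToxicCombo("TC-01", "2x deploys to same service in 30 min",
               "p95 latency up 15% for 10 min", 30,
               "error budget burn >2%/hour", "SEV-3"),
    ToxicCombo("TC-02", "WAF managed-rule update",
               "403 rate up 1.5x on authenticated paths", 15,
               ">=2 regions or >=5% signed-in traffic", "SEV-2"),
    ToxicCombo("TC-08", "DNS/proxy config change",
               "regional timeout rate >1.2%", 30,
               "payment/auth path impacted", "SEV-1"),
]

def lookup(combo_id: str) -> ToxicCombo:
    """Fetch a playbook entry by ID, e.g. to attach runbook links to a page."""
    return next(c for c in PLAYBOOK if c.combo_id == combo_id)
```

Keeping the table in version control alongside the alerting code means a playbook change goes through the same review as any other deploy.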

Correlation Rules to Implement First

Start with deterministic rules before ML anomaly scoring:

  1. Group by service + env + region + deploy_sha in rolling windows.
  2. Require at least one control-plane signal (deploy/config/policy) and one data-plane signal (latency/errors/timeouts).
  3. Suppress duplicate pages for 15 minutes after acknowledgment, but keep the event count incrementing in the incident timeline.
  4. Auto-attach runbook links by combo ID (TC-01...TC-08) in page payload.
  5. Auto-promote to next severity tier if condition persists for 2 windows.
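
Rules 1 and 2 above can be sketched as a simple window scan: group events by service, environment, and region, then flag any group whose rolling window contains both a control-plane and a data-plane signal. The event tuple shape and signal-kind labels here are assumptions for illustration, not a real event schema.

```python
from collections import defaultdict
from datetime import datetime, timedelta

CONTROL_PLANE = {"deploy", "config", "policy"}
DATA_PLANE = {"latency", "errors", "timeouts"}

def find_toxic_windows(events, window=timedelta(minutes=30)):
    """Flag (service, env, region) groups where a control-plane and a
    data-plane signal co-occur within the rolling window.
    `events` is an iterable of (timestamp, service, env, region, kind)."""
    groups = defaultdict(list)
    for ts, service, env, region, kind in events:
        groups[(service, env, region)].append((ts, kind))
    hits = []
    for key, evs in groups.items():
        evs.sort()
        for i, (ts, _) in enumerate(evs):
            # All signal kinds that fall inside the window starting at ts.
            in_window = {k for t, k in evs[i:] if t - ts <= window}
            if in_window & CONTROL_PLANE and in_window & DATA_PLANE:
                hits.append(key)
                break
    return hits
```

A production version would also carry `deploy_sha` in the group key and emit the matching combo ID, but the core check stays this small.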

Pre-Deploy Checklist for Agent Workflows

| # | Check | Block if "No" |
|---|---|---|
| 1 | Change coupling: did this touch auth, routing, flags, secrets, schema, or policy at the same time? | Advisory |
| 2 | Blast radius: if these fail together, is the impact local, regional, or global? | Advisory |
| 3 | Concurrency: are other deploys in flight in the same 30-60 min window? | Advisory |
| 4 | Control + data plane overlap: did the change modify both control logic and the request path? | Block |
| 5 | Rollback certainty: can every component be rolled back independently in <5 min? | Block |
| 6 | Guardrail coverage: do tests assert the interaction path, not just component paths? | Advisory |
| 7 | Canary realism: does canary traffic include high-risk edge cases? | Advisory |
| 8 | Signal correlation alert: do alerts fire when two low-severity signals co-occur? | Block |
| 9 | Kill-switch readiness: is there a verified emergency flag to disable the new interaction path? | Block |
| 10 | Ownership clarity: is there a single incident commander for this combined risk surface? | Advisory |

Reality Check

If any answer is "no" for items 4, 5, 8, or 9, block autonomous merge/deploy and require human approval. Most agent-driven deployments break here because they evaluate each change in isolation and never consider compound risk. Two safe changes can still produce one unsafe deployment.
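
The gate itself is deliberately boring: any "no" on a blocking item fails closed to human approval. A minimal sketch, assuming answers arrive as a mapping from checklist item number to yes/no:

```python
BLOCKING_ITEMS = {4, 5, 8, 9}  # overlap, rollback, correlation alert, kill switch

def deploy_decision(answers: dict[int, bool]) -> str:
    """answers maps checklist item number -> True ('yes') / False ('no').
    Missing answers count as 'no', so the gate fails closed by default."""
    failed = sorted(i for i in BLOCKING_ITEMS if not answers.get(i, False))
    if failed:
        return f"BLOCK: human approval required (items {failed})"
    return "PROCEED"
```

Treating a missing answer as "no" is the important design choice: an agent that skips the checklist gets blocked, not waved through.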

Integration-specific security checks
  • Verify every third-party integration has scoped tokens and per-environment credentials
  • Require explicit allowlists for outbound hosts in agent actions and CI runners
  • Deny silent fallback behavior when integration auth fails; fail fast and alert
  • Confirm audit logs link each automated action to actor, workflow run, and change set
  • Validate revocation path: rotating integration keys must complete without downtime
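
The outbound-allowlist check is the easiest of these to automate. A minimal sketch with fail-closed semantics; the two allowlisted hosts are placeholders, not a recommendation:

```python
from urllib.parse import urlparse

# Example allowlist -- in practice this lives in reviewed config, per environment.
ALLOWED_HOSTS = {"api.github.com", "registry.npmjs.org"}

def check_outbound(url: str) -> bool:
    """Return True only for explicitly allowlisted hosts. Anything
    unrecognized (including unparsable URLs) is denied."""
    host = urlparse(url).hostname
    return host in ALLOWED_HOSTS
```

Running this inside the agent action wrapper or the CI runner's egress proxy turns the bullet above from policy into an enforced control.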

Agent + CI Implementation

| Step | Action |
|---|---|
| 1 | Add toxic_combo_id evaluation to CI/CD metadata and the runtime alert processor |
| 2 | Compute compound_risk_score from combo count, critical-path weight, and persistence |
| 3 | Fail closed when compound_risk_score >= 70 and rollback certainty is not verified |
| 4 | Require two-key approval for any deploy touching control-plane + auth/routing paths |
| 5 | Emit toxic_combination_candidate events and review them weekly, including near misses |
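
Steps 2 and 3 can be sketched as follows. The weights and the linear formula are assumptions to make the gate concrete; tune them against your own incident history:

```python
def compound_risk_score(combo_count: int, critical_path_weight: float,
                        persistence_windows: int) -> float:
    """Illustrative scoring: active combo count, critical-path weight
    (0.0-1.0), and how many consecutive windows the condition persisted.
    Saturates at 100."""
    score = combo_count * 25 + critical_path_weight * 30 + persistence_windows * 15
    return min(score, 100.0)

def should_fail_closed(score: float, rollback_verified: bool) -> bool:
    """Step 3: block the deploy at score >= 70 unless rollback is verified."""
    return score >= 70 and not rollback_verified
```

Note that the score only gates when rollback certainty is unverified: a risky deploy with a proven five-minute rollback path can still proceed.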

Why this matters for Drupal and WordPress

Drupal and WordPress sites on managed or platform hosting (Pantheon, Acquia, WP Engine, Cloudflare, etc.) often see "normal" changes in isolation: a deploy, a WAF or CDN config tweak, a cache purge, or a DB/plugin update. Toxic combinations happen when two or more of these land in a short window and no one correlates them. Platform and agency teams running CI for Drupal/WordPress should adopt compound-signal checks: define which low-signal pairs (e.g. deploy + latency spike, cache purge + origin load) matter for your stack, set correlation windows and escalation thresholds, and run them in CI or in your observability pipeline so the next incident is caught before users notice.

Takeaways

  • Cloudflare's "toxic combinations" pattern maps directly onto agent and CI workflows where multiple automated changes land in the same window without cross-checking each other.
  • Per-signal alerting will keep missing real incidents. Compound signal detection catches the overlaps that matter.
  • The pre-deploy checklist converts postmortem hindsight into gates that run before code ships.
  • Deterministic correlation rules first; ML anomaly scoring layered on top once you have labeled data from production near-misses.

Looking for an Architect who doesn't just write code, but builds the AI systems that multiply your team's output? View my enterprise CMS case studies at victorjimenezdev.github.io or connect with me on LinkedIn.