7 Key Resiliency Upgrades: How Cloudflare's 'Fail Small' Initiative Makes the Network Stronger

<p>After two quarters of intensive engineering work, Cloudflare has completed its internal 'Code Orange: Fail Small' project. The initiative grew out of lessons learned from the global outages of November 18 and December 5, 2025. The goal is a network that absorbs shocks rather than amplifying them. Perfect uptime remains an ongoing effort, but the changes now in place would have prevented those two incidents. From safer configuration rollouts to smarter incident response, here are the <strong>seven critical upgrades</strong> that now protect every Cloudflare customer.</p> <h2 id="item1">1. The 'Fail Small' Mindset</h2> <p>The core philosophy behind Code Orange is <strong>failing small</strong>. Instead of letting a single error cascade across the entire network, new safeguards contain problems at the earliest stage. When a configuration change or software update goes wrong, only a small fraction of traffic, or none at all, is affected. The system automatically detects anomalies and isolates the failure before it can grow. This mindset is now part of every team’s development lifecycle, so that resilience is a default property rather than an afterthought.</p><figure style="margin:20px 0"><img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4jmuJjPlgOaXJe4N4pFII6/e727b99cf584fbf177d86e5785078957/Copy_of_OG_Share_2024-2025-2026.png" alt="7 Key Resiliency Upgrades: How Cloudflare&#039;s &#039;Fail Small&#039; Initiative Makes the Network Stronger" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: blog.cloudflare.com</figcaption></figure> <h2 id="item2">2. Health-Mediated Configuration Deployment</h2> <p>Historically, configuration changes could reach the entire network instantaneously. That is no longer the case. Cloudflare now uses <strong>health-mediated deployment</strong> for all high-risk configuration pipelines. 
Changes are rolled out progressively, with real-time monitoring comparing key metrics (latency, error rates, traffic drops) against baselines. If the system detects a negative trend, it automatically reverts the change before it impacts customer traffic. This approach — long used for software releases — is now applied to every configuration data file, control flag, and policy update. The result: fewer surprises and safer updates.</p> <h2 id="item3">3. Snapstone: The New Deployment Backbone</h2> <p>A key enabler of health-mediated config changes is a new internal system called <strong>Snapstone</strong>. Snapstone bundles configuration changes into deployable packages and orchestrates their gradual release with built-in health checks. Before Snapstone, each team had to build its own rollout mechanism, leading to inconsistency. Snapstone provides a unified, reusable platform that makes progressive deployment the default. It can handle any unit of configuration — from the data file that triggered the November outage to the control flag involved in December — and automatically rolls back if health thresholds are breached. This flexibility ensures that lessons from past failures protect against future ones.</p> <h2 id="item4">4. Reduced Blast Radius: Containing Failures</h2> <p>Even with safer changes, failures can still occur. The next layer of defense is <strong>reducing the blast radius</strong>. Cloudflare has re-architected internal systems so that a single misconfiguration in one region or service cannot cascade globally. By segmenting the network into smaller, isolated units and limiting the propagation of errors, the impact of any incident stays contained. For example, a bad packet filter update in <em>us-east</em> will no longer affect traffic in <em>eu-west</em>. This compartmentalization gives teams more time to respond and minimizes customer disruption.</p> <h2 id="item5">5. 
Revamped 'Break Glass' Procedures</h2> <p>When emergencies demand immediate access to critical systems, <strong>break glass</strong> procedures must be fast yet secure. Cloudflare has overhauled its emergency access protocols so that authorized engineers can quickly bypass normal safeguards during an incident without introducing new risks. The new procedures include time-limited credentials, strict logging, and automatic reviews after the event. Urgent actions, such as overriding a faulty configuration, can be taken in seconds but always leave an audit trail for post-incident analysis.</p> <h2 id="item6">6. Smarter Incident Management &amp; Customer Communication</h2> <p>Outages are not just technical problems; they are communication challenges. Cloudflare has invested heavily in <strong>incident management</strong> tooling and training. New dashboards provide real-time status updates internally, while automated systems now detect when customers might be affected. Customer communication during incidents has also been strengthened: status pages update faster, messages are clearer about the root cause and expected resolution time, and follow-up postmortems include detailed technical explanations. Transparency and speed are now the standard.</p> <h2 id="item7">7. 
Preventing Drift: Long-Term Resilience</h2> <p>Resilience can erode over time as teams ship new features and make ad-hoc changes. To prevent <strong>drift</strong>, Cloudflare has implemented regular regression testing and automated compliance checks. High-risk configuration patterns are monitored continuously, and any deviation from approved safe states triggers alerts. Additionally, the engineering culture now emphasizes <strong>resilience by design</strong>: every new project includes a review of how it could fail small. These measures ensure that the improvements from Code Orange remain effective as the network evolves.</p> <p><strong>Conclusion:</strong> Cloudflare’s Code Orange: Fail Small project is a comprehensive upgrade to how the network handles change and failure. By embedding health-mediated deployment, reducing blast radius, improving incident response, and preventing long-term drift, the company has created a more robust foundation for its 38 million+ Internet properties. While no network is immune to all problems, the lessons from the 2025 outages have been translated into concrete engineering safeguards. The result is a stronger, more resilient Cloudflare — one that truly <em>fails small</em> so that customers can keep growing big.</p>
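<p>The health-mediated rollout described in item 2 (apply a change to a small stage, compare health metrics against a baseline, and revert automatically on a negative trend) can be sketched roughly as follows. This is an illustrative sketch only, not Snapstone's actual interface: the names <code>Health</code>, <code>is_healthy</code>, and <code>progressive_rollout</code>, the stage labels, and the thresholds are all assumptions made for the example.</p>

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Health:
    """Hypothetical health snapshot for one rollout stage."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p50_latency_ms: float  # median latency in milliseconds


def is_healthy(baseline: Health, observed: Health,
               max_error_increase: float = 0.01,
               max_latency_ratio: float = 1.2) -> bool:
    """Return True if observed metrics stay within tolerance of the baseline.

    The thresholds are arbitrary illustrative values.
    """
    if observed.error_rate - baseline.error_rate > max_error_increase:
        return False
    if observed.p50_latency_ms > baseline.p50_latency_ms * max_latency_ratio:
        return False
    return True


def progressive_rollout(stages: Sequence[str],
                        apply: Callable[[str], None],
                        revert: Callable[[str], None],
                        measure: Callable[[str], Health],
                        baseline: Health) -> bool:
    """Apply a change stage by stage; revert every applied stage on bad health.

    Returns True if the change reached all stages, False if it was rolled back.
    """
    applied = []
    for stage in stages:
        apply(stage)
        applied.append(stage)
        if not is_healthy(baseline, measure(stage)):
            # Undo in reverse order so the most recently changed stage recovers first.
            for done in reversed(applied):
                revert(done)
            return False
    return True


if __name__ == "__main__":
    live = set()  # stages where the new configuration is currently active
    baseline = Health(error_rate=0.001, p50_latency_ms=20.0)

    def measure(stage: str) -> Health:
        # Simulated metrics: the change misbehaves once it reaches "region".
        if stage == "region":
            return Health(error_rate=0.2, p50_latency_ms=20.0)
        return Health(error_rate=0.001, p50_latency_ms=21.0)

    ok = progressive_rollout(["canary", "region", "global"],
                             live.add, live.discard, measure, baseline)
    print("rolled out:", ok, "| stages still live:", sorted(live))
```

<p>In this sketch the change never reaches the hypothetical "global" stage: the failed health check at "region" triggers a revert of both "region" and "canary", which is the fail-small behavior the article describes.</p>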