
Google’s cloud outage: Move slowly and break things anyway

Paul Ducklin
06/17/2025

Friday the Thirteenth

The good news is that Google’s big-news global cloud outage of 2025-06-12 and 2025-06-13 (yes, that was Friday the Thirteenth) was sorted out within a few hours, a creditable response given the scale of the bug.

The bad news, of course, is the point we just mentioned: the sheer scale of the bug.

Intriguingly, the word sheer has two contrasting meanings, though both of them are metaphorically relevant in this case.

Sheer stockings, for example, are so thin and finely woven that they are easily torn and ruined without hope of repair; sheer cliffs, on the other hand, are so solid and vertically intimidating that they are difficult or even impossible to climb.

To give you an idea of just how sheer the bug was, consider that Google itself has stated that all of the following products were broken in the outage, meaning that every business, site or service that depends on any one of them may have been temporarily affected:

API Gateway, Agent Assist, AlloyDB for PostgreSQL, Apigee, AppSheet, Artifact Registry, AutoML Translation, BigQuery Data Transfer Service, Cloud Asset Inventory, Cloud Build, Cloud Data Fusion, Cloud Firestore, Cloud Healthcare, Cloud Key Management Service, Cloud Load Balancing, Cloud Logging, Cloud Memorystore, Cloud Monitoring, Cloud NAT, Cloud Run, Cloud Security Command Center, Cloud Shell, Cloud Spanner, Cloud Vision, Cloud Workflows, Cloud Workstations, Colab Enterprise, Contact Center AI Platform, Contact Center Insights, Database Migration Service, Dataplex, Dataproc Metastore, Datastream, Dialogflow CX, Dialogflow ES, Document AI, Gmail, Google App Engine, Google BigQuery, Google Calendar, Google Chat, Google Cloud Bigtable, Google Cloud Composer, Google Cloud Console, Google Cloud DNS, Google Cloud Dataproc, Google Cloud Deploy, Google Cloud Functions, Google Cloud Marketplace, Google Cloud NetApp Volumes, Google Cloud Pub/Sub, Google Cloud SQL, Google Cloud Search, Google Cloud Storage, Google Compute Engine, Google Docs, Google Drive, Google Meet, Google Security Operations, Google Tasks, Google Voice, Hybrid Connectivity, Identity Platform, Identity and Access Management, Integration Connectors, Looker (Google Cloud core), Looker Studio, Managed Service for Apache Kafka, Media CDN, Memorystore for Memcached, Memorystore for Redis, Memorystore for Redis Cluster, Migrate to Virtual Machines, Network Connectivity Center, Persistent Disk, Personalized Service Health, Pub/Sub Lite, Resource Manager API, Retail API, Speech-to-Text, Text-to-Speech, Traffic Director, VMware Engine, Vertex AI Feature Store, Vertex AI Search, Vertex Gemini API.

Servers in these locations were affected:

Belgium, Berlin, Columbus, Dallas, Dammam, Delhi, Doha, Finland, Frankfurt, Hong Kong, Iowa, Jakarta, Johannesburg, Las Vegas, London, Los Angeles, Madrid, Melbourne, Mexico, Milan, Montréal, Mumbai, Netherlands, Northern Virginia, Oregon, Osaka, Paris, Salt Lake City, Santiago, Seoul, Singapore, South Carolina, Stockholm, Sydney, São Paulo, Taiwan, Tel Aviv, Tokyo, Toronto, Turin, Warsaw, Zurich.

Popular large-scale services that reportedly suffered as a side-effect included:

Cloudflare, Discord, Firebase Studio, NPM, Snapchat, Spotify.

What went wrong?

To be fair to Google, the company has ’fessed up to what went wrong, albeit wrapped in techno-babble rather than explained in plain English.

As an example, the explanation starts with the following paragraph:

On May 29, 2025, a new feature was added to Service Control for additional quota policy checks. This code change and binary release went through our region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code. As a safety precaution, this code change came with a red-button to turn off that particular policy serving path. The issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.

In blunter form, here’s what Google seems to be saying:

  • We added some code to all our servers to check more carefully that our customers weren’t getting more data storage than they had paid for. But we never tested that this code worked correctly in real life. This new quota-checking feature just sat there ready for us to turn it on in the future.
  • We didn’t add any safety controls that would allow us to enable this feature selectively, server by server, region by region, when we did get round to trying it out.
  • A couple of weeks later, we accidentally activated the new feature worldwide by pushing out a malformed policy update with missing data fields, thereby inadvertently starting up the new quota-checking code with no configuration data.
  • The untested code turned out to have an old-school bug, namely that it assumed it would always have configuration data to work with, tried to read from a non-existent region of memory, and crashed. This denied access to everyone. (To be clear: this was better than accidentally letting everyone in.) See the sketch right after this list.
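To make the failure pattern concrete, here is a minimal sketch in Go. None of these names comes from Google’s code, which we haven’t seen; QuotaPolicy, the feature flag and both checking functions are invented for illustration. The unsafe version assumes the policy data is always there and crashes on a null (nil) pointer; the defensive version is gated behind a feature flag and treats missing data as an error to report rather than a crash to suffer:

    // A minimal sketch (not Google's actual code) of the failure
    // pattern described above: a new code path dereferences policy
    // data that a malformed update never filled in.
    package main

    import (
        "errors"
        "fmt"
    )

    // QuotaPolicy models a policy record. Limits is nil when the
    // update that delivered the policy was missing its data fields.
    type QuotaPolicy struct {
        Limits *map[string]int
    }

    // quotaCheckEnabled is the sort of feature flag the real code lacked.
    var quotaCheckEnabled = false

    // checkQuotaUnsafe mirrors the buggy path: it assumes Limits is
    // always present, so a malformed policy crashes the whole process
    // with a nil-pointer panic.
    func checkQuotaUnsafe(p *QuotaPolicy, product string) int {
        return (*p.Limits)[product] // panics if Limits is nil
    }

    // checkQuota is the defensive version: gated behind the flag, with
    // missing data reported as an error instead of suffered as a crash.
    func checkQuota(p *QuotaPolicy, product string) (int, error) {
        if !quotaCheckEnabled {
            return 0, nil // feature not rolled out here yet
        }
        if p == nil || p.Limits == nil {
            return 0, errors.New("quota policy has no limit data")
        }
        return (*p.Limits)[product], nil
    }

    func main() {
        quotaCheckEnabled = true    // imagine the rollout reaching us
        malformed := &QuotaPolicy{} // a policy update with missing fields
        if _, err := checkQuota(malformed, "storage"); err != nil {
            fmt.Println("handled gracefully:", err)
        }
        // checkQuotaUnsafe(malformed, "storage") // would crash the binary
    }

Note that nothing is wrong with the pointer access itself, merely with the assumption that the data would always be there, which is exactly the class of bug Google describes.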

Apparently, this series of “move slowly and break things anyway” errors was made slightly worse by one further problem, namely that the system hadn’t been designed to deal with what electricity grids refer to as a black start, where the system has to be brought up from complete failure.

Black starts typically need to be handled in a carefully controlled way, rather than all at once, to avoid crashing all over again.

If you’ve ever had your domestic electricity trip during winter-time, you’ll know that simply turning the main breaker back on probably won’t work. The combined startup current required to restart all your high-power devices at the same time (notably devices such as heat pumps that rely on electric motors) just causes the breaker to trip again. You need to spread the load by turning on individual circuits one-by-one and waiting for each device to settle into regular operation before trying the next one.
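In software terms, the same principle might look like the following minimal sketch, where workers are brought back up in small batches with a pause in between, instead of all at once. (The batch size, the delay and the startWorker function are illustrative placeholders, not details from Google’s report.)

    // A minimal sketch of a controlled "black start": bring workers up
    // in small batches with a pause between batches, instead of
    // switching everything back on at once.
    package main

    import (
        "fmt"
        "time"
    )

    func startWorker(id int) {
        fmt.Println("worker", id, "up")
    }

    func main() {
        const total, batch = 12, 3
        for i := 0; i < total; i += batch {
            for j := i; j < i+batch && j < total; j++ {
                startWorker(j) // switch on one small batch...
            }
            time.Sleep(2 * time.Second) // ...then let the load settle
        }
    }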

What to do?

  • Never roll out untested code into production on the assumption that it won’t be activated yet. Always assume that live code could be turned on by someone at some point.
  • Avoid turning on new code everywhere at once, even if you have tested it extensively, unless there is a clear and present danger that outweighs the risk of finding that it doesn’t work as well as you thought. (Additional customer quota checks don’t qualify as a “cybersecurity emergency.”)
  • Plan for black start emergencies where broken code generates yet more load by trying to “fix” itself too aggressively.
  • Add some sort of randomness into any startup process, along with what’s known as backoff, where a process that has already failed N times waits longer before its (N+1)th attempt, thus cutting some slack for other processes that would start up fine if only they could get a place in the queue.

Google’s incident report explicitly mentions exponential backoff, which loosely means doubling the time a failed process waits after each successive failure (e.g. wait 1 second, then 2 seconds, then 4, 8, 16 and so on), but anything that avoids lots of processes or servers inadvertently activating at the same time can help.
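As a concrete illustration, here is a minimal sketch of exponential backoff with random jitter in Go. The one-second starting wait, the retry limit and the failing operation are all placeholders, not details taken from Google’s incident report:

    // A minimal sketch of exponential backoff with random jitter:
    // double the wait after each failure, and add a random extra delay
    // so thousands of failed clients don't all retry in the same instant.
    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    // retryWithBackoff retries op up to maxAttempts times.
    func retryWithBackoff(op func() error, maxAttempts int) error {
        wait := 1 * time.Second
        for attempt := 1; attempt <= maxAttempts; attempt++ {
            err := op()
            if err == nil {
                return nil
            }
            if attempt == maxAttempts {
                return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
            }
            // Sleep for the base wait plus up to 50% random jitter,
            // then double the base: 1s, 2s, 4s, 8s, and so on.
            jitter := time.Duration(rand.Int63n(int64(wait / 2)))
            time.Sleep(wait + jitter)
            wait *= 2
        }
        return nil // unreachable
    }

    func main() {
        calls := 0
        err := retryWithBackoff(func() error {
            calls++
            if calls < 4 {
                return errors.New("service still overloaded")
            }
            return nil // succeeds on the fourth try
        }, 6)
        fmt.Println("result:", err, "after", calls, "calls")
    }

The jitter is the part that spreads otherwise-simultaneous retries apart; without it, every failed process would wake up again in lock-step and hammer the recovering service all over again.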

Remember: if even the behemoth that is Google can trip itself and its customers up with a null pointer error, then anyone can.

Null pointers are almost always encoded as “a memory address of zero”, denoting either that necessary memory space has not yet been allocated so the program cannot proceed, or that an error has occurred and the program should detect and respond accordingly rather than ploughing on regardless.

Null pointer crashes therefore often happen because an error was ignored, thereby provoking a yet more serious error for which the only safe response from the operating system is to kill off the program abruptly to prevent it doing yet more harm.
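Here is a minimal Go sketch of that chain of events: a lookup fails, the error gets ignored, and the nil pointer that came back alongside the error blows up later. (The Config type and the loadConfig function are hypothetical, invented for illustration.)

    // A minimal sketch of how an ignored error becomes a null-pointer
    // crash later on.
    package main

    import (
        "errors"
        "fmt"
    )

    type Config struct{ Region string }

    // loadConfig returns a nil pointer together with an error when the
    // requested configuration is missing; the error is the signal the
    // caller must not ignore.
    func loadConfig(name string) (*Config, error) {
        return nil, errors.New("config " + name + " not found")
    }

    func main() {
        cfg, err := loadConfig("quota-policy")
        if err != nil {
            // Checking the error here keeps the nil pointer from ever
            // being dereferenced further down the line.
            fmt.Println("refusing to continue:", err)
            return
        }
        fmt.Println("region:", cfg.Region) // reached only with a valid cfg

        // The buggy pattern looks like this instead:
        //    cfg, _ := loadConfig("quota-policy") // error ignored
        //    fmt.Println(cfg.Region)              // nil dereference: crash
    }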

Error checking in your program code is a bit like making backups: the only time you’ll ever regret it is if you forget to do it!




More About Duck

Paul Ducklin is a respected expert with more than 30 years of experience as a programmer, reverser, researcher and educator in the cybersecurity industry. Duck, as he is known, is also a globally respected writer, presenter and podcaster with an unmatched knack for explaining even the most complex technical issues in plain English. Read, learn, enjoy!
