On Tuesday, many users likely ran into the now-famous Cloudflare 500 error page while browsing the web. Between 11:30 and 14:30 UTC, countless pages and services were unreachable, among them Ikea, PayPal, ChatGPT, and X (formerly Twitter). Notebookcheck was also affected.
Cloudflare offers various services for website operators
When people talk about the biggest players on the internet, Amazon, Google, Microsoft, and Meta (Facebook) are usually the first to be mentioned. If something goes wrong at one of them, large parts of the internet stop working. Cloudflare, whose main business is protecting websites from attacks and speeding them up, tends to be overlooked. Yet many websites and services rely on it to shorten loading times and protect their servers.
By caching content from websites and services and acting as a proxy, Cloudflare smooths the connection between clients and servers. It also filters out malicious requests and absorbs load spikes, and it is probably best known for its protection against DDoS attacks. For website operators, the most important aspect is often the shorter loading times that come from caching pages on servers around the world. Numerous websites use Cloudflare to take load off their own servers while reducing latency for visitors.
On November 18th, a widespread outage occurred at Cloudflare
On Tuesday, a severe error struck Cloudflare's network, making the websites and services of its customers inaccessible. In a blog post, Cloudflare CEO Matthew Prince detailed the events leading up to the largest outage in the company's network since 2019.
Around 11:30 UTC, Cloudflare's network began returning an extremely high number of HTTP 5xx errors as the result of a configuration mistake. The error rate fluctuated significantly until about 13:00 UTC, which initially led Cloudflare to suspect an external attack, and internal chats at the time speculated that a botnet was responsible. That suspicion was reinforced when Cloudflare's own status page also became inaccessible during this period. After some time, the error rate across the network returned to its expected low level.
The actual problem originated inside Cloudflare's own network. A change to the permissions of a database system, rolled out around 11:05 UTC, set off the chain of errors. As a result, the feature file used by the bot management system was inflated to nearly twice its original size. Cloudflare's software, however, enforces a fixed size limit for this file and reserves memory for it accordingly. The oversized file exceeded that limit, causing the software to crash. Because the feature file was regenerated every five minutes and not all of Cloudflare's clusters were running the new configuration yet, each cycle could distribute either a working file or a faulty one across the network. This explains the fluctuations in the error rate. Around 13:37 UTC, Cloudflare's incident response team realized that the changes affecting the bot management system were causing the outage. About an hour later, they finally managed to resolve the issue.
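The failure mode described above, a fixed size limit meeting an unexpectedly large file, can be illustrated with a short Python sketch. This is not Cloudflare's code: the names `MAX_FEATURES`, `parse_feature_file`, and `BotConfig` and the limit of 200 entries are assumptions made purely for illustration. Unlike the software in the incident, which crashed, this sketch rejects an oversized file and keeps serving the last known good version.

```python
from dataclasses import dataclass, field

# Hypothetical fixed limit; the real value is not given in this article.
MAX_FEATURES = 200


class FeatureFileTooLarge(Exception):
    """Raised when a feature file has more entries than the fixed limit."""


def parse_feature_file(lines, max_features=MAX_FEATURES):
    """Return the non-empty entries, or raise if there are too many of them."""
    features = [line.strip() for line in lines if line.strip()]
    if len(features) > max_features:
        raise FeatureFileTooLarge(
            f"{len(features)} features exceed the limit of {max_features}"
        )
    return features


@dataclass
class BotConfig:
    """Holds the last feature list that passed validation."""

    features: list = field(default_factory=list)

    def reload(self, lines):
        """Try to load a new file; on failure keep the previous one."""
        try:
            new_features = parse_feature_file(lines)
        except FeatureFileTooLarge:
            return False  # oversized file rejected, old config stays active
        self.features = new_features
        return True


config = BotConfig()
good_file = [f"feature_{i}" for i in range(120)]
bad_file = good_file * 2  # same entries twice, roughly doubling the file

print(config.reload(good_file), len(config.features))
print(config.reload(bad_file), len(config.features))
```

In this sketch the second reload fails, but the 120 entries from the first file remain in use, so a bad file degrades freshness rather than availability.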
The impact of the Cloudflare outage clearly shows the internet's questionable dependence on a small number of players. A single configuration error at one central point was enough to make countless websites and services unreachable. This raises the question of how vulnerable the internet as we know it really is.








