tptacek 15 hours ago

I'm a tedious broken record about this (among many other things) but if you haven't read this Richard Cook piece, I strongly recommend you stop reading this postmortem and go read Cook's piece first. It won't take you long. It's the single best piece of writing about this topic I have ever read and I think the piece of technical writing that has done the most to change my thinking:

https://how.complexsystems.fail/

You can literally check off the things from Cook's piece that apply directly here. Also: when I wrote this comment, most of the thread was about root-causing the DNS thing that happened, which I don't think is the big story behind this outage. (Cook rejects the whole idea of a "root cause", and I'm pretty sure he's dead on right about why.)

  • cb321 12 hours ago

    That minimalist post mortem for the public is of what sounds like a Rube Goldberg machine and the reality is probably even more hairy. I completely agree that if one wants to understand "root causes", it's more important to understand why such machines are built/trusted/evolved in the first place.

    That piece by Cook is ok, but it's largely just a list of assertions (true or not, most do feel intuitive, though). I suppose one should delve into all those references at the end for details? Anyway, this is an ancient topic, and I doubt we have all the answers on those root whys. The MIT course on systems, 6.033, used to assign a paper that has been raised on HN only a few times in its history: https://news.ycombinator.com/item?id=10082625 and https://news.ycombinator.com/item?id=16392223 It's from 1962, over 60 years ago, but it is also probably more illuminating/thought-provoking than the post mortem. Personally, I suspect it's probably an instance of a https://en.wikipedia.org/wiki/Wicked_problem , but only past a certain scale.

    • tptacek 11 hours ago

      I have a housing activism meetup I have to get to, but real quick let me just say that these kinds of problems are not an abstraction to me in my day job, that I read this piece before I worked where I do and it bounced off me, but then I read it last year and was like "are you me but just smarter?", like my pupils probably dilated theatrically when I read it like I was a character in Requiem for a Dream, and I think most of the points he's making are much subtler and deeper than they seem at a casual read.

      You might have to bring personal trauma to this piece to get the full effect.

      • cb321 11 hours ago

        Oh, it's fine. At your leisure. I didn't mean to go against the assertions themselves, but more just kind of speak to their "unargued" quality and often sketchy presentation. Even that Simon piece has a lot of this in there, where it's sort of "by definition of 'complexity'/by unelaborated observation".

        In engineered systems, there is just a disconnect between the KISS we practice on our own/at small scale and what happens in large organizations, and then what happens over time. This is the real root cause/why, but I'm not sure it's fixable. Maybe partly addressable, tho'.

        One thing that might give you a moment of worry: both in that Simon piece and, far more broadly, all over academia long before and ever since, biological systems like our bodies are an archetypal example of "complex". Besides medical failures, life mostly has this one main trick -- make many copies, and if they don't all fail before they, too, can copy, then a stable-ish pattern emerges.

        Stable populations + "litter size/replication factor" largely imply average failure rates. For most species it is horrific. On the David Attenborough specials they'll play the sad music and tell you X% of these offspring never make it to mating age. The alternative is not the https://en.wikipedia.org/wiki/Gray_goo apocalypse, but the "whatever-that-species-is-biopocalypse". Sorry - it's late and my joke circuits are maybe fritzing. So, both big 'L' and little 'l' life, too, "is on the edge", just structurally.

        https://en.wikipedia.org/wiki/Self-organized_criticality (with sand piles and whatnot) used to be a kind of statistical physics hope for a theory of everything of these kinds of phenomena, but it just doesn't get deployed. Things will seem "shallowly critical" but not so upon deeper inspection. So, maybe it's just not a useful enough approximation.

        Anyway, good luck with your housing meetup!

  • nickelpro 5 hours ago

    To quote Grandpa Simpson, "Everything everyone just said is either obvious or wrong".

    Pointing out that "complex systems" have "layers of defense" is neither insightful nor useful, it's obvious. Saying that any and all failures in a given complex system lack a root cause is wrong.

    Cook uses a lot of words to say not much at all. There's no concrete advice to be taken from How Complex Systems Fail, nothing to change. There's no casualty procedure or post-mortem investigation which would change a single letter of a single word in response to it. It's hot air.

    • baq 4 hours ago

      There's a difference between 'grown organically' and 'designed to operate in this way', though. Experienced folks will design system components with conscious awareness of what operations actually look like from the start. Juniors won't, and will be bolting on quasi-solutions as their systems fall over time and time again. Cook's generalization is actually wildly applicable, but it takes work to map it to specific situations.

  • markus_zhang 12 hours ago

    As a contractor who is on an oncall schedule, I have never worked in a company that treats oncall as a very serious business. I have only worked in 2 companies that need oncall, so I'm biased. On paper, they both say it is serious and all the SLA stuff was set up, but in reality there is not enough support.

    The problem is, oncall is a full-time business. It takes the full attention of the oncall engineer, whether there is an issue or not. Both companies simply treat oncall as a by-product: we just had to do it, so let's stuff it into the sprint. The first company was slightly more serious, as we were asked to put up a 2-3 point oncall task in JIRA. The second one doesn't even do this.

    Neither company really encourages engineers to read through complex code written by others, even if we do oncall for those products. Again, the first company did better, and we were supposed to create a channel and pull people in, so it’s OKish to not know anything about the code. The second company simply leaves oncall to do whatever they can. Neither company allocates enough time for engineers to read the source code thoroughly. And neither has good documentation for oncall.

    I don’t know the culture of AWS. I’d very much want to work in an oncall environment that is serious and encourages learning.

    • dekhn 10 hours ago

      When I was an SRE at Google our oncall was extremely serious (if the service went down, Google was unable to show ads, record ad impressions, or do any billing for ads). It was done on a rotation and lasted 1 week (IIRC it was 9AM-9PM, and we had another time zone covering the alternate 12 hours). The on-call was empowered to do pretty much anything required to keep the service up and running, including cancelling scheduled downtimes, pausing deployment updates, stopping abusive jobs, stopping abusive developers, and invoking an SVP if there was a fight with another important group.

      We sent a test page periodically to make sure the pager actually beeped. We got paid extra for being in the rotation. The leadership knew this was a critical step. Unfortunately, much of our tooling was terrible, which would cause false pages, or failed critical operations, all too frequently.

      I later worked on SWE teams that didn't take dev oncall very seriously. At my current job, we have an oncall, but it's best effort business hours only.

      • citizenpaul 6 hours ago

        >empowered to do pretty much anything required to keep the service up and running,

        Is that really uncommon? I've been on call for many companies and many types of institutions and never once been told I couldn't do something to bring a system up, that I can recall at least. It's kinda the job?

        On-call seriousness should be directly proportional to pay. Google pays. If smallcorp wants to pay me COL, I'll be looking at that 2AM ticket at 9AM when I get to work.

      • lanyard-textile 7 hours ago

        Handling my first non-prod alert bug as the oncall at Google was pretty eye opening :)

        It was a good lesson in what a manicured lower environment can do for you.

      • markus_zhang 10 hours ago

        That's pretty good. Our oncall is actually 24-hour for one week. On paper it looks very serious, but even the best of us don't really know everything, so issues tend to lag to the morning. Neither do we get any compensation for it. Someone gets a bad night and still needs to log on the next day. There is an informal understanding that you can relax a bit if the night was too bad, though.

        • dmoy 8 hours ago

          I did 24hr-for-a-week oncall for 10+ years, do not recommend.

          12-12 rotation in SRE is a lot more reasonable for humans

          • sandeepkd 5 hours ago

            Unfortunately 24hr-for-a-week seems to be the default everywhere nowadays; it's just not practical for serious businesses. It's just an indicator of how important UPTIME is to a company.

          • markus_zhang 7 hours ago

            I agree. It sucks. And our schedule is actually 2 weeks in every five. One is secondary and the other is primary.

    • malfist 11 hours ago

      Amazon generally treats oncall as a full-time job. Engineers who are on call are generally expected to only be on call. No feature work.

      • tidbits 10 hours ago

        It's very team/org dependent, and I would say that's generally not the case. In 6 years I have only had 1 team out of 3 where that was true. On the other two teams I was expected to juggle feature work with oncall work. Same for most teams I interacted with.

  • yabones 13 hours ago

    Another great lens to see this is "Normal Accidents" theory, where the argument is made that the most dangerous systems are ones where components are very tightly coupled, interactions are complex and uncontrollable, and consequences of failure are serious.

    https://en.wikipedia.org/wiki/Normal_Accidents

  • ponco an hour ago

    Respectfully, I don't think that piece adds anything of material substance. It's a list of hollow platitudes (vapid writing listing inactionable truisms).

  • ramraj07 8 hours ago

    As I was reading through that list, I kept feeling, "why do I feel this is not universally true?"

    Then I realized: the internet; the power grid (at least in most developed countries); there are things that don't actually fail catastrophically, even though they are extremely complex and not always built by efficient organizations. What's the retort to this argument?

    • singron 7 hours ago

      They do fail catastrophically. E.g. https://en.wikipedia.org/wiki/Northeast_blackout_of_2003

      I think you could argue AWS is more complex than the electrical grid, but even if it's not, the grid has had several decades to iron out kinks and AWS hasn't. AWS also adds a ton of completely new services each year in addition to adding more capacity. E.g. I bet these DNS Enactors have become more numerous and their plans have grown much larger than when they were first developed, which has greatly increased the odds of experiencing this issue.

    • figassis 2 hours ago

      The grid fails catastrophically. It happened this year in Portugal, Spain and nearby countries, didn't it? Still, think of the grid as more like DNS. It is immense, but the concept is simple and well understood. You can quickly identify where the fault is (even if not the actual root cause), and can also quickly address it (even if bringing it back up in sync takes time and is not trivial). Current cloud infra is different in that each implementation is unique, services are unique, knowledge is not universal. There are no books about AWS's infra fundamentals or how to manage AWS's cloud.

    • jb1991 7 hours ago

      The power grid is a huge risk in several major western nations.

  • ericyd 11 hours ago

    I'll admit I didn't read all of either document, but I'm not convinced of the argument that one cannot attribute a failure to a root cause simply because the system is complex and required multiple points of failure to fail catastrophically.

    One could make a similar argument in sports that no one person ever scores a point because they are only put into scoring position by a complex series of actions which preceded the actual point. I think that's technically true but practically useless. It's good to have a wide perspective of an issue but I see nothing wrong with identifying the crux of a failure like this one.

    • Yokolos 8 hours ago

      The best example for this is aviation. Insanely complex from the machines to the processes to the situations to the people, all interconnected and constantly interacting. But we still do "root cause" analyses and based on those findings try to improve every point in the system that failed or contributed to the failure, because that's how we get a safer aviation industry. It's definitely worked.

    • wbl 11 hours ago

      It's extremely useful in sports. We evaluate batters on OPS vs RBI, and no one evaluates them on runs they happened to score. We talk all the time about a QB working together with his linemen and receivers. If all we talked about was the immediate cause, we'd miss all that.

  • nonfamous 9 hours ago

    Great link, thanks for sharing. This point below stood out to me — put another way, “fixing” a system in response to an incident to make it safer might actually be making it less safe.

    >>> Views of ‘cause’ limit the effectiveness of defenses against future events.

    >>> Post-accident remedies for “human error” are usually predicated on obstructing activities that can “cause” accidents. These end-of-the-chain measures do little to reduce the likelihood of further accidents. In fact that likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly. Instead of increasing safety, post-accident remedies usually increase the coupling and complexity of the system. This increases the potential number of latent failures and also makes the detection and blocking of accident trajectories more difficult.

    • albert_e 7 hours ago

      But that sounds like an assertion without evidence and underestimates the competence of everyone involved in designing and maintaining these complex systems.

      For example, take airline safety -- are we to believe, based on the quoted assertion, that every airline accident and resulting remedy that mitigated the causes has made air travel LESS safe? That sounds objectively, demonstrably false.

      Truly complex systems like ecosystems and climate might qualify for this assertion, where humans have interfered, often with the best intentions, but caused unexpected effects that may be beyond human capacity to control.

  • dosnem 12 hours ago

    How does knowing this help you avoid these problems? It doesn’t seem to provide any guidance on what to do in the face of complex systems

    • tptacek 12 hours ago

      He's literally writing about Three Mile Island. He doesn't have anything to tell you about what concurrency primitives to use for your distributed DNS management system.

      But: given finite resources, should you respond to this incident by auditing your DNS management systems (or all your systems) for race conditions? Or should you instead figure out how to make the Droplet Manager survive (in some degraded state) a partition from DynamoDB without entering congestive collapse? Is the right response an identification of the "most faulty components" and a project plan to improve them? Or is it closing the human expertise/process gap that prevented them from throttling DWFM for 4.5 hours?

      Cook isn't telling you how to solve problems; he's asking you to change how you think about problems, so you don't rathole in obvious local extrema instead of being guided by the bigger picture.

      • dekhn 10 hours ago

        It's entirely unclear to me whether a system the size and scope of AWS could be re-thought using these principles, and whether they could successfully execute a complete restructuring of all their processes to reduce their failure rate a bit. It's a system that grew over time with many thousands of different developers, with a need to solve critical scaling issues that would have stopped the business in its tracks (far worse than this outage).

      • dosnem 11 hours ago

        I don't really follow what you are suggesting. If the system is complex and constantly evolving as the article states, you aren't going to be able to close any expertise/process gap. Operating in a degraded state is probably already built in; this was just a state of degradation they were not prepared for. You can't figure out all the degraded states to operate in, because by definition the system is complex.

      • doctorpangloss 10 hours ago

        Both documents are, "ceremonies for engineering personalities."

        Even you can't help it - "enumerating a list of questions" is a very engineering thing to do.

        Normal people don't talk or think like that. The way Cook is asking us to "think about problems" is kind of the opposite of what good leadership looks like. Thinking about thinking about problems is like, 200% wrong. On the contrary, be way more emotional and way simpler.

      • cyberax 10 hours ago

        Another point is that DWFM is likely working in a privileged, isolated network because it needs access deep into the core control plane. After all, you don't want a rogue service to be able to add a malicious agent to a customer's VPC.

        And since this network is privileged, observability tools, debugging support, and even maybe access to it are more complicated. Even just the set of engineers who have access is likely more limited, especially at 2AM.

        Should AWS relax these controls to make recovery easier? But then it will also result in a less secure system. It's again a trade-off.

  • user3939382 7 hours ago

    Nobody discussing the problem understands it.

  • sabareesh 11 hours ago

    [flagged]

    • Dylan16807 8 hours ago

      I picked a random bullet point to read (9) and I'm pretty sure it's complete nonsense. That's not an example of defensiveness leading to new problems.

      Doing this isn't helpful.

      Edit: Oops I looked at 8, it's also wrong, the Enactor setting the plan wasn't locally rational, it made a clear mistake. Also that claim has nothing to do with the rest of the paragraph! This output is so bad.

stefan_bobev 15 hours ago

I appreciate the details this went through, especially laying out the exact timelines of operations and how overlaying those timelines produces unexpected effects. One of my all time favourite bits about distributed systems comes from the (legendary) talk at GDC - I Shot You First[1] - where the speaker describes drawing sequence diagrams with tilted arrows to represent the flow of time and asking "Where is the lag?". This method has saved me many times, all throughout my career from making games, to livestream and VoD services to now fintech. Always account for the flow of time when doing a distributed operation - time's arrow always marches forward, your systems might not.

But the stale read didn't scare me nearly as much as this quote:

> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues

Everyone can make a distributed system mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have a recovery procedure. Maybe I am reading too much into it, maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient.

[1] https://youtu.be/h47zZrqjgLc?t=1587

  • tptacek 14 hours ago

    It shouldn't scare you. It should spark recognition. This meta-failure-mode exists in every complex technological system. You should be, like, "ah, of course, that makes sense now". Latent failures are fractally prevalent and have combinatoric potential to cause catastrophic failures. Yes, this is a runbook they need to have, but we should all understand there are an unbounded number of other runbooks they'll need and won't have, too!

    • lazystar 14 hours ago

      The thing that scares me is that AI will never be able to diagnose an issue that it has never seen before. If there are no runbooks, there is no pattern recognition. This is something I've been shouting about for 2 years now; hopefully this issue makes AWS leadership understand that current-gen AI can never replace human engineering.

      • tptacek 14 hours ago

        I'm much less confident in that assertion. I'm not bullish on AI systems independently taking over operations from humans, but catastrophic outages are combinations of less-catastrophic outages which are themselves combinations of latent failures, and when the latent failures are easy to characterize (as is the case here!), LLMs actually do really interesting stuff working out the combinatorics.

        I wouldn't want to, like, make a company out of it (I assume the foundational model companies will eat all these businesses) but you could probably do some really interesting stuff with an agent that consumes telemetry and failure model information and uses it to surface hypos about what to look at or what interventions to consider.

        All of this is beside my original point, though: I'm saying, you can't runbook your way to having a system as complex as AWS run safely. Safety in a system like that is a much more complicated process, unavoidably. Like: I don't think an LLM can solve the "fractal runbook requirement" problem!

      • janalsncm 4 hours ago

        AI is a lot more than just LLMs. Running through the rat's nest of interdependent systems like AWS has is exactly what symbolic AI was good at.

      • Aeolun 6 hours ago

        I think millions of systems have failed due to missing DNS records though.

  • gtowey 5 hours ago

    It's shocking to me too, but not very surprising. It's probably a combination of factors that caused this failure of planning, and I've seen it play out the same way at lots of companies.

    I bet the original engineers planned for, and designed the system to be resilient to, this cold start situation. But over time those engineers left, and new people took over -- people who didn't fully understand and appreciate the complexity, and probably didn't care that much about all the edge cases. Then, pushed by management to pursue goals that are antithetical to reliability, such as cost optimization and other things, the new failure case was introduced through lots of suboptimal changes. The result is as we see it -- a catastrophic failure which caught everyone by surprise.

    It's the kind of thing that happens over and over again when the accountants are in charge.

  • throwdbaaway 4 hours ago

    > But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have recovery procedure.

    I guess they don't have a recovery procedure for the "congestive collapse" edge case. I have seen something similar, so I wouldn't be frowning at this.

    A couple of red flags though:

    1. Apparent lack of load-shedding support by this DWFM, such that a server reboot had to be performed. Need to learn from https://aws.amazon.com/builders-library/using-load-shedding-...

    2. Having DynamoDB as a dependency of this DWFM service, instead of something more primitive like Chubby. Need to learn more about distributed systems primitives from https://www.youtube.com/watch?v=QVvFVwyElLY

jasode a day ago

So the DNS records' if-stale-then-needs-update logic was basically a variation of one of the "2 Hard Things In Computer Science" - cache invalidation. Excerpt from the giant paragraph:

>[...] Right before this event started, one DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints. As it was slowly working through the endpoints, several other things were also happening. First, the DNS Planner continued to run and produced many newer generations of plans. Second, one of the other DNS Enactors then began applying one of the newer plans and rapidly progressed through all of the endpoints. The timing of these events triggered the latent race condition. When the second Enactor (applying the newest plan) completed its endpoint updates, it then invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them. At the same time that this clean-up process was invoked, the first Enactor (which had been unusually delayed) applied its much older plan to the regional DDB endpoint, overwriting the newer plan. The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time due to the unusually high delays in Enactor processing. [...]

It outlines some of the mechanics but some might think it still isn't a "Root Cause Analysis" because there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?!? Human error misconfiguration causing unintended delays in Enactor behavior?!? Either the previous sequence of events leading up to that is considered unimportant, or Amazon is still investigating what made Enactor behave in an unpredictable way.
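
To make the shape of that race concrete, here's a toy sketch (Python, made-up names, obviously nothing like AWS's actual code) of the check-then-act pattern the excerpt describes: the "is my plan newer?" check happens at the start, the apply happens much later, and a sufficiently delayed Enactor overwrites the newer plan:

    import threading, time

    applied_plan_version = 100   # hypothetical version currently live on the endpoint

    def enactor(my_plan_version, delay):
        global applied_plan_version
        # 1. Check at the *start*: is my plan newer than what's currently applied?
        if my_plan_version <= applied_plan_version:
            return
        # 2. The "unusually high delays" land here, between the check and the apply.
        time.sleep(delay)
        # 3. Apply anyway, trusting the now-stale check: last write wins.
        applied_plan_version = my_plan_version
        print("applied plan", my_plan_version)

    slow = threading.Thread(target=enactor, args=(101, 2.0))   # delayed Enactor, older plan
    fast = threading.Thread(target=enactor, args=(150, 0.1))   # healthy Enactor, newer plan
    slow.start(); fast.start(); slow.join(); fast.join()
    print("final:", applied_plan_version)   # 101 -- the older plan overwrote the newer one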

  • donavanm a day ago

    This is public messaging to explain the problem at large. This isn't really a post-incident analysis.

    Before the active incident is "resolved" there's an evaluation of probable/plausible recurrence. Usually we/they would have potential mitigations and recovery runbooks prepared as well to quickly react to any recurrence. Any likely open risks are actively worked to mitigate before the immediate issue is considered resolved. That includes around-the-clock dev team work if it's the best known path to mitigation.

    Next, any plausible paths to "risk of recurrence" would be top dev team priority (business hours) until those action items are completed and in deployment. That might include other teams with similar DIY DNS management, other teams who had less impactful queue depth problems, or other similar "near miss" findings. Service team tech & business owners (PE, Sr PE, GM, VP) would be tracking progress daily until resolved.

    Then in the next few weeks at org & AWS level “ops meetings” there are going to be the in depth discussions of the incident, response, underlying problems, etc. the goal there being organizational learning and broader dissemination of lessons learned, action items, best practice etc.

  • Cicero22 16 hours ago

    My takeaway was that the race condition was the root cause. Take away that bug, and suddenly there's no incident, regardless of any processing delays.

    • _alternator_ 16 hours ago

      Right. Sounds like it's a case of "rolling your own distributed system algorithm" without the up-front investment in implementing a truly robust distributed system.

      Often network engineers are unaware of some of the tricky problems that DS research has addressed/solved in the last 50 years because the algorithms are arcane and heuristics often work pretty well, until they don’t. But my guess is that AWS will invest in some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates.

      Consider this a nudge for all you engineers that are designing fault tolerant distributed systems at scale to investigate the problem spaces and know which algorithms solve what problems.

      • foobarian 15 hours ago

        > some serious redesign of the system, hopefully with some rigorous algorithms underpinning the updates

        Reading these words makes me break out in cold sweat :-) I really hope they don't

      • dboreham 15 hours ago

        Certainly seems like misuse of DNS. It wasn't designed to be a rapidly updatable consistent distributed database.

        • pyrolistical 9 hours ago

          I think historically DNS was “best effort” but with consensus algorithms like raft, I can imagine a DNS that is perfectly consistent

      • withinboredom 12 hours ago

        Further, please don’t stop at RAFT. RAFT is popular because it is easy to understand, not because it is the best way to do distributed consensus. It is non-deterministic (thus requiring odd numbers of electors), requires timeouts for liveness (thus latency can kill you), and isn’t all that good for general-purpose consensus, IMHO.

  • dustbunny 18 hours ago

    Why is the "DNS Planner" and "DNS Enactor" separate? If it was one thing, wouldn't this race condition have been much more clear to the people working on it? Is this caused by the explosion of complexity due to the over use of the microservice architecture?

    • supportengineer 16 hours ago

      It probably was a single-threaded python script until somebody found a way to get a Promo out of it.

      • placardloop 10 hours ago

        This is Amazon we’re talking about, it was probably Perl.

    • neom 16 hours ago

      Pick your battles, I'd guess. Given how huge AWS is, if you have desired state vs. reconciler, you probably have more resilient operations generally and an easier job of finding and isolating problems; the flip side of that is if you screw up your error handling, you get this. That aside, it seems strange to me they didn't account for the fact that a stale plan could get picked up over a new one, so maybe I misunderstand the incident/architecture.

    • bananapub 16 hours ago

      > Why is the "DNS Planner" and "DNS Enactor" separate?

      for a large system, it's in practice very nice to split up things like that - you have one bit of software that just reads a bunch of data and then emits a plan, and then another thing that just gets given a plan and executes it.

      this is easier to test (you're just dealing with producing one data structure and consuming one data structure, the planner doesn't even try to mutate anything), it's easier to restrict permissions (one side only needs read access to the world!), it's easier to do upgrades (neither side depends on the other existing or even being in the same language), it's safer to operate (the planner is disposable, it can crash or be killed at any time with no problem except update latency), it's easier to comprehend (humans can examine the planner output which contains the entire state of the plan), it's easier to recover from weird states (you can in extremis hack the plan) etc etc. these are all things you appreciate more and more and your system gets bigger and more complicated.

      > If it was one thing, wouldn't this race condition have been much more clear to the people working on it?

      no

      > Is this caused by the explosion of complexity due to the over use of the microservice architecture?

      no

      it's extremely easy to second-guess the way other people decompose their services since randoms online can't see any of the actual complexity or any of the details and so can easily suggest it would be better if it was different, without having to worry about any of the downsides of the imagined alternative solution.

      • tuckerman 13 hours ago

        Agreed, this is a common division of labor and simplifies things. It's not entirely clear in the postmortem but I speculate that the conflation of duties (i.e. the enactor also being responsible for janitor duty of stale plans) might have been a contributing factor.

        The Oxide and Friends folks covered an update system they built that is similarly split and they cite a number of the same benefits as you: https://oxide-and-friends.transistor.fm/episodes/systems-sof...

        • jiggawatts 8 hours ago

          I would divide these as functions inside a monolithic executable. At most, emit the plan to a file on disk as a "--whatif" optional path.

          Distributed systems with files as a communication medium are much more complex than programmers think with far more failure modes than they can imagine.

          Like… this one, that took out a cloud for hours!

      • Anon1096 15 hours ago

        I mean, any time a service even 1/100th the size of AWS goes down, you have people crawling out of the woodwork giving armchair advice while having no domain-relevant experience. It's barely even worth taking the time to respond. The people with opinions of value are already giving them internally.

        • lazystar 14 hours ago

          > The people with opinions of value are already giving them internally.

          interesting take, in light of all the brain drain that AWS has experienced over the last few years. some outside opinions might be useful - but perhaps the brain drain is so extreme that those remaining don't realize it's occurring?

    • jiggawatts 8 hours ago

      This was my thought also. The first sentences of the RCA screamed “race condition” without even having to mention the phrase.

      The two DNS components comprise a monolith: neither is useful without the other and there is one arrow on the design coupling them together.

      If they were a single component then none of this would have happened.

      Also, version checks? Really?

      Why not compare the current state against the desired state and take the necessary actions to bring them in line?

      Last but not least, deleting old config files so aggressively is a "penny wise, pound foolish" design. I would keep these forever, or at least a month! Certainly much, much longer than any possible time taken through the sequence of provisioning steps.

      • UltraSane 8 hours ago

        Yes it should be impossible for all DNS entries to get deleted like that.

  • ignoramous 15 hours ago

    > ...there's no satisfying explanation of _why_ there were "unusually high delays in Enactor processing". Hardware problem?

    Can't speak for the current incident but a similar "slow machine" issue once bit our BigCloud service (not as big an incident, thankfully) due to loooong JVM GC pauses on failing hardware.

  • mcmoor a day ago

    Also, I don't know if I missed it, but they don't seem to establish anything to prevent an outage if there's unusually high delay again?

    • mattcrox a day ago

      It's at the end: they disabled the DDB DNS automations around this to fix the issue before they re-enable them.

      • mcmoor 4 hours ago

        If it's re-enabled (without change?), wouldn't an unusually high delay break it again?

ecnahc515 15 hours ago

Seems like the enactor should be checking the version/generation of the current record before it applies the new value, to ensure it never applies an old plan on top of a record updated by a new plan. It wouldn't be as efficient, but that's just how it is. It's a basic compare-and-swap operation, so it could be handled easily within DynamoDB itself, where these records are stored.
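
Something like this, assuming the applied state lives in a DynamoDB item with a numeric plan_version attribute (a sketch with boto3; the table and attribute names are made up):

    import boto3
    from boto3.dynamodb.conditions import Attr
    from botocore.exceptions import ClientError

    table = boto3.resource("dynamodb").Table("dns-endpoint-state")  # hypothetical table

    def apply_plan(endpoint: str, plan_version: int, records: list) -> bool:
        # Compare-and-swap: only write if this plan is newer than whatever is already applied.
        try:
            table.put_item(
                Item={"endpoint": endpoint, "plan_version": plan_version, "records": records},
                # Reject the write if a newer (or equal) plan has already been applied.
                ConditionExpression=Attr("plan_version").not_exists() | Attr("plan_version").lt(plan_version),
            )
            return True
        except ClientError as e:
            if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # lost the race to a newer plan; do nothing
            raise

The point is that the condition check and the write happen atomically on the server side, which is exactly the property the start-of-run version check lacked.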

baalimago 2 hours ago

Did they intentionally make it dense and complicated to discourage anyone from actually reading it..?

776 words in a single paragraph

al_be_back 9 hours ago

Postmortem all you want - the internet is breaking, hard.

The internet was born out of the need for Distributed networks during the cold war - to reduce central points of failure - a hedging mechanism if you will.

Now it has consolidated into an ever smaller number of mono-nets. A simple mistake in one deployment can bring banking, shopping, and travel to a halt globally. This can only get much worse when cyber warfare gets involved.

Personally, I think the cloud metaphor has been overstretched and has long since burst.

For R&D, early stage start-ups and occasional/seasonal computing, cloud works perfectly (similar to how time-sharing systems used to work).

For well established/growth businesses and gov, you better become self-reliant and tech independent: own physical servers + own cloud + own essential services (db, messaging, payment).

There's no shortage of affordable tech, know-how or workforce.

  • anyonecancode 7 hours ago

    > The internet was born out of the need for Distributed networks during the cold war - to reduce central points of failure - a hedging mechanism if you will.

    I don't think the idea was that in the event of catastrophe, up to and including nuclear attack, the system would continue working normally, but that it would keep working. And the internet -- as a system -- certainly kept working during this AWS outage. In a degraded state, yes, but it was working, and recovered.

    I'm more concerned with the way the early public internet promised a different kind of decentralization -- of economics, power, and ideas -- and how _that_ has become heavily centralized. In which case, AWS, and Amazon, indeed do make a good example. The internet, as a system, is certainly working today, but arguably in a degraded state.

  • protocolture 6 hours ago

    >the internet is breaking, hard.

    I don't see that this is the case; it's just that more people want services over the internet from the same 3 places, which break irregularly.

    Internet infrastructure is as far as I can tell, getting better all the time.

    The last big BGP bug had 1/10th the comments of the AWS one. And had much less scary naming (ooooh routing instability)

    https://news.ycombinator.com/item?id=44105796

    >The internet was born out of the need for Distributed networks during the cold war - to reduce central points of failure - a hedging mechanism if you will.

    Instead of arguing about the need that birthed the internet, I will simply say that the internet still works in the same largely distributed fashion. Maybe you mean Web instead of Internet?

    The issue here is that "Internet" isn't the same as "Things you might access on the Internet". The Internet held up great during this adventure. As far as I can tell it was returning 404s and 502s without incident. The distributed networks were networking distributedly. If you wanted to send and receive packets with any internet-joined human in a way that didn't rely on some AWS-hosted application, that was still very possible.

    >A simple mistake in on one deployment could bring banking, shopping and travel to a halt globally.

    Yeah, but for how long and for how many people? The last 20 years have been a burn-in test for a lot of big industries on crappy infrastructure. It looks like nearly everyone has been dragged kicking and screaming into the future.

    I mean the entire shipping industry got done over the last decade.

    https://www.zdnet.com/article/all-four-of-the-worlds-largest...

    >Personally, I think the cloud metaphor has overstretched and has long burst.

    It was never very useful.

    >For well established/growth businesses and gov, you better become self-reliant and tech independent

    For these businesses, they just go out and get themselves some region/vendor redundancy. Lots of applications fell over during this outage, but lots of teams are also getting internal praise for designing their systems robustly and avoiding its fallout.

    >There's no shortage of affordable tech, know-how or workforce.

    Yes, and these people often know how to design cloud infrastructure to avoid these issues, or are smart enough to warn people that if their region or its dependencies fail without redundancy, they are taking a nose dive. Businesses will make business decisions and review those decisions after getting publicly burnt.

JCM9 7 hours ago

Good to see a detailed summary. The frustration from a customer perspective is that AWS continues to have these cross-region issues and they continue to be very secretive about where these single points of failure exist.

The region model is a lot less robust if core things in other regions require US-East-1 to operate. This has been an issue in previous outages and appears to have struck again this week.

It is what it is, but AWS consistently oversells the robustness of regions as fully separate when events like Monday reveal they’re really not.

  • Arainach 7 hours ago

    >about where these single points of failure exist.

    In general, when you find one you work to fix it, and one of the most common ways to find more is when one of them fails. Having single points of failure and letting them live isn't the standard practice at this scale.

gslin a day ago

I believe a report with timezone not using UTC is a crime.

  • tguedes 15 hours ago

    I think it makes sense in this instance. Because this occurred in us-east-1, the vast majority of affected customers are US based. For most people, it's easier to do the timezone conversion from PT than UTC.

    • thayne 11 hours ago

      But us-east-1 is in Eastern Time, so if you aren't going to use UTC, why not use that?

      I'm guessing PT was chosen because the people writing this report are in PT (where Amazon headquarters is).

    • trenchpilgrim 13 hours ago

      us-east-1 is an exceptional Amazon region; it hosts many global services as well as services which are not yet available in other regions. Most AWS customers worldwide probably have an indirect dependency on us-east-1.

  • cheeze 17 hours ago

    My guess is that PT was chosen to highlight the fact that this happened in the middle of the night for most of the responding ops folks.

    (I don't know anything here, just spitballing why that choice would be made)

    • throitallaway 15 hours ago

      Their headquarters is in Seattle (Pacific Time.) But yeah, I hate time zones.

__turbobrew__ 14 hours ago

From a meta analysis level: bugs will always happen, formal verification is hard, and sometimes it just takes a number of years to have some bad luck (I have hit bugs which were over 10 years old but due to low probability of them occurring they didn’t happen for a long time).

If we assume that the system will fail, I think the logical thing to think about is how to limit the effects of that failure. In practice this means cell based architecture, phased rollouts, and isolated zones.

To my knowledge AWS does attempt to implement cell based architecture, but there are some cross region dependencies specifically with us-east-1 due to legacy. The real long term fix for this is designing regions to be independent of each other.

This is a hard thing to do, but it is possible. I have personally been involved in disaster testing where a region was purposely firewalled off from the rest of the infrastructure. You find out very quick where those cross region dependencies lie, and many of them are in unexpected places.

Usually this work is not done due to lack of upper-level VP support and funding, and it is easier to stick your head in the sand and hope bad things don't happen. The strongest supporters of this work are going to be the shareholders who are in it for the long run. If the company goes poof due to improper disaster testing, the shareholders are going to be the main bag holders. Making the board aware of the risks and the estimated probability of fundamentally company-ending events can help get this work funded.

shayonj a day ago

I was kinda surprised by the lack of CAS on a per-endpoint plan version, or of patterns like rejecting stale writes via 2PC or a single-writer lease per endpoint.

Definitely a painful one with good learnings and kudos to AWS for being so transparent and detailed :hugops:

  • donavanm 21 hours ago

    See https://news.ycombinator.com/item?id=45681136. The actual DNS mutation API does, effectively, CAS. They had multiple unsynchronized writers who raced without logical constraints or ordering to the changes. Without thinking much, they _might_ have been able to implement something like a vector either through updating the zone serial or another "sentinel record" that was always used for ChangeRRSets affecting that label/zone; like a TXT record containing a serialized change set number or a "checksum" of the old + new state.

    I'm guessing the "plans" aspect skipped that and they were just applying intended state, without trying to serialize them. And last-write-wins, until it doesn't.
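
    A sketch of what I mean by the sentinel approach (illustrative only, names made up): a ChangeBatch is applied atomically, and a DELETE has to match the existing record exactly, so pairing every real change with a DELETE of the old sentinel value and a CREATE of the new one gets you a crude compare-and-swap:

        import boto3

        r53 = boto3.client("route53")
        ZONE_ID = "ZEXAMPLE123"                           # hypothetical hosted zone
        SENTINEL = "_plan-version.dynamodb.example.com."  # hypothetical sentinel TXT record

        def txt(version: int) -> dict:
            return {"Name": SENTINEL, "Type": "TXT", "TTL": 60,
                    "ResourceRecords": [{"Value": f'"{version}"'}]}

        def apply_plan(expected_version: int, new_version: int, record_changes: list) -> None:
            # The DELETE fails (and the whole atomic batch is rejected) if another enactor
            # already bumped the sentinel, so a stale plan can't overwrite a newer one.
            r53.change_resource_record_sets(
                HostedZoneId=ZONE_ID,
                ChangeBatch={"Changes": [
                    {"Action": "DELETE", "ResourceRecordSet": txt(expected_version)},
                    {"Action": "CREATE", "ResourceRecordSet": txt(new_version)},
                    *record_changes,   # the actual endpoint record UPSERTs for this plan
                ]},
            )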

    • cyberax 10 hours ago

      Oh, I can see it from here. AWS internally has a problem with things like task orchestration. I bet that the enactor can be rewritten as a goroutine/thread in the planner, with proper locking and ordering.

      But that's too complicated and results in more code. So they likely just used an SQS queue with consumers reading from it.

grogers 7 hours ago

> As this plan was deleted, all IP addresses for the regional endpoint were immediately removed.

I feel like I am missing something here... They make it sound like the DNS enactor basically diffs the current state of DNS with the desired state, and then submits the adds/deletes needed to make the DNS go to the desired state.

With the racing writers, wouldn't that have just made the DNS go back to an older state? Why did it remove all the IPs entirely?

  • Aeolun 6 hours ago

    Enactor 1: reads state, oh, I need to delete all this.

    Enactor 2: reads state, oh, I need to write all this.

    Enactor 2: writes.

    Enactor 1: deletes.

    Or some variant of that anyway. It happens in any system that has concurrent reader/writers and no locks.

everfrustrated 20 hours ago

>Services like DynamoDB maintain hundreds of thousands of DNS records to operate a very large heterogeneous fleet of load balancers in each Region

Does that mean a DNS query for dynamodb.us-east-1.amazonaws.com can resolve to one of a hundred thousand IP addresses?

That's insane!

And also well beyond the limits of route53.

I'm wondering if they're constantly updating route53 with a smaller subset of records and using a low ttl to somewhat work around this.

  • supriyo-biswas 17 hours ago

    DNS-based CDNs are also effectively this: collect data from a datastore regarding system usage, packet loss, latency, etc., and compute a table of viewer networks and preferred PoPs.

    Unfortunately hard documentation is difficult to provide, but that's how the CDN worked at a place I used to work for; there's also another CDN[1] which talks about the same thing in fancier terms.

    [1] https://bunny.net/network/smartedge/

    • donavanm 11 hours ago

      Akamai talked about it in the early 2000s. Facebook content folks had a decent paper describing the latency collection and realtime routing around 2011ish, something like "pinpoint" I want to say. Though, as you say, it was industry practice before then.

  • donavanm 11 hours ago

    Some details, but yeah, that's basically how all AWS DNS works. I think you're missing how labels, zones, and domains are related but distinct. And that R53 operates on resource record SETS. And there are affordances in the set relationships to build trees and logic for selecting an appropriate set (e.g. healthcheck, latency).

    > And also well beyond the limits of route53

    Ipso facto, R53 can do this just fine. Where do you think all of your public EC2, ELB, RDS, API Gateway, etc etc records are managed and served?

  • thayne 11 hours ago

    I haven't tested with DynamoDB, but I once ran a loop of DNS lookups for S3, and in a couple of seconds I got hundreds of distinct IP addresses. And that was just for a single region, from a single source IP.
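
    Roughly this kind of loop, if you want to try it yourself (a quick sketch; your local resolver's caching may hide some of the churn):

        import socket, time

        seen = set()
        for _ in range(50):
            # each query can return a different slice of the load balancer fleet
            for family, _, _, _, sockaddr in socket.getaddrinfo("s3.us-east-1.amazonaws.com", 443):
                seen.add(sockaddr[0])
            time.sleep(0.1)

        print(len(seen), "distinct IPs")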

  • rescbr 8 hours ago

    > And also well beyond the limits of route53.

    One thing is the internal limit, another thing is the customer-facing limit.

    Some hard limits are softer than they appear.

lazystar a day ago

> Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues.

interesting.

pelagicAustral 16 hours ago

Had no idea Dynamo was so intertwined with the whole AWS stack.

  • freedomben 16 hours ago

    Yeah, for better or worse, AWS is a huge dogfooder. It's nice to know they trust their stuff enough to depend on it themselves, but it's also scary to know that the blast radius of a failure in any particular service can be enormous

ericpauley a day ago

Interesting use of the phrase "Route53 transaction" for an operation that has no hard transactional guarantees. Especially given that the lack of transactional updates is what caused the outage…

  • donavanm 21 hours ago

    I think you misunderstand the failure case. The ChangeResourceRecordSet is transactional (or was when I worked on the service) https://docs.aws.amazon.com/Route53/latest/APIReference/API_....

    The fault was two different clients with divergent goal states:

    - one ("old") DNS Enactor experienced unusually high delays needing to retry its update on several of the DNS endpoints

    - the DNS Planner continued to run and produced many newer generations of plans [Ed: this is key: it's producing "plans" of desired state, which do not include a complete transaction like a log or chain with previous state + mutations]

    - one of the other ("new") DNS Enactors then began applying one of the newer plans

    - then ("new") invoked the plan clean-up process, which identifies plans that are significantly older than the one it just applied and deletes them [Ed: the key race is implied here. The "old" Enactor is reading _current state_, which was the output of "new", and applying its desired "old" state on top. The discrepancy is because apparently Planner and Enactor aren't working with a chain/vector clock/serialized change set numbers/etc]

    - At the same time the first ("old") Enactor ... applied its much older plan to the regional DDB endpoint, overwriting the newer plan. [Ed: and here is where "old" Enactor creates the valid ChangeRRSets call, replacing "new" with "old"]

    - The check that was made at the start of the plan application process, which ensures that the plan is newer than the previously applied plan, was stale by this time [Ed: Whoops!]

    - The second Enactor’s clean-up process then deleted this older plan because it was many generations older than the plan it had just applied.

    Ironically Route 53 does have strong transactions of API changes _and_ serializes them _and_ has closed-loop observers to validate change sets globally on every dataplane host. So do other AWS services. And there are even some internal primitives for building replication or change set chains like this. But it's also a PITA and takes a bunch of work, and when it _does_ fail you end up with global deadlock and customers who are really grumpy that they don't see their DNS changes going into effect.

    • RijilV 16 hours ago

      Not for nothing, there’s a support group for those of us who’ve been hurt by WHU sev2s…

      • donavanm 11 hours ago

        Man, I always hated that phrasing; always tried to get people to use more precise terms like "customer change propagation." But yeah, who hasn't been punished by a query plan change or some random connectivity problem in Southeast Asia!

yla92 a day ago

So the root cause is basically race condition 101 stale read ?

  • philipwhiuk a day ago

    Race condition and bad data validation.

dilyevsky 13 hours ago

Sounds like they went with Availability over Correctness with this design but the problem is that if your core foundational config is not correct you get no availability either.

WaitWaitWha 16 hours ago

I gather the root cause was a latent race condition in the DynamoDB DNS management system that allowed an outdated DNS plan to overwrite the current one, resulting in an empty DNS record for the regional endpoint.

Correct?

  • tptacek 15 hours ago

    I think you have to be careful with ideas like "the root cause". They underwent a metastable congestive collapse. A large component of the outage was them not having a runbook to safely recover an adequately performing state for their droplet manager service.

    The precipitating event was a race condition with the DynamoDB planner/enactor system.

    https://how.complexsystems.fail/

    • 1970-01-01 13 hours ago

      Why can't a race condition bug be seen as the single root cause? Yes, there were other factors that accelerated collapse, but those are inherent to DNS, which is outside the scope of a summary.

      • tptacek 13 hours ago

        Because the DNS race condition is just one flaw in the system. The more important latent flaw† is probably the metastable failure mode for the droplet manager, which, when it loses connectivity to Dynamo, gradually itself loses connectivity with the Droplets, until a critical mass is hit where the Droplet manager has to be throttled and manually recovered.

        Importantly: the DNS problem was resolved (to degraded state) in 1hr15, and fully resolved in 2hr30. The Droplet Manager problem took much longer!

        This is the point of complex failure analysis, and why that school of thought says "root causing" is counterproductive. There will always be other precipitating events!

        † which itself could very well be a second-order effect of some even deeper and more latent issue that would be more useful to address!

        • 1970-01-01 13 hours ago

          Two different questions here.

          1. How did it break?

          2. Why did it collapse?

          A1: Race condition

          A2: What you said.

          • tptacek 12 hours ago

            What is the purpose of identifying "root causes" in this model? Is the root cause of a memory corruption vulnerability holding a stale pointer to a freed value, or is it the lack of memory safety? Where does AWS gain more advantage: in identifying and mitigating metastable failure modes in EC2, or in trying to identify every possible way DNS might take down DynamoDB? (The latter is actually not an easy question, but that's the point!)

            • 1970-01-01 12 hours ago

              Two things can be important for an audience. For most, it's the race condition lesson. Locks are there for a reason. For AWS, it's the stability lesson. DNS can and did take down the empire for several hours.

              • tptacek 12 hours ago

                Did DNS take it down, or did a pattern of latent failures take it down? DNS was restored fairly quickly!

                Nobody is saying that locks aren't interesting or important.

                • nickelpro 4 hours ago

                  The Droplet lease timeouts were an aggravating factor for the severity of the incident, but are not causative. Absent a trigger the droplet leases never experience congestive failure.

                  The race condition was necessary and sufficient for collapse. Absent corrective action it always leads to AWS going down. In the presence of corrective actions the severity of the failure would have been minor without other aggravating factors, but the race condition is always the cause of this failure.

                • dosnem 11 hours ago

                  This doesn’t really matter. This type of error gets the whole 5 why’s treatment and every why needs to get fixed. Both problems will certainly have an action item

        • cyberax 10 hours ago

          The droplet manager failure is a much more forgivable scenario. It happened because the "must always be up" service went down for an extended period of time, and the sheer number of actions needed for the recovery overwhelmed the system.

          The initial DynamoDB DNS outage was much worse. A bog-standard TOCTTOU for scheduled tasks that are assumed to be "instant". And the lack of controls that allowed one task to just blow up everything in one of the foundational services.

          When I was at AWS some years ago, there were calls to limit the blast radius by using cell architecture to create vertical slices of the infrastructure for critical services. I guess that got completely sidelined.

qrush 17 hours ago

Sounds like DynamoDB is going to continue to be a hard dependency for EC2, etc. I at least appreciate the transparency and hearing about their internal system names.

  • offmycloud 16 hours ago

    I think it's time for AWS to pull the curtain back a bit and release a JSON document that shows a list of all internal service dependencies for each AWS service.

    • mparnisari 7 hours ago

      I worked for AWS for two years and if I recall correctly, one of the issues was circular dependencies.

    • throitallaway 15 hours ago

      Would it matter? Would you base decisions on whether or not to use one of their products based on the dependency graph?

      • UltraSane 8 hours ago

        It would let you know that if services A and B both depend on service C, you can't use A and B to gain reliability.

      • withinboredom 12 hours ago

        Yes.

        • bdangubic 10 hours ago

          if so, I hate to tell you this but you would not use AWS (or any other cloud provider)!

          • withinboredom 4 hours ago

            I don't use AWS or any other cloud provider. I've used bare metal since 2012. See, in 2012 (IIRC), one fateful day, we turned off our bare metal machines and went full AWS. That afternoon, AWS had its first major outage. Prior to that day, the owner could walk in and ask what we were doing about it. That day, all we could do was twiddle our thumbs or turn on a now-outdated database replica. Surely AWS won't be out for hours, right? Right? With bare metal, you might be out for hours, but you can quickly get back to a degraded state, no matter what happens. With AWS, you're stuck with whatever they happen to fix first.

    • cyberax 10 hours ago

      A lot of internal AWS services have names that are completely opaque to outside users. Such a document will be pretty useless as a result.

  • UltraSane 8 hours ago

    They should at least split off dedicated isolated instances of DynamoDB to reduce blast radius. I would want at least 2 instances for every internal AWS service that uses it.

  • skywhopper 13 hours ago

    I mean, something has to be the baseline data storage layer. I’m more comfortable with it being DynamoDB than something else that isn’t pushed as hard by as many different customers.

    • UltraSane 8 hours ago

      The actual storage layer of DynamoDB is well engineered and has some formal proofs.

827a 10 hours ago

I made it about ten lines into this before realizing that, against all odds, I wasn't reading a postmortem, I was reading marketing material designed to sell AWS.

> Many of the largest AWS services rely extensively on DNS to provide seamless scale, fault isolation and recovery, low latency, and locality...

  • Aeolun 6 hours ago

    I didn’t get 10 lines in before I realized that this wall of text couldn’t possibly contain the actual reason. Somewhere behind all of that is an engineer saying “We done borked up and deleted the dynamodb DNS records”

joeyhage a day ago

> as is the case with the recently launched IPv6 endpoint and the public regional endpoint

It isn't explicitly stated in the RCA but it is likely these new endpoints were the straw that broke the camel's back for the DynamoDB load balancer DNS automation

martythemaniak 7 hours ago

It's not DNS
There's no way it's DNS
It was DNS

Velocifyer 13 hours ago

This is unreadable and terribly formatted.

  • citizenpaul 6 hours ago

    Yeah, for real, that's what an "industry leading" company puts out for their post mortem? They should be red in the face embarrassed. Jeeze, paragraphs? Punctuation? I put more effort into my internet comments, which won't even be read by millions of people.

    Looks like Amazon is starting to show cracks in the foundation.

galaxy01 a day ago

Would conditional read/write solve this? Looks like some kind of stale read.

bithavoc 14 hours ago

Does DynamoDB run on EC2? If I read it right, EC2 depends on DynamoDB.

  • dokument 13 hours ago

    There are circular dependencies within AWS, but also systems to account for this (especially for cold starting).

    Also, there really is no one AWS; each region is its own (now more than ever before, though some systems weren't built to support this).

alexnewman 13 hours ago

Is it the internal dynamodb that other people use?

LaserToy 18 hours ago

TLDR: A DNS automation bug removed all the IP addresses for the regional endpoints. The tooling that was supposed to help with recovery depends on the system it needed to recover. That’s a classic “we deleted prod” failure mode at AWS scale.

shrubble 15 hours ago

The BIND name server required each zone to have an increasing serial number.

So if you made a change you had to increase the number, usually a timestamp like 20250906114509, which would be older / lower-numbered than 20250906114702, making it easier to determine which zone file had the newest data.

Seems like they sort of had the same setup but with less rigidity in terms of refusing to load older files.
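
Conceptually the rule is just this (a toy sketch of the idea, not BIND itself):

    def should_load(current_serial: int, candidate_serial: int) -> bool:
        # Refuse to load a zone whose SOA serial isn't strictly newer. With
        # timestamp-style serials a plain comparison is enough; BIND proper uses
        # RFC 1982 serial arithmetic to cope with wraparound.
        return candidate_serial > current_serial

    assert should_load(20250906114509, 20250906114702)       # newer zone data: load it
    assert not should_load(20250906114702, 20250906114509)   # stale data: refuse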