• merc@sh.itjust.works · 2 points · 53 minutes ago

    I’ve worked on services with 5 nines of availability (i.e. 99.999% available, less than 5 minutes of downtime allowed per year). I’ve more frequently worked on ones with 4 nines, where you’re allowed almost an hour of downtime per year. GitHub is now barely maintaining 2 nines. That’s just embarrassing.

    Each “nine” you add is much more difficult. To get four nines you need people on call who can start working on a problem within 5 minutes and fix it within a few more minutes, and you can only get those calls once every couple of months. Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem because it would take too long for someone on-call to get their computer out, connect and authenticate. It requires warm backup systems that are sitting idle but ready to take over fully at a moment’s notice.

    A two nines system is allowed to be down for 100x as long as a four nines system, and 1000x as long as a five nines system. It’s almost 15 minutes of downtime allowed per day, compared to about 15 minutes every 3 months for a four-nines system. Gamers wouldn’t even put up with a two-nines system for a video game. It’s absurd to allow that for a critical piece of infrastructure for software.
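    Those downtime budgets are easy to check yourself; here's a quick sketch (plain Python, numbers rounded):

```python
# Allowed downtime per year for N "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(nines: int) -> float:
    unavailability = 10 ** -nines  # e.g. 2 nines -> 1% downtime allowed
    return unavailability * MINUTES_PER_YEAR

print(round(downtime_budget_minutes(2)))     # 5256 min/year (~14.4 min/day)
print(round(downtime_budget_minutes(4)))     # 53 min/year
print(round(downtime_budget_minutes(5), 1))  # 5.3 min/year
```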

  • raspberriesareyummy@lemmy.world · 27 points · 10 hours ago

    Nothing to make a point like snipping off the y-axis scaling.

    I hate Microslop like any person with > 2 brain cells, but that graph is useless - all visible y-entries end in a 0 - might as well be 99.990, 99.980, 99.970, …

      • ByteJunk@lemmy.world · 25 points · edited · 10 hours ago

        When contracting a service, there are usually clauses specifying that it needs to be fully working and available x% of the time, and compensation may be due if this goal isn’t met.

        Let’s say GitHub was down for 1 full day in the last year; that works out to roughly 99.7% availability. That’s “2 nines”, though sometimes people say “2 nines five”, meaning “better than 99.5% uptime”.

        I’d say that the expectation for a high availability service nowadays is “5 nines”: 99.999% uptime. That’s around 5 minutes of downtime in a full year. This kind of performance from a site like GitHub is just unacceptable…
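        The 99.7% figure above is easy to verify; a minimal sketch:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def availability_pct(downtime_hours: float) -> float:
    # Fraction of the year the service was up, as a percentage.
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)

print(f"{availability_pct(24):.2f}%")  # one full day down -> 99.73%
```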

        • dohpaz42@lemmy.world · 39 points · 12 hours ago

          Sometimes the humorous term “nine fives” (55.5555555%) is used to contrast with “five nines” (99.999%),[18][19][20] though this is not an actual goal….

          Maybe Microsoft misunderstood the assignment, and thought this was a goal. At their current rate, it’s certainly more achievable than the more traditional “five nines”.

          As an aside, I love how the following is prefaced as “casual”, and then the author starts arguing semantics:

          Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then “five”, so 99.95% is “three nines five”, abbreviated 3N5.[13][14] This is casually referred to as “three and a half nines”,[15] but this is incorrect….

  • DahGangalang@infosec.pub · 46 points · 13 hours ago

    Obv a gross-looking chart, but I’m bothered that the left-hand scale is trimmed off. I expect those are 10% increments, but I wouldn’t be shocked if the original was like 99.0, 98.0, 97.0, etc.

  • MBech@feddit.dk · 3 points · edited · 12 hours ago

    How does this correspond with growth? I imagine having 100% uptime is much harder the bigger a platform is, so did GitHub grow a lot in the same period?

    I’m not questioning whether or not Microsoft has issues; I just find it relevant whether or not they very suddenly saw a 2000% increase in server usage or something.

    • jatone@lemmy.dbzer0.com · 17 points · edited · 11 hours ago

      I imagine having 100% uptime is much harder the bigger a platform is, so did GitHub grow a lot in the same period?

      It’s not. There are scale points where, once you hit a critical number, you need to re-architect your backend: 1k, 10k, 1 mil, etc. These usually vary based on your app, but they’re roughly exponential, so once you hit the higher levels it takes much longer to reach the next one.

      On top of that, by the higher tiers you usually have proper backpressure and signals being sent to the frontend systems to dynamically manage the load generated, so suddenly uptime is much easier.
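      A toy sketch of that backpressure idea (class and method names are made up for illustration): the frontend sheds load when too many requests are already in flight, instead of letting queues grow until the service falls over.

```python
import threading

class BackpressureGate:
    """Reject new work once a fixed number of requests are in flight."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: False means "saturated, tell the caller to retry later".
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call when a request finishes to free its slot.
        self._slots.release()

gate = BackpressureGate(max_in_flight=2)
print(gate.try_acquire())  # True
print(gate.try_acquire())  # True
print(gate.try_acquire())  # False -> shed load instead of queueing
```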

      When you see large repeated failures like this, the cause is almost always the corporate side causing issues:

      • reducing the engineering budget
      • not listening to the engineering department on product decisions (see the recent product-manager AI-generated commit that got merged and caused a mild uproar over “co-authored by Copilot”)
      • rushing nonsense out before it’s ready

      In this particular case, I bet it’s cutting engineering headcount plus an increase in AI-slop-generated code without proper review by engineers, which I’ve been hearing a lot more about from my engineering friends.