• merc@sh.itjust.works · 2 points · 53 minutes ago

    I’ve worked on services with 5 nines of availability (i.e. 99.999% available, less than 5 minutes of downtime allowed per year). I’ve more frequently worked on ones with 4 nines, where you’re allowed almost an hour of downtime per year. GitHub is now barely maintaining 2 nines. That’s just embarrassing.

    Each “nine” you add is much more difficult. To get four nines you need people on call who can start working on a problem within 5 minutes and fix it within a few more minutes, and you can only get those calls once every couple of months. Five nines means that you need people at their desks in shifts ready to start fixing something the moment there’s a problem because it would take too long for someone on-call to get their computer out, connect and authenticate. It requires warm backup systems that are sitting idle but ready to take over fully at a moment’s notice.

    A two nines system is allowed to be down for 100x as long as a four nines system, and 1000x as long as a five nines system. It’s almost 15 minutes of downtime allowed per day, compared to about 15 minutes every 3 months for a four-nines system. Gamers wouldn’t even put up with a two-nines system for a video game. It’s absurd to allow that for a critical piece of infrastructure for software.
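    Those downtime budgets are easy to check yourself; here's a quick sketch (plain Python, numbers rounded):

```python
# Allowed downtime per year for N "nines" of availability.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(nines: int) -> float:
    unavailability = 10 ** -nines  # e.g. 2 nines -> 1% downtime allowed
    return unavailability * MINUTES_PER_YEAR

print(round(downtime_budget_minutes(2)))     # 5256 min/year (~14.4 min/day)
print(round(downtime_budget_minutes(4)))     # 53 min/year
print(round(downtime_budget_minutes(5), 1))  # 5.3 min/year
```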

  • raspberriesareyummy@lemmy.world · 27 points · 10 hours ago

    Nothing to make a point like snipping off the y-axis scaling.

    I hate Microslop like any person with > 2 brain cells, but that graph is useless - all visible y-entries end in a 0 - might as well be 99.990, 99.980, 99.970, …

      • ByteJunk@lemmy.world · 25 points · edited · 10 hours ago

        When contracting a service, there are usually clauses specifying that it needs to be fully working and available x% of the time, and compensation may be due if this goal isn’t met.

        Let’s say GitHub was down for 1 full day in the last year; that works out to roughly 99.7% availability. That’s “2 nines”, though sometimes people say “2 nines five”, meaning “better than 99.5% uptime”.

        I’d say that the expectation for a high availability service nowadays is “5 nines”: 99.999% uptime. That’s around 5 minutes of downtime in a full year. This kind of performance from a site like GitHub is just unacceptable…
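        The 99.7% figure above is easy to verify; a minimal sketch:

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def availability_pct(downtime_hours: float) -> float:
    # Fraction of the year the service was up, as a percentage.
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)

print(f"{availability_pct(24):.2f}%")  # one full day down -> 99.73%
```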

        • dohpaz42@lemmy.world · 39 points · 12 hours ago

          Sometimes the humorous term “nine fives” (55.5555555%) is used to contrast with “five nines” (99.999%),[18][19][20] though this is not an actual goal….

          Maybe Microsoft misunderstood the assignment, and thought this was a goal. At their current rate, it’s certainly more achievable than the more traditional “five nines”.

          As an aside, I love how the following is prefaced as “casual”, and then the author starts arguing semantics:

          Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then “five”, so 99.95% is “three nines five”, abbreviated 3N5.[13][14] This is casually referred to as “three and a half nines”,[15] but this is incorrect….

  • DahGangalang@infosec.pub · 46 points · 13 hours ago

    Obv a gross-looking chart, but I’m bothered that the left-hand scale is trimmed off. I expect those are 10% increments, but I wouldn’t be shocked if the original was like 99.0, 98.0, 97.0, etc.

  • MBech@feddit.dk · 3 points · edited · 12 hours ago

    How does this correspond with growth? I imagine having 100% uptime is much harder the bigger a platform is, so did GitHub grow a lot in the same period?

    I’m not questioning whether or not Microsoft has issues; I just find it relevant whether or not they very suddenly saw a 2000% increase in server usage or something.

    • jatone@lemmy.dbzer0.com · 17 points · edited · 11 hours ago

      I imagine having 100% uptime is much harder the bigger a platform is, so did GitHub grow a lot in the same period?

      It’s not. There are scale points where, once you hit a critical number, you need to re-architect your backend: 1k, 10k, 1 mil, etc. These usually vary based on your app, but they’re roughly exponential, so once you hit the higher levels it takes much longer to reach the next one.

      On top of that, by the higher tiers you usually have proper backpressure and signals being sent to the frontend systems to dynamically manage the load generated, so suddenly uptime is much easier.
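      A toy sketch of that backpressure idea (class and method names are made up for illustration): the frontend sheds load when too many requests are already in flight, instead of letting queues grow until the service falls over.

```python
import threading

class BackpressureGate:
    """Reject new work once a fixed number of requests are in flight."""

    def __init__(self, max_in_flight: int):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: False means "saturated, tell the caller to retry later".
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        # Call when a request finishes to free its slot.
        self._slots.release()

gate = BackpressureGate(max_in_flight=2)
print(gate.try_acquire())  # True
print(gate.try_acquire())  # True
print(gate.try_acquire())  # False -> shed load instead of queueing
```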

      When you see large repeated failures like this, the cause is almost always the corporate side causing issues:

      • reducing the engineering budget
      • not listening to the engineering department on product decisions (see the recent product-manager AI-generated commit that got merged and caused a mild uproar over “co-authored by Copilot”)
      • rushing nonsense out before it’s ready

      In this particular case, I bet it’s cutting engineering headcount plus an increase in AI-slop-generated code without proper review by engineers, which I’ve been hearing a lot more about from my engineering friends.