Here are the details about what went wrong on Friday.
I feel like that’s not even close to what the real number is, considering the impact it had.
If this figure is accurate, the massive impact was likely due to collateral damage. If this took down every server at an enterprise and left most of the workstations online, those workstations were still basically paperweights.
They have about 24,000 clients so that comes out to around 350 impacted machines per client which is reasonable. It only takes a few impacted machines for thousands of people to be impacted if they are important enough.
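For anyone who wants to check that per-client figure, here is a minimal sketch. It assumes the 8.5 million device count quoted elsewhere in this thread and the roughly 24,000 clients mentioned above; both are ballpark numbers, and the variable names are just for illustration.

```python
# Rough average of impacted machines per client.
# Both inputs are approximate figures quoted in this thread, not exact counts.
reported_devices = 8_500_000   # devices reported as affected
clients = 24_000               # approximate number of clients

per_client = reported_devices / clients
print(f"about {round(per_client)} impacted machines per client")
```

An average says nothing about the spread: some clients could have lost a handful of critical servers and others tens of thousands of workstations.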
My brother’s work uses VMs, so if the server is down there’s probably 50k computers right there. But it’s only 1 affected computer.
As far as I know, none of the OSes used for virtualization hosts at scale by any of the major cloud infra players are Windows.
Not to mention: any company that uses any AWS or Azure or GCP service is “using VMs” in one form or another (yes, I know I am hand waving away the difference between VMs and containers). It’s basically what they build all of their other services on.
Banks use VMs, and banks were down because staff couldn’t log in to the VMs they need to do their work. They were bricked by extension.
No, the clients were bricked. The VMs themselves were probably fine - and in fact, probably automatically rolled back to a working savepoint after the update failed (assuming the VM infrastructure was properly set up).
He couldn’t log in to the VM to access his work portals or emails. Call it what you will, but one bricked computer/server affected thousands.
It’s weird that you’re arguing, but asked how it was possible in the first place. VMs are the answer dude, argue all you want, but it’s making you look foolish for A not understanding, and B arguing against the answer. Also, why this one thread? Multiple other people told you the exact same thing. You just looking for an argument here or something?
No, but Hyper-V is used extensively in the SMB space.
VMware is popular for a reason, but it’s also insanely expensive if you only need an AD server and a file share.
I wonder if a large percentage of impact is internal facing systems.
And we won’t know until Monday.
That’s how supply chains work. A link in the chain is broken, the whole thing doesn’t work. Also, 10% of major companies being affected is still giant. But you’re here using online services, probably still buying bread, probably got fuel, probably playing video games. It’s huge in the media, and it had massive effects, but there are heaps of things that just weren’t even touched, including the channels the news spread on. TV news networks seemingly kept going well enough to report on it nonstop. Tbh though, any good continuity and disaster recovery plan should handle this: there’s impact, but the business keeps running.
The only companies I have seen with workable BCDR plans are banks, and that is because they handle money for rich people. It wouldn’t surprise me if many core banking systems are hyper-legacy as well.
I honestly think that a majority of our infrastructure didn’t collapse because of the lack of security controls and shitty patch management programs.
Sure. Compliance programs work for some aspects of business but since the advent of “the cloud”, BCDR plans have been a paperwork drill.
(There are probably some awesome places out there with quadruple-redundant networks with the ability to outlast a nuclear winter. I personally haven’t seen them though.)
It’s impossible to tell, and you’re probably closer to the truth than not.
One fact alone: BCDR isn’t an IT responsibility. Business continuity should be inclusive of things like: when your CNC machine no longer has power, what do you do? Cause 1: power loss. Process: get the diesel generator backup running following that SOP. Cause 2: broken. Process: get the mechanic over, or get the warranty action item list. Rely on the SLA for maintenance. Cause 3: network connectivity. Process: use USB following SOP.
I’ve been a part of a half dozen or more of these over time, which is not that many for over 200 companies I’ve supported.
I’ve even done simulations, round table “Dungeons and Dragons” style, with a person running the simulation, where different people have to follow the responsibilities in their documented process. Be it calling clients and customers and vendors, or alerting their insurance, or posting to social media, all the way through to the warehouse manager using a Biro and a ruler to record incoming and outgoing stock by hand until systems are operational again.
So I only mention this because you talk about IT redundancy, but business continuity is not an IT responsibility, although it has a role. It’s a business responsibility.
That kind of proves your point further: anyone who’s worked a decade without being part of a simulation, or at least contributing to improving one, has probably worked at companies that don’t do them. Which isn’t their fault, but it’s an indicator of how fragile businesses are and how little they are held accountable for it.
You aren’t wrong about my description. My direct experience with compliance is limited to small/medium tech companies where IT is the business. As long as there is an alternate work location and tech redundancy, the business can chug along as usual. (Data centers are becoming more rare so cloud redundancy is more important than ever.) Of course, there is still quite a bit that needs to be done depending on the type of emergency, as you described: It’s just all IT, customer and partner centric.
Unfortunately, that does make compliance an IT function because a majority of the company is in some IT engineering function, less sales and marketing.
I can’t speak to companies in different industries, whereas you can. When physical products and manufacturing are at stake, that is way out of scope for what I could deal with.
Hmm, yeah. Thanks for sharing. Because of 15 odd years of IT Managed Services, I only have non-technical companies on the brain and in my world view I hadn’t considered technology provider companies at all. They typically don’t need managed service providers (right or wrong :p).
It gets worse. Tech companies are service providers that typically work with a chain of other service providers. About 40%-50% of the controls for the last SOC2 audit I ran were carved out and deferred to our service providers. (Also, there are limited applicable frameworks: SOC2, PCI, ISO 27001, HIPAA and HITRUST are common for me, but usually related to cloud services.)
Yeah, I tend to break the brains of auditors that have never dealt with startups and have been used to Fortune 500 mega-companies. What’s funnier, is that I am just a lowly security engineer. A very experienced security engineer, but a lowly one nonetheless.
Auditor: So what is your documented process for this?
Me: Uhh, we don’t have one?
Auditor: What about when X or Y catastrophic issue happens?
Me: Anyone just pushes this button and activates that widget.
Auditor: Ok. Uh. Is that process documented?
Me: Nope. We probably do it about 2-3 times a week anyway.
Yeah, we do a lot around frameworks at my current place, and previously we worked directly with customers on the ISO and ACSC Essential Eight frameworks. For us, non-compliance = revenue opportunity. That means we are financially rewarded for aligning them and encouraged to do so. On that same note, I wrote up a checklist for “sysadmin best practices” aimed at driving reviews, checks and remedial opportunities for small businesses, which is useful in that space. I got an overwhelming amount of response in the MSP subreddit from people asking about it in DMs (not hundreds, just dozens, too many for me though). It’s quiet here on Lemmy. Happy to share my updated version of course, I just think if you’re dealing in your sector it’ll look like child’s play lol. But I kind of want to encourage a bit of community among professionals here. I just don’t want to spend time on it…
I feel you about the lowly experienced officer bit though. An account manager or business development manager, or even CTO won’t listen to me. I have a business degree, most of them don’t. I try to apply critical decision making in my solutions and risk advisory. But the words fall on deaf ears. I take a small but very guilty pleasure watching the very thing I warn against, happening both to clients and my employers. Especially when the prevention was trivial but all it needed was any amount of attention.
After nearly 20 years of IT and about 15 in MSP I’m so tired. I’m very much resonating with that “lowly engineer” comment.
CrowdStrike lives up to its name
Hey Crowdstrike…
That’s not imposter syndrome you’re feeling right now.
*crowdstrike
It’s not imposter syndrome for me either. At least I didn’t bring down millions of systems all across the world
My bad
Sorry about all those blue screens
This number seems quite low. My organisation alone would have had something like 3000 employee devices taken down. Since it happened on a day where most people WFH, there’s at least another thousand static devices in my building alone that may not have been in use at the time that will shit the bed tomorrow morning.
The same thing applies to our much larger sister companies interstate. So that’s another 6,000 or so devices.
The two largest energy retailers were affected too, so that’s another 5,000 devices at a conservative estimate.
Then there’s all the self-service checkouts that went down across Australia. I have no idea how many there are, but if every Coles and Woolworths has ten of them, that’s another ~40,000 devices.
That’s just the organisations that I am personally aware of as being affected in Australia and can get ballpark figures for.
Obviously Microsoft are getting their figures from the auto-reporting that happened on each crash, but it really does seem like it’s too low.
It’s beyond time to diversify our IT infrastructure. Enough with sticking everything “in the cloud” and paying for software (and devices!!) we don’t own.
So, those numbers all account for about 54,000 of the 8.5 million devices. Using fairly generous rounding, that still leaves approximately 8.5 million more devices.
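For reference, a quick sketch of how those ballpark figures add up against the reported total. The individual estimates are the ones given a few comments up; the roughly 1,000 idle static devices are left out, which is an assumption made here so the sum matches the 54,000 quoted. The labels are just illustrative.

```python
# Ballpark device counts from one commenter's estimate (not official figures).
estimates = {
    "own organisation, employee devices": 3_000,
    "sister companies interstate": 6_000,
    "two largest energy retailers": 5_000,
    "self-service checkouts": 40_000,
}
reported_total = 8_500_000  # reported number of affected devices

known = sum(estimates.values())
share = known / reported_total
remaining = reported_total - known

print(f"known: {known:,} ({share:.2%} of reported total)")
print(f"unaccounted for: {remaining:,}")
```

So one person’s rough tally covers well under 1% of the reported total, which is why rounding the remainder still gives about 8.5 million.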
A million is a lot.
Way to miss the point. That’s 54,000 that one person knows of across a small handful of organisations in one small country. I’m not even including the dozens more organisations I know were affected but can’t come up with a ballpark figure for.
Yknow I almost majored in IT/anything in that realm. Real glad I didn’t right now. And most other times, but especially right now.
If you had majored in IT you would know that this Crowdstrike thing is an easy, though somewhat tedious, fix. There are honestly far more annoying problems that IT people have to contend with.
Like justifying staffing and budgets. Fuck office politics.
I’m well aware that it’s not a complicated fix, I’m more than capable of doing it. Being a guy on an understaffed IT team in an office of hundreds right now sounds fucking miserable.
Not really. It’s a ton of overtime, the problem is not my fault, and no one can yell at me for taking too long because there’s no way to get it done faster.
If you want to talk about a giant pain in the ass look at what happens when a malicious virus runs rampant in an office. Then you have to clean each computer individually, sometimes having to wipe and reload whole machines. Which can take fucking hours because you have to update each computer after you do the wipe and reload. Even if you’re working from images there’s going to be at least a half a dozen updates if not more waiting to be redownloaded and reinstalled. And company bosses tend not to think it takes all that long to do that and therefore blame you for the delay in getting everyone up and running. So I’d rather them be mad at somebody else for the extreme downtime, like Crowdstrike.
How many systems in the world’s military went down, you know in war machines of Russia and Israel and Ukraine?
Those computers don’t have auto update enabled
Absolutely that. For networks that matter, patches are usually tested independently. While I wouldn’t trust the average military command to do patch testing, any civilian/corporate contractors absolutely would, because money. (Microsoft is likely at the top of that stack…)
There are other conditions as well. EDR infrastructure, if it exists, would need to be isolated on a “Government cloud” which is a different beast completely. Plus, there are different levels of networks, some being air-gapped.
CrowdStrike’s channel file updates were pushed to computers regardless of any settings meant to prevent such automatic updates, Wardle noted.
I work at an enterprise software company and have some well-known, security-conscious customers. The above is only true for us humans; if you have the money, you can dictate whatever the fuck you want.
Normally I would agree however this doesn’t appear to be a Microsoft update but a CrowdStrike update. Given that everyone is worried about ransomware etc.