[Maintenance] Feb 7 - Mastodon Data Migration

Crashdoom (he/him)@pawb.social · edit-2 2 years ago

Kovukono@pawb.social · 2 years ago

I wish you guys had posted that Ko-Fi link in the sidebar. I’d been looking for a way to donate for a bit, but I’ve been too lazy to contact you guys to find out.

Crashdoom (he/him)@pawb.social · 2 years ago

Didn’t even think to do that! Added to the sidebar :3

Crashdoom (he/him)@pawb.social · 2 years ago

Status update

Both instances are back online again! We’re currently transferring cached media from remote instances to the local storage, so avatars, emojis, and older attachments may currently appear as broken images.

As of Feb 7th at 10:45 PM Mountain Time, pawb.fun has re-generated all feeds, while furry.engineer is continuing with an estimated 25 minutes to go. We’re also re-generating the Elasticsearch indices which power the full-text search system and expect that to continue through the night.

proto_phantom@pawb.social · 2 years ago

It appears I’m still having some issues with pictures and media. It doesn’t seem to be limited to a specific instance or time, though.

Might this be due to either the migration, today’s issues or something else?

Draconic NEO@pawb.social · edit-2 2 years ago

Having issues with emojis on the pawb.fun instance: all external ones from other users appear broken. They show up as text and glitch when hovered over. I reached out on Mastodon earlier about this and thought I’d message here too.

LiquidParasyte@pawb.fun · 2 years ago

@Draconic_NEO @crashdoom ditto here, but it seems to be more of an issue with servers we regularly interact with, like tech.lgbt, and less so for uncertain others.

huxley@pawb.social · 2 years ago

Looks like furry.engineer is down?

Stefen Auris@pawb.social · 2 years ago

I’m seeing the same here, something about an Argo tunnel error. @crashdoom@pawb.social

Crashdoom (he/him)@pawb.social · 2 years ago

Aware and investigating!

Stefen Auris@pawb.social · 2 years ago

and that’s why you’re the best <3

liquidparasyte@pawb.social · edit-2 2 years ago

pawb.fun as well. Something got fucky wucky during the migration, it seems.

natebluehooves@pawb.social · 2 years ago

Correct! To give a bit of background while I wait for backups…

Last night we had what appears to be an out-of-memory error. Our Cloudflare tunnels broke around the same time that the internet went out (probably related), and we also didn’t have our nodes configured to keep some RAM reserved to allow Kubernetes to keep running. Additionally, we still only had 1 replica of the data for furry.engineer and pawb.fun that we were still building/downloading from other instances (mostly cached images).

So it was the perfect storm. Node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash. There’s only one copy of the data, so nothing offline to check for corruption against. All the storage with 2 replicas was unaffected.
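
For anyone curious how that chain reaction plays out, here is a rough toy model in Python. It is only a sketch: the node sizes, workload names, and the `admission_check` flag are made up for illustration and are not our actual cluster configuration or Kubernetes’ real scheduling logic. It just shows why moving everything from a dead node onto a survivor, with no memory held back, can take the survivor down too.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity_mb: int
    reserved_mb: int = 0  # memory held back for the orchestrator (hypothetical value)
    workloads: dict = field(default_factory=dict)  # workload name -> memory in MB
    crashed: bool = False

    def used_mb(self) -> int:
        return sum(self.workloads.values())


def pick_target(nodes, mb, admission_check):
    """Return a surviving node for a workload, or None if it must stay pending."""
    for node in nodes:
        if node.crashed:
            continue
        fits = node.used_mb() + mb <= node.capacity_mb - node.reserved_mb
        # Without the check, the workload is dumped on the first live node.
        if fits or not admission_check:
            return node
    return None


def run_cluster(nodes, admission_check):
    """Crash overloaded nodes and fail their workloads over until things settle."""
    pending = {}
    changed = True
    while changed:
        changed = False
        for node in nodes:
            if node.crashed or node.used_mb() <= node.capacity_mb:
                continue
            node.crashed = True  # out of memory: the whole node goes down
            changed = True
            orphans, node.workloads = node.workloads, {}
            for name, mb in orphans.items():
                target = pick_target(nodes, mb, admission_check)
                if target is None:
                    pending[name] = mb
                else:
                    target.workloads[name] = mb
    return [n.name for n in nodes if n.crashed], pending


def scenario(admission_check):
    # All numbers and names below are illustrative assumptions.
    nodes = [
        Node("node1", 8000, reserved_mb=1000,
             workloads={"mastodon-a": 5000, "cache-fetch": 4000}),
        Node("node2", 8000, reserved_mb=1000,
             workloads={"mastodon-b": 5000}),
    ]
    return run_cluster(nodes, admission_check)


if __name__ == "__main__":
    print(scenario(admission_check=False))  # both nodes end up crashed
    print(scenario(admission_check=True))   # node2 survives, orphans stay pending
```

In this toy version, refusing to place workloads that don’t fit leaves some services offline, but it keeps the second node (and whatever was already running on it) alive.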

I’ve done an announcement post on the Telegram channel to try and keep people apprised, but this restore is going to take another couple hours probably because I’m trying not to repeat my mistakes by setting things to 1 replica or skipping backups for expediency. My impatience pretty directly caused this issue.

Vincent Hayes@pawb.social · 2 years ago

SysAdmin lesson learned, always make the backups :3

natebluehooves@pawb.social · 2 years ago

Lessons do stick around when you have to learn the hard way!

Exec@pawb.social · 2 years ago

node 1 runs out of memory and basically crashes, node 2 then tries to pick up the services that are suddenly offline, immediately causing it to run out of memory and crash

Oof, that’s pretty much a cascading failure

natebluehooves@pawb.social · 2 years ago

Actually yes. Recovery was slow and painful, but I have policies in place to handle these failures now. I’m sure we will find another failure mode as we go forward!

Spitfire@pawb.social · 2 years ago

Moar storage!

Frosty@pawb.social · 2 years ago

Will the data be synched or backed up off site?

Crashdoom (he/him)@pawb.social · 2 years ago

Yes, we’ll be maintaining:

Multiple replicas across different disks (local)
Hourly and daily snapshots (local)
Regular off-site backups for disaster recovery

natebluehooves@pawb.social · 2 years ago

Local hardware horse here!

To elaborate a bit, the storage replicas will span three physical servers in real time, all of which get snapshots hourly in case we need a rollback, and full backups weekly to a fourth system on mechanical drives with 2-disk failure tolerance. This should mean that data loss requires 4 simultaneous system failures.
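
As a quick sanity check on the “4 simultaneous failures” figure, here is a small Python sketch that brute-forces it. The system names are hypothetical, and it makes a simplifying assumption: data counts as surviving if at least one replica server is up, or if the backup system is up with no more than 2 failed disks. It ignores snapshots and the fact that a weekly backup may be missing changes made since its last run.

```python
from itertools import combinations

# Hypothetical names for the three replica servers and the backup system.
REPLICA_SERVERS = ("replica-1", "replica-2", "replica-3")
BACKUP_SYSTEM = "backup-1"
BACKUP_DISK_TOLERANCE = 2  # the backup array tolerates up to 2 failed disks


def data_survives(failed_systems, failed_backup_disks=0):
    """True if at least one copy of the data is still readable."""
    live_replica = any(s not in failed_systems for s in REPLICA_SERVERS)
    backup_ok = (
        BACKUP_SYSTEM not in failed_systems
        and failed_backup_disks <= BACKUP_DISK_TOLERANCE
    )
    return live_replica or backup_ok


def min_failures_for_loss():
    """Smallest number of whole-system failures that loses every copy."""
    systems = REPLICA_SERVERS + (BACKUP_SYSTEM,)
    for k in range(len(systems) + 1):
        for failed in combinations(systems, k):
            if not data_survives(set(failed)):
                return k
    return None


if __name__ == "__main__":
    print(min_failures_for_loss())  # 4 under the assumptions above
```

Losing all three replica servers while the backup array has also lost a third disk would count as data loss in this model too, which is why the disk tolerance is tracked separately.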

We have a tape library for automated tape backups, but can’t afford a drive upgrade just yet to make it make sense. The drives are often several thousand dollars, but the tape media is cheap.

Offsite backups are currently in the works, though if anyone has recommendations I would love to add them to our list for consideration.

If anyone has additional questions or suggestions I would be happy to answer tomorrow!

liquidparasyte@pawb.social · 2 years ago

Sorry to necro this, but a few content issues are still lingering. A lot of previously fetched posts from the last 30 days still don’t load, and on our side the emoji from some instances refuse to load (most notably tech.lgbt).