Update Regarding Pawb.Social Media Loss

TL;DR: ~60% of media data was recovered by retrieving cached images from Cloudflare and scraping The Wayback Machine. Over the coming days and weeks, we will work on restoring this data.

Greetings everyone! I’m u/Southernwolf, the Moderator (technically Admin since it’s on Lemmy) mentioned in the previous post by u/Crashdoom. I wanted to provide an update on the data retrieval I was working on, and provide details on what it will take for us to get the recovered media data back online.

Initially, using a script (created with the help of Qwen AI) to retrieve cached media data from Cloudflare, I was able to recover ~33% of the lost media. That by itself is honestly not that bad, given the cache was already starting to decay. It required using a VPN to hop between locations around the globe, but ultimately that is what allowed me to recover as much media as I did from the CF cache alone.

However, at the recommendation of @arcanicanis@were.social, I modified my script with Qwen to scrape the Wayback Machine for the rest of the missing images. This took a while, as I couldn’t do more than one request every 2 seconds without hitting their rate limit, but after some 5 hours this was complete. As a result, this is the final tally of the recovered media:
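For anyone curious what a rate-limited Wayback Machine lookup can look like, here is a minimal Python sketch. To be clear, this is not the actual script I used: the function names (`closest_snapshot_url`, `find_snapshots`) are made up for illustration, and it assumes the Wayback Machine availability endpoint at `https://archive.org/wayback/available`. The only detail taken from this post is the 2 second gap between requests.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: the Wayback Machine availability endpoint is used for lookups.
AVAILABILITY_API = "https://archive.org/wayback/available"
# From the post: no more than one request every 2 seconds.
MIN_INTERVAL_S = 2.0


def closest_snapshot_url(api_response_text):
    """Return the closest archived snapshot URL from a response, or None."""
    data = json.loads(api_response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def http_get_text(url):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")


def find_snapshots(urls, fetch=http_get_text, sleep=time.sleep,
                   clock=time.monotonic):
    """Look up each URL, waiting so requests start at least 2 seconds apart."""
    results = {}
    last_request = None
    for url in urls:
        if last_request is not None:
            wait = MIN_INTERVAL_S - (clock() - last_request)
            if wait > 0:
                sleep(wait)
        last_request = clock()
        query = urllib.parse.urlencode({"url": url})
        body = fetch(f"{AVAILABILITY_API}?{query}")
        results[url] = closest_snapshot_url(body)
    return results
```

The `fetch`, `sleep`, and `clock` parameters exist only so the pacing logic can be tried without touching the network. A URL that maps to `None` has no snapshot and would count as missing.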

Recovery report generated:
  - Total entries in CSV: 6697
  - Images recovered: 4080
  - Images missing: 2617
  - Recovery rate: 60.92%
  - Total size: 2953.08 MB (2.88 GB)
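
The numbers in the report follow from simple arithmetic, which the short Python sketch below reproduces. The function name `format_report` is hypothetical, and it assumes the GB figure is the MB figure divided by 1024, which matches the 2.88 GB shown above.

```python
def format_report(total_entries, recovered, size_mb):
    """Build report lines like the ones above from three input numbers."""
    missing = total_entries - recovered
    rate = 100.0 * recovered / total_entries if total_entries else 0.0
    size_gb = size_mb / 1024  # assumption: 1 GB = 1024 MB
    return [
        "Recovery report generated:",
        f"  - Total entries in CSV: {total_entries}",
        f"  - Images recovered: {recovered}",
        f"  - Images missing: {missing}",
        f"  - Recovery rate: {rate:.2f}%",
        f"  - Total size: {size_mb:.2f} MB ({size_gb:.2f} GB)",
    ]


print("\n".join(format_report(6697, 4080, 2953.08)))
```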

Honestly, this is a phenomenal result! It’s not perfect, but it is far more than I ever expected could be recovered, and I am more than satisfied with rescuing that much of the lost media.

Now, with the media we have recovered, the process will turn to actually getting the images plugged back into the instance. This won’t necessarily be a simple process, due to the nature of how Pict-rs (the media database that Lemmy relies on) works. One can’t easily insert images back into it, as it uses rather large hash trees to store everything, so we will have to investigate ways to work around this. There are some potentially simple solutions (such as just making endpoints manually for the images and hoping it doesn’t break Pict-rs) and some rather complex ones (such as switching our media database over to an entirely different system such as Postgres).

Whichever solution turns out to work best will determine how long it takes to get the lost media back online. But you can expect a wait of likely several days at minimum, to possibly a few weeks. Once we have an idea of what will work, another update will be posted to let our users know.

Thank you for your patience with us as we work to fix this issue!