Google will no longer back up the Internet: Cached webpages are dead

BrikoX@lemmy.zip · 2 years ago

MataVatnik@lemmy.world · 2 years ago

People don’t realize how ephemeral information is. How much information from the internet do you think will survive 200 years from now? My guess is not very much. Also, all the digitized documents, which in another age would have been on paper, are now magnetic bits on a hard drive that have to be refreshed and copied to survive.

HAL_9_TRILLION@lemmy.dbzer0.com · 2 years ago

People don’t realize how ephemeral information is. How much information from the internet do you think will survive 200 years from now?

On the one hand, what a tragedy. On the other hand, thank fuck.

MataVatnik@lemmy.world · 2 years ago

Right?

Car@lemmy.dbzer0.com · 2 years ago

It’s an interesting thought experiment. We could preserve specific data if we cared to. But as others have echoed, with dynamic content delivery systems, editable forum and social media posts, and in some cases, the ability to petition companies to delete your online persona… all of these mean that storing snapshots becomes a more complex problem.

So far we have storage media which is probably good for 100 years or so before the physical medium begins to degrade. We then have to ensure that connections (physical plugs, protocols) are maintained or available 100 years from now. Offline cold storage sites exist but aren’t storing information to preserve human history. Any data that’s been overwritten or lost to dead links on the web may be sitting on a tape in a warehouse somewhere, but unless you know where to look and have the right credentials, it might as well be lost to time.

/home/pineapplelover@lemm.ee · 2 years ago

Cached webpages were lowkey clutch. They helped with some Reddit threads where the posts I needed for tech help had been deleted.

EmergMemeHologram@startrek.website · 2 years ago

While sucky, this feels inevitable.

LLMs and the massive wave of spam coming out right now make caching content way more expensive, and Google gains no value from it. Long-tail spam attacks have already been strangling Google lately.

I think the only way to run a search engine in the mid 2020s is to download the data, process the page in memory, extract to metadata+embeddings and store only those. There’s no value in storing the rendered page offline for later analysis since you’re likely not doing that later analysis.
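To make that pipeline concrete, here is a minimal sketch of "process the page in memory, keep only metadata + embeddings". It is an illustration under loud assumptions, not how Google or any real search engine does it: `index_page`, `IndexRecord`, and `toy_embedding` are hypothetical names, and the "embedding" is just a hashed bag-of-words vector standing in for a real model. The download step is left out, so the page arrives as a string.

```python
import hashlib
import math
from dataclasses import dataclass
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the <title> and the visible words of a page, in memory."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.words = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        if self._in_title:
            self.title += data.strip()
        else:
            self.words.extend(data.lower().split())


def toy_embedding(words, dim=16):
    """Hashed bag-of-words vector, L2-normalised (stand-in for a real model)."""
    vec = [0.0] * dim
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


@dataclass
class IndexRecord:
    """Everything that is kept; the rendered page itself is not stored."""
    url: str
    title: str
    word_count: int
    embedding: list


def index_page(url, html):
    """Parse the page in memory and return only metadata + embedding."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return IndexRecord(url, parser.title, len(parser.words),
                       toy_embedding(parser.words))
```

Calling `index_page(url, html)` returns a small record and the HTML string is simply dropped afterwards, which is the whole point: there is nothing left to serve as a "cached page".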

Hopefully the Internet Archive can fare better, since it’s curated by humans and stores data infrequently, only when it’s important, whereas Google needs to scan a lot of info frequently with nearly no human input.