Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI models.
The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.
A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.
While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.
Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.
Fuck the planet, we need another one of those useless chatbots.
Just another billion parameters bro! I swear if we add another billion it’ll fix everything!
the chatbots are there for them to pretend they’re doing something useful for the end user, instead of just building an ever more detailed digital profile of each individual, thousands of data points deep, in order to separate you from your money
Ugh, fuck these and their tech bro creators so much. Not only is “AI” enshittifying everything it touches, it’s even passively fucking up things it can’t touch.
With the line needlessly blurring between search engines and LLM models, and sites rightfully blocking AI scraper bots, I fully believe we’re on the cusp of a digital dark age. If you think search engines suck now, just wait until very little of the quality content on the internet is indexable because people don’t want it scraped for training data. Or if it is indexed, the actual content is locked up, requiring registration or otherwise no longer being easily accessible.
These “AI” tech bros are basically strip mining the internet while shitting where they eat (and maybe also pissing in the pool if I haven’t mixed enough metaphors for your liking). They’re exploiting what makes the internet great while simultaneously ruining it for the future.
For as long as search engines have existed, we had a deal going: search providers could crawl and index site data and show ads to support themselves and in exchange, sites gained visibility. Now they’re using those same scrapers to steal content for their own purposes while depriving the sources of traffic. They have broken the deal, and with it, the fundamental way the internet has worked for over 30 years.
I say it again: Fuck these AI-pushing tech bros and the horses they rode in on.
strip mining the internet
That’s such a wonderfully succinct way to describe the arc of tech companies over the last decade and a half.
And even earlier than that, I miss the days of actually “surfing” the net. Start on one page you know and get farther and farther down into webrings and personal pages linking to each other. Could really find some awesome things tucked away way back when.
These hypocritical assholes don’t want people accessing their own data on these websites, and lay claim to it themselves. Now they want to steal others’ data too.
It would make my day on the day they get sued into oblivion for data theft.
Not just data theft, but selling stolen goods (more or less).
They’re stealing content and using that to build a service that they sell and profit from.
The AI cat is out of the bag. How do they know they’re not feeding AI generated garbage into their models?
Actually I think I’m gonna go in my personal website and add 200 pages of locally generated LLM garbage with hidden links to those pages that only bots should follow.
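For anyone curious what that would actually look like: here’s a throwaway Python sketch of the idea. Everything in it is made up for illustration (the filler word list, the `decoy/` directory, the page count) — a real setup would also want a server-side rule so humans never land on these pages.

```python
import os
import random

# Made-up buzzword filler; a real tarpit might use locally generated LLM text
WORDS = ["synergy", "blockchain", "paradigm", "holistic", "quantum",
         "leverage", "disrupt", "ecosystem", "pivot", "scalable"]

def make_garbage_page(page_id: int, total: int, out_dir: str = "decoy") -> str:
    """Write one decoy HTML page that links on to the next, forming a chain
    a crawler can wander down indefinitely."""
    os.makedirs(out_dir, exist_ok=True)
    filler = " ".join(random.choices(WORDS, k=300))
    next_id = (page_id + 1) % total  # chain pages in a loop
    html = (
        f"<html><body><p>{filler}</p>"
        f'<a href="page{next_id}.html">more</a>'
        "</body></html>"
    )
    path = os.path.join(out_dir, f"page{page_id}.html")
    with open(path, "w") as f:
        f.write(html)
    return path

# An entry link you'd hide in a real page: invisible to humans,
# but naive bots that ignore CSS will follow it into the decoys
hidden_link = '<a href="decoy/page0.html" style="display:none">.</a>'

for i in range(5):  # the comment says 200; 5 keeps the demo quick
    make_garbage_page(i, 5)
```

The irony being, of course, that any scraper respecting robots.txt could be told to skip `decoy/` — it’s the ones that ignore it that get fed the garbage.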
How do they know they’re not feeding AI generated garbage into their models?
They don’t. Any popular place on the internet that lets users type text for people to publicly view is now full of AI trash. They’ve fucked it; this shit is just gonna spiral into progressively worse garbage.
Does that mean this new bot is ignoring sites’ robots.txt files? The Internet works because of web crawlers, and I’m not sure how this one is different
Edited to add: Apparently one would need to add Meta-ExternalAgent to their robots.txt file unless they had a wildcard rule, so this isn’t as widely blocked simply by virtue of being new. Letting it run for a few months before letting anyone know it exists is kinda shady.
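For reference, blocking it looks something like this in robots.txt — the agent name comes from the article, and the GPTBot entry (also mentioned there) is included for comparison:

```text
# Block Meta's AI scraper
User-agent: Meta-ExternalAgent
Disallow: /

# Block OpenAI's scraper
User-agent: GPTBot
Disallow: /
```

Whether the bot actually honors it is another question, as people further down this thread point out.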
Crawling the web has fuck all to do with the function of the internet. Most crawlers range from useless to downright disrespectful.
Have you used a search engine? Crawlers are not generative AI.
The internet is not a search engine, and no, search engines are not generative AI. That’s new.
Do you have any idea how many content bot crawlers there are? Most of the corporate sites I host at work are serving content to bots more than half the time.
Do you know AltaVista still has bots??
When was the last time you used that search engine?
I guess I don’t really see the problem with that though. There are configuration levers you could be pulling, but those sites you’re hosting are not. There are lots of shady questions about how these models are getting training data, but crawlers have a well defined opt out mechanism.
The web would not be what we know it as without them, because it’s how you find sites. Why shouldn’t AltaVista have one? I don’t object to what AltaVista does with the data.
Mate we have absurdly restrictive robots.txt including a custom WordPress plugin that automatically generates the file and the bots don’t give a fuck.