A new web crawler launched by Meta last month is quietly scraping the web for AI training data

lemme in@lemm.ee · 2 months ago

GarrulousBrevity@lemmy.world · edit-2 2 months ago

Does that mean this new bot is ignoring sites’ robots.txt files? The Internet works because of web crawlers, and I’m not sure how this one is different.

Edited to add: Apparently one would need to add Meta-ExternalAgent to their robots.txt file unless they already had a wildcard rule, so this bot isn’t as widely blocked yet, simply because it’s new. Letting it run for a few months before letting anyone know it exists is kinda shady.
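
To make the wildcard point concrete, here is a rough sketch using Python's standard `urllib.robotparser`. The three robots.txt bodies, the `is_allowed` helper, the `example.com` URL, and the use of `GPTBot` as the "some other named bot" are all assumptions for illustration, not taken from any real site. It only shows what a compliant parser would conclude; it says nothing about whether a given crawler actually honors the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/page") -> bool:
    """Return True if `user_agent` may fetch `url` under this robots.txt text.

    Hypothetical helper: parses the text in memory, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A file that only names one specific bot (illustrative name).
NAMED_ONLY = """\
User-agent: GPTBot
Disallow: /
"""

# A file with a wildcard rule that covers every user agent.
WILDCARD = """\
User-agent: *
Disallow: /
"""

# The named-only file with an explicit entry added for the new crawler.
EXPLICIT = NAMED_ONLY + """
User-agent: Meta-ExternalAgent
Disallow: /
"""

if __name__ == "__main__":
    for label, body in [("named only", NAMED_ONLY),
                        ("wildcard", WILDCARD),
                        ("explicit", EXPLICIT)]:
        allowed = is_allowed(body, "Meta-ExternalAgent")
        print(f"{label}: Meta-ExternalAgent allowed = {allowed}")
```

Under these assumptions, only the file that names other bots without a wildcard leaves Meta-ExternalAgent allowed, which is why a brand-new user agent slips past per-bot block lists.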

Aniki 🌱🌿@lemmings.world · 2 months ago

Crawling the web has fuck all to do with the function of the internet. Most crawlers range from useless at best to downright disrespectful.

GarrulousBrevity@lemmy.world · edit-2 2 months ago

Have you used a search engine? Crawlers are not generative AI.

Aniki 🌱🌿@lemmings.world · edit-2 2 months ago

The internet is not a search engine, and no, search engines are not generative AI. That’s new.

Do you have any idea how many content bot crawlers there are? Most of the corporate sites I host at work are serving content to bots more than half the time.

Do you know AltaVista still has bots?

When was the last time you used that search engine?

GarrulousBrevity@lemmy.world · 2 months ago

I guess I don’t really see the problem with that though. There are configuration levers those sites could be pulling, but the ones you’re hosting aren’t pulling them. There are lots of shady questions about how these models are getting training data, but crawlers have a well-defined opt-out mechanism.

The web would not be what we know it as without them, because it’s how you find sites. Why shouldn’t AltaVista have one? I don’t object to what AltaVista does with the data.

Aniki 🌱🌿@lemmings.world · 2 months ago

Mate, we have an absurdly restrictive robots.txt, including a custom WordPress plugin that automatically generates the file, and the bots don’t give a fuck.