Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI model.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

  • GarrulousBrevity@lemmy.world · 2 months ago (edited)

    Does that mean this new bot is ignoring sites’ robots.txt files? The Internet works because of web crawlers, and I’m not sure how this one is different.

    Edited to add: Apparently one would need to add Meta-ExternalAgent to their robots.txt file unless they had a wildcard rule, so this one isn’t as widely blocked simply because it is new. Letting it run for a few months before letting anyone know it exists is kinda shady.
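
To illustrate the point about wildcard rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The two robots.txt bodies and the example.com URL are made up for illustration; only the `GPTBot` and `Meta-ExternalAgent` tokens come from the discussion above. It shows what a compliant parser would conclude, not what any particular crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names GPTBot only
# (e.g. written before the new bot was publicly known).
NAMED_ONLY = """
User-agent: GPTBot
Disallow: /
"""

# Hypothetical robots.txt with a wildcard group covering every crawler.
WILDCARD = """
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/article") -> bool:
    """Return True if a compliant parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for label, body in (("named-only", NAMED_ONLY), ("wildcard", WILDCARD)):
        for agent in ("GPTBot", "Meta-ExternalAgent"):
            print(label, agent, is_allowed(body, agent))
```

Under these assumptions, the named-only file lets `Meta-ExternalAgent` through while the wildcard file blocks both agents. Adding a separate `User-agent: Meta-ExternalAgent` group would close the gap, but only for crawlers that choose to honor robots.txt.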

    • Aniki 🌱🌿@lemmings.world · 2 months ago

      Crawling the web has fuck all to do with the function of the internet. Most crawlers range from useless at best to downright disrespectful.

        • Aniki 🌱🌿@lemmings.world · 2 months ago (edited)

          The internet is not a search engine, and no, search engines are not generative AI. That’s new.

          Do you have any idea how many content bot crawlers there are? Most of the corporate sites I host at work are serving content to bots more than half the time.

          Do you know AltaVista still has bots??

          When was the last time you used that search engine?

          • GarrulousBrevity@lemmy.world · 2 months ago

            I guess I don’t really see the problem with that though. There are configuration levers you could be pulling, but the sites you’re hosting aren’t using them. There are lots of shady questions about how these models are getting training data, but crawlers have a well-defined opt-out mechanism.

            The web would not be what we know without them, because crawlers are how you find sites. Why shouldn’t AltaVista have one? I don’t object to what AltaVista does with the data.

            • Aniki 🌱🌿@lemmings.world · 2 months ago

              Mate, we have an absurdly restrictive robots.txt, including a custom WordPress plugin that automatically generates the file, and the bots don’t give a fuck.