How to block bots, scrapers and AI crawlers


The rise of AI bots, scrapers and crawlers that collect web data to train models has raised concerns among content creators.

Many of these tools operate without transparency, sometimes posing as legitimate browsers, and creators have little control over how their content is used.

How can you protect your content from these bots?

Several tools let you block AI bots with a single click. Here, however, we will see how to do it manually by editing your physical robots.txt file.

If you need to know what robots.txt is, this article from Cloudflare is very good: What is robots.txt? | How a robots.txt file works

To block known bots, simply add the appropriate instructions to your robots.txt file using your hosting’s file manager.

Instructions to add to the robots.txt file

An open list of web crawlers associated with AI companies and LLM training, along with guidance on blocking them, can be found in this GitHub repository. See also the information about the listed crawlers and the FAQ.

Copy the instructions to your robots.txt file.

As of the publication date of this post, the instructions to add are:

User-agent: Amazonbot
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: FriendlyCrawler
User-agent: GPTBot
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: YouBot
User-agent: anthropic-ai
User-agent: cohere-ai
User-agent: facebookexternalhit
User-agent: img2dataset
User-agent: omgili
User-agent: omgilibot
Disallow: /
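
Once the rules are in place, it can be useful to confirm that they parse the way you expect. The following is a minimal sketch, not an official tool: it uses Python's standard urllib.robotparser module on a shortened excerpt of the list above. The helper name is_blocked and the example.com URL are assumptions made for illustration. Keep in mind that this only checks what the file says; it cannot tell you whether a given crawler actually obeys it.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the rules above; the full list behaves the same way.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url.

    The function name and the default URL are hypothetical, chosen only
    for this example.
    """
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        status = "blocked" if is_blocked(ROBOTS_TXT, agent) else "allowed"
        print(agent, status)
```

To check your real file, you could read its content from disk and pass it to is_blocked in place of the excerpt.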

Is there a plugin to do this?

Yes, there is. Block AI Crawlers blocks known AI bots, scrapers, and crawlers. While the plugin adds these flags, it is the crawlers’ own responsibility to respect these requests.

The plugin adds directives to your robots.txt file to tell AI crawlers not to index your site. It also adds the noai meta tag to your site header to do the same.

You can add the noai meta tag by creating a new script using the Insert Headers and Footers Code – HT Script plugin with the following instruction:

<meta name="robots" content="noai, noimageai" />
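
To confirm that the tag really ends up in your page header, you can inspect the HTML your site serves. The sketch below is one possible way to do that with Python's standard html.parser module. The class name RobotsMetaParser and the function robots_directives are hypothetical names introduced for this example, and it assumes Python 3.9 or later and that you already have the page's HTML as a string.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also reach this method.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set[str]:
    """Return the set of robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<head><meta name="robots" content="noai, noimageai" /></head>'
    found = robots_directives(page)
    print("noai" in found and "noimageai" in found)
```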

The Block AI Crawlers plugin has the advantage of keeping the list up to date, but it only works if you use the WordPress virtual robots.txt. If you have a physical robots.txt file on your web server, you won’t be able to activate this plugin. (We haven’t tested it with the Yoast SEO plugin, which also creates a virtual robots.txt file, but it is likely to work as well.)

Why block AI bots

Cloudflare gives compelling reasons to use an AI crawler blocking mechanism on your website: Declare your independence: Block AI bots, scrapers, and crawlers with a single click.

At Blogpocket, we believe that this is a reasonable option if you want to protect your content for ethical reasons. This is stated in the manifesto that we signed, We use AI responsibly. We understand that training AI models without user consent conflicts with users’ right to privacy, among other things.

We are therefore in favour of an ethical and responsible use of AI and, in this sense, we have implemented AI bot blocking in Blogpocket as explained in this article.

ChatGPT (less than 10%) was used to write this post. Images were generated using Copilot Designer’s AI. At Blogpocket, we believe in ethical and responsible use of AI.

