The rise of AI bots, scrapers and crawlers that collect web data to train models has raised concerns among content creators.
Many of these tools operate without transparency, sometimes posing as legitimate browsers, and give site owners little control over how their content is used.
How can you protect your content from these bots?
There are several tools that let you block AI bots with a single click; here, however, we are going to see how to do it manually by editing your physical robots.txt file.
If you need to know what robots.txt is, this article from Cloudflare is very good: What is robots.txt? | How a robots.txt file works
To block known bots, simply add the appropriate instructions to your robots.txt file using your hosting’s file manager.
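For example, a minimal rule that blocks a single crawler (here GPTBot, OpenAI's training crawler, one of the agents listed below) from the entire site looks like this:

User-agent: GPTBot
Disallow: /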
Instructions for adding to the robots.txt file
An open list of web crawlers associated with AI companies and LLM training, together with guidance on blocking them, can be found in this GitHub repository, which also includes information about the listed crawlers and an FAQ.
Copy the instructions to your robots.txt file.
As of the date of publication of this post, the instructions to add are:
User-agent: Amazonbot
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: Diffbot
User-agent: FacebookBot
User-agent: FriendlyCrawler
User-agent: GPTBot
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: OAI-SearchBot
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: YouBot
User-agent: anthropic-ai
User-agent: cohere-ai
User-agent: facebookexternalhit
User-agent: img2dataset
User-agent: omgili
User-agent: omgilibot
Disallow: /
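Note that consecutive User-agent lines form a single group sharing the final Disallow: / rule, which blocks every listed crawler from the whole site. Once the file is saved, you can check that the rules parse as intended with Python's standard-library robot parser (a minimal sketch; example.com is a placeholder for your own domain):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# Every crawler blocked above should print False
for agent in ("GPTBot", "ClaudeBot", "CCBot"):
    print(agent, rp.can_fetch(agent, "https://example.com/"))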
Is there a plugin to do this?
Of course there is. Block AI Crawlers blocks known AI bots, scrapers, and crawlers. While the plugin adds these flags, it is the responsibility of the crawlers themselves to respect these requests.
The plugin adds directives to your robots.txt file to tell AI crawlers not to index your site. It also adds the noai meta tag to your site header to do the same.
You can also add the noai meta tag yourself by creating a new script with the Insert Headers and Footers Code – HT Script plugin containing the following instruction:
<meta name="robots" content="noai, noimageai" />
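Once that snippet is active, the tag should appear inside the <head> of every page, so you can confirm it with your browser's view-source. A simplified sketch of what to look for:

<head>
  ...
  <meta name="robots" content="noai, noimageai" />
</head>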
The Block AI Crawlers plugin has the advantage of keeping its blocklist updated, but it only works if you use the WordPress virtual robots.txt. If you have a physical robots.txt file on your web server, you won’t be able to activate this plugin. [We haven’t tested it with the Yoast SEO plugin, which also creates a virtual robots.txt file, but it’s likely to work as well.]
Why block AI bots
Cloudflare gives compelling reasons to use an AI crawler blocking mechanism on your website: Declare your independence: Block AI bots, scrapers, and crawlers with a single click.
At Blogpocket, we believe this is a reasonable option if you want to protect your content for ethical reasons. This is spelled out in the manifesto we signed, We use AI responsibly. We understand that training AI models without users’ consent conflicts with their right to privacy, among other things.
We are therefore in favour of an ethical and responsible use of AI and, in this sense, we have implemented AI bot blocking on Blogpocket as explained in this article.
ChatGPT (less than 10%) was used to write this post. Images were generated using Copilot Designer’s AI. At Blogpocket, we believe in ethical and responsible use of AI.