AI training is no longer a tech experiment you read about in the news. It’s happening in real time, and AI crawlers are probably scanning your writing right now.
Any blog posts, chapter drafts, or articles you’ve published or shared online are precisely the sort of material AI crawlers vacuum up every minute of the day.
I know this because my Cloudflare stats show that my site is hit between five and eight thousand times every day by AI crawler bots.
If AI bots are accessing my writing at this rate, there’s a very good chance they’re doing the same with yours.
Good bots, bad bots, and a broken deal
Crawler bots are not all the same. They range from useful to annoying to downright dangerous.
For many years, you probably had no problem with good bots like Googlebot or Bingbot crawling your site, because when they indexed your content, they sent you traffic in return.
It was a kind of unwritten deal or understanding. You were happy to let them scan your site and copy some of your content, and in return, they helped readers find your site.
It was a happy balance that helped site owners grow their traffic and audience.
But the way AI crawlers use your content has broken that long-standing deal.
Without your permission, they are scooping up your words in massive amounts and giving you nothing in return: no rankings, no referrals, no visibility.
They pull your writing into training datasets without giving you any credit, and they rarely, if ever, link back to your original content.
In other words, these bots are using your work without giving you the fair exchange that used to exist with search engines.
That leaves you, your writing, and your site in a bind, because AI bots aren’t all the same.
If you try to block them, you could lose what’s left of your visibility on search engines and the possibility of new traffic from AI tools like ChatGPT and Perplexity. Allow them, and your words become fuel for AI training data for nothing much in return.
Proof AI training is using your writing (My data)
I check my Cloudflare dashboard almost every day, and it’s always an eye-opener.
The AI Crawl Control tab usually lists between 5,000 and 8,000 hits per day from bots such as Claude, Perplexity, ChatGPT, and, of course, Google and Bing. And it’s not a spike; it’s regular daily activity.
That’s thousands of requests every day, just from AI bots, scanning through all the pages of my site, and without my permission or any option to say no, thank you.
Here’s what my data looks like today, which is a little quieter than usual.
Some bots clearly identify themselves with names like GPTBot, ClaudeBot, or PerplexityBot, which appear in my logs.
Here’s a clearer list of the AI bots and the number of times they hit my site today.
However, many other bots aren’t as transparent. Some disguise themselves as ordinary traffic, which means the true numbers are even higher.
In fact, I discovered a new one today that WordFence caught after it had hit my site over 400 times in under ten minutes.
If this is what is happening on my site, it’s almost certain the same thing is happening on yours.
And it’s not just websites that are vulnerable. Writers naturally wonder about ebooks too, but this is tricky.
No, AI bots can’t crawl Kindle ebooks because they are behind a paywall.
However, they can access the preview content, as well as any pirated copies, which are widespread on the Internet.
Only one thing is certain: AI crawlers are harvesting your words without permission, and your writing is now part of the massive training sets fuelling the next versions of AI.
How to check if AI is using your writing
If you’re wondering whether AI bots are accessing your site for content scraping, there are a few ways to check.
For Cloudflare users, the AI Crawl Control tab will show you which AI bots are hitting your site and how many times. You can also block them from the same page, but it’s anyone’s guess how effective it is.
Without access to Cloudflare, you can usually find AI crawlers by checking the server logs on your hosting server. Search for names like GPTBot, ClaudeBot, or PerplexityBot. Even searching using the word “Bot” will work.
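If you’d rather not search by hand, a short script can do the counting. The sketch below is only an illustration: it assumes your host gives you a plain-text access log with the user agent somewhere on each line, and the function name, the sample lines, and the user-agent strings are all invented for the example. Only the three bot names come from this article.

```python
from collections import Counter

# Bot names mentioned in this article; extend the tuple with others you spot.
AI_BOT_NAMES = ("GPTBot", "ClaudeBot", "PerplexityBot")


def count_ai_bot_hits(log_lines, bot_names=AI_BOT_NAMES):
    """Count how many log lines mention each bot name (case-insensitive)."""
    counts = Counter()
    for line in log_lines:
        lowered = line.lower()
        for name in bot_names:
            if name.lower() in lowered:
                counts[name] += 1
                break  # attribute each request to one bot only
    return counts


# Invented sample lines for illustration only; real log formats vary by host.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:10:00:00 +0000] "GET /post-1/ HTTP/1.1" 200 512 "-" "ExampleAgent (compatible; GPTBot)"',
    '192.0.2.11 - - [01/Jan/2025:10:00:05 +0000] "GET /post-2/ HTTP/1.1" 200 734 "-" "ExampleAgent (compatible; ClaudeBot)"',
    '192.0.2.10 - - [01/Jan/2025:10:00:09 +0000] "GET /post-3/ HTTP/1.1" 200 420 "-" "ExampleAgent (compatible; GPTBot)"',
    '192.0.2.12 - - [01/Jan/2025:10:00:12 +0000] "GET /about/ HTTP/1.1" 200 301 "-" "ExampleBrowser/1.0"',
]

# Print the bots found, busiest first.
for name, hits in count_ai_bot_hits(SAMPLE_LOG).most_common():
    print(f"{name}: {hits}")
```

To run it against a real log, open the file and pass the file object to count_ai_bot_hits in place of SAMPLE_LOG. Where that file lives depends on your host. Bear in mind that this only counts bots that identify themselves, so it will miss any that disguise their user agent.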
Here’s a quick grab from my logs, searching for the Claude bot. You can see the bot hit my server 158 times in 24 hours.
Another option is to see if your writing appears in leaked or publicly available AI datasets.
The most infamous example is Books3, a huge collection of pirated books that was used for early AI training data.
If you know your way around datasets, you can search lists of titles to see if your book appears.
If you’re active on publishing platforms, blogs, social media, or forums, you can assume that your content has been crawled.
AI bots don’t see any difference between a professional essay, a personal blog post, or a quick social media post because they scrape anything and everything they can find.
Unfortunately, there are few enforceable rules specifically governing AI bots and data scraping. It’s still largely a free-for-all that ignores copyright, privacy, and data protection.
Should you try to block AI crawlers?
If you want to stop some or all AI bots from training on your writing, there are a few options you could try.
For Cloudflare users, it’s relatively easy. You can block specific AI crawlers directly from the AI Crawl Control tab by using the slider to set Block or Allow.
For WordPress sites, security plugins like WordFence can help you spot unusual bot activity and block suspicious traffic.
You can also use robots.txt directives or .htaccess server rules to tell bots not to crawl your site.
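As a rough illustration of the robots.txt approach, the sketch below holds a sample set of directives that asks three AI crawlers to stay away while leaving everyone else alone, then uses Python’s standard urllib.robotparser module to check how a rule-following crawler would read them. The bot names are the ones mentioned in this article. The is_allowed helper and the example.com address are placeholders for the example, and it’s an assumption that these user-agent names are the ones each crawler actually matches against.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content: ask three AI crawlers to stay out,
# and leave all other bots (search engines included) unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder address; substitute a page from your own site.
PAGE = "https://example.com/my-post/"

for bot in ("GPTBot", "ClaudeBot", "PerplexityBot", "Googlebot"):
    verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, PAGE) else "blocked"
    print(f"{bot}: {verdict}")
```

This only shows what a crawler that honours robots.txt would do. The file is a request, not a lock, so a bot that chooses to ignore it can still fetch the page.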
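But here’s the kicker: robots.txt is only a voluntary convention, and bots that disguise their identity slip past rules based on bot names, so many AI crawlers will probably ignore your signals.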
Also, bear in mind that blocking comes with trade-offs.
You might block bots that send traffic or improve your search visibility, meaning that you will lose potential site visitors.
Also, search engines don’t always differentiate between their traditional search crawling and their AI features.
So you can’t prevent your writing from being used for AI search results if you want to appear in the standard “ten blue links” of listings.
You can try, but the truth is that there’s no solution yet to protect your writing from AI.
Summary
Only recently did Cloudflare become the first Internet service to give site owners at least some degree of control over AI bots.
Is it effective? I’m not sure, but at least it gives an easy window into the activity.
The hard truth is that there is little you can do about the increasing use of AI training bots and the general proliferation of AI tools and platforms that seem to spring up almost every day.
All you can do is monitor, which I know is not a great help.
What’s the best course of action you can take? Keep writing for your readers, no matter what form your writing takes, even if organic traffic is harder to get nowadays.
It’s so easy to get caught up in the maelstrom of tech changes and forget about your prime aim.
When you are up to your neck in alligators, it’s hard to remember that your objective is to drain the swamp.