OpenAI’s GPTBot crawls the web for content that can be used by AI models. If you do not want this, you can block the bot.
The content that GPTBot visits can be used to improve future AI models, according to OpenAI. Those who give GPTBot access to their content are helping to make AI models more accurate, capable, and safe, the company writes.
Block GPTBot from crawling your site
If you do not want to share your content with OpenAI’s models for free, you can block GPTBot. By adding rules under “User-agent: GPTBot” to your robots.txt, you can block the bot either from visiting your site altogether or from visiting individual folders or categories on your site. This works the same way as blocking a Google crawler. GPTBot identifies itself as follows:
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +
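The robots.txt rules described above can be sketched and checked locally before they go live. The following is a minimal sketch, not an official OpenAI tool: it uses Python's standard urllib.robotparser module, and the domain example.com, the folders /blog/ and /private/, and the helper name is_allowed are hypothetical placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rule set 1: keep GPTBot off the whole site, leave other crawlers alone.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

# Rule set 2: allow GPTBot in one folder and block it from another.
# The folder names are hypothetical; replace them with your own paths.
BLOCK_PARTIAL = """\
User-agent: GPTBot
Allow: /blog/
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Pass the user agent token ("GPTBot"), not the full user-agent string.
    print(is_allowed(BLOCK_ALL, "GPTBot", "https://example.com/article"))
    print(is_allowed(BLOCK_ALL, "Googlebot", "https://example.com/article"))
    print(is_allowed(BLOCK_PARTIAL, "GPTBot", "https://example.com/blog/post"))
    print(is_allowed(BLOCK_PARTIAL, "GPTBot", "https://example.com/private/x"))
```

The text inside each triple-quoted string is what would go into the robots.txt file at the root of your site. Keep in mind that robots.txt is a request that crawlers honor voluntarily. It is not an access control mechanism.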
According to OpenAI, content behind paywalls, pages that request personally identifiable information, and pages that violate OpenAI’s content guidelines are automatically filtered out. Full instructions are available in OpenAI’s GPTBot documentation.
ChatGPT and the Content Dilemma
With the launch of ChatGPT’s web browsing feature, OpenAI announced that website owners such as publishers could block the crawling bot if they did not want their content to be used within or for ChatGPT.
Blocking the bot, however, means not being present in a potentially emerging content ecosystem – a dilemma similar to (non-)indexing in Google search, where content providers inadvertently become both suppliers to and financially dependent on a third-party ecosystem.
In the case of chatbots, however, the starting position for content providers is even less favorable: While search engines are (largely) designed to direct searchers to sites where they can provide value to the site operator, chatbots are optimized to provide searchers with the most direct and comprehensive answers possible directly in chat. This almost exclusively benefits the provider of the chatbot.
OpenAI does not currently offer web browsing, following the discovery that ChatGPT browsing could partially read content behind paywalls and pull it into the chat for free. It is not known when the browsing plugin will be back online. Perhaps OpenAI is concerned about further legal repercussions for the reasons mentioned above.
Meta, Microsoft and Google also train their chatbots on copyrighted material and pull content from websites into their chatbots without further consent. Publishers are reportedly in talks with these companies and want to charge billions for the use of their content.
So far, major chatbot providers like Microsoft have paid lip service, at best, to keeping the web ecosystem open. Google’s new AI search is designed to keep users in the Google ecosystem much longer than traditional web search does.