AI search startup Perplexity is under fire for allegedly harvesting content from websites without consent, using IP addresses outside its declared range.
In the rapidly evolving world of artificial intelligence (AI), a growing concern for major publishers is unauthorised content scraping by AI bots. AI companies like Perplexity, Anthropic, and OpenAI have been at the centre of these concerns, with some striking deals with publishers to access their content [1].
Recent reports have highlighted an increase in AI scraping activities. For instance, in Q1 2025, 26 million AI scrapes bypassed robots.txt files, and the share of bots ignoring these files increased from 3.3% to 12.9% during the quarter [2]. Notably, Perplexity, an AI search startup, has been identified as operating outside of its official IP range and using different Autonomous System Numbers (ASNs) to evade address-based blocking [3].
In response, publishers are adopting a multi-layered approach to prevent such unauthorised scraping. This strategy includes user-agent blocking, access monitoring and rate-limiting, content delivery via dynamic JavaScript, and strengthening legal terms and enforcement readiness [1][5].
User-agent blocking involves identifying and blocking crawler user agents such as GPTBot, ClaudeBot, CCBot, and PerplexityBot to deny them access. Access monitoring and rate-limiting control the number and frequency of requests from any one client, capping the volume of data that can be extracted. Content delivery via dynamic JavaScript renders content only after the initial page load in the browser, a technique most bots struggle to parse effectively [1][2].
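As an illustration, the first two layers above can be sketched in a few lines of server-side Python. The bot names come from the article; the request threshold and time window are arbitrary assumptions, and a production deployment would typically enforce these rules at the web server or CDN rather than in application code.

```python
import time
from collections import defaultdict

# Denylist of AI crawler user agents named in the article; real
# deployments keep this list in configuration and update it often.
BLOCKED_AGENTS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")

# Per-client request timestamps for a simple fixed-window rate limit.
_requests = defaultdict(list)
RATE_LIMIT = 60       # assumed cap: max requests per client per window
WINDOW_SECONDS = 60   # assumed window length

def is_allowed(client_ip: str, user_agent: str) -> bool:
    """Return False if the request should be blocked or throttled."""
    # Layer 1 -- user-agent blocking: refuse self-identified AI crawlers.
    if any(bot in user_agent for bot in BLOCKED_AGENTS):
        return False
    # Layer 2 -- rate limiting: drop timestamps outside the window,
    # then count what remains for this client.
    now = time.time()
    recent = [t for t in _requests[client_ip] if now - t < WINDOW_SECONDS]
    if len(recent) >= RATE_LIMIT:
        _requests[client_ip] = recent
        return False
    recent.append(now)
    _requests[client_ip] = recent
    return True
```

Both checks are deliberately naive: a scraper can spoof its user agent, which is why the later layers (IP analysis, JavaScript rendering, legal terms) exist at all.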
Legal measures include clearly stating rules against unauthorised scraping in the website’s terms of service to create enforceable agreements and pursuing takedown or legal action if needed [1][5]. Publishers are also strengthening their terms of use and data-sharing policies to explicitly prohibit unauthorised scraping and downstream use to deter commercial aggregators [5].
Additional technical measures, though not always foolproof, include proxy detection, IP blocking, CAPTCHA challenges, and bot behaviour analysis [3]. However, savvy scrapers may employ headless browsers, IP rotation, and distributed scraping architectures to evade these measures.
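IP blocking runs into exactly the evasion described above: a crawler operating outside its declared address range. A minimal sketch of the consistency check, assuming a hypothetical list of declared crawler ranges (the networks below are reserved documentation ranges, not any vendor's real addresses):

```python
import ipaddress

# Illustrative stand-in ranges only; operators would fetch a crawler's
# real published ranges from the vendor or an IP intelligence feed.
DECLARED_CRAWLER_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1 (RFC 5737)
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2 (RFC 5737)
]

def identity_matches_origin(client_ip: str, claims_to_be_crawler: bool) -> bool:
    """Flag requests whose self-declared crawler identity does not match
    the crawler's declared IP ranges -- the mismatch the article reports."""
    ip = ipaddress.ip_address(client_ip)
    in_declared_range = any(ip in net for net in DECLARED_CRAWLER_RANGES)
    # A request claiming to be the crawler but arriving from outside the
    # declared ranges is suspicious and can be blocked or challenged.
    return not (claims_to_be_crawler and not in_declared_range)
```

Scrapers that rotate through residential proxies while *not* identifying themselves defeat this check entirely, which is why behaviour analysis is usually layered on top.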
Compliance with the Robots Exclusion Protocol is voluntary, and the rising lack of compliance has led companies like Cloudflare to offer defensive technology to publishers [4]. Cloudflare, a network infrastructure company, has entered the bot gatekeeping business and claims that Perplexity bots ignore websites' no-crawl directives [4].
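Python's standard library ships a robots.txt parser, which makes the voluntary nature of the protocol easy to see: the directives only take effect if a crawler bothers to consult them. A small sketch, using a hypothetical no-crawl rule for PerplexityBot:

```python
from urllib.robotparser import RobotFileParser

# A no-crawl directive aimed at one AI bot. Compliance is entirely
# voluntary: robots.txt only tells well-behaved crawlers what to skip.
robots_txt = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler runs this check before fetching; a non-compliant
# one simply never calls it, which is why server-side blocking exists.
print(parser.can_fetch("PerplexityBot", "https://example.com/article"))  # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))    # True
```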
The future of this scenario is uncertain. The rise of paywalls could cut AI firms off from quality content, while the free web might become a sea of synthetic AI slop. The AI bubble may also collapse under the weight of unrequited capital expenditure [6].
References:
[1] [Website A] (URL)
[2] [Website B] (URL)
[3] [Website C] (URL)
[4] [Website D] (URL)
[5] [Website E] (URL)
[6] [Website F] (URL)
In a separate development, TollBit published its Q1 2025 State of the Bots report, which found an 87% increase in scraping during the quarter [7]. The report also found that retrieval-augmented generation (RAG)-oriented scraping has surpassed training-oriented scraping [7].
Perplexity launched its Publisher Program to pay participating partners, even as the company's bots have been identified trying to disguise their content-scraping activity [8]. The ratio of scrapes to referred human site visits for Bing was 11:1, while the ratios for AI-only apps were far higher: OpenAI 179:1, Perplexity 369:1, and Anthropic 8,692:1 [9].
It remains unclear whether the cloud giants will pay for long-tail content, or whether AI firms can survive the spread of paywalls. Possible outcomes include a business model that works for both AI firms and publishers, publishing retreating behind subscription walls, the free web becoming a sea of synthetic AI slop, or the AI bubble collapsing under the weight of unrequited capital expenditure.
References:
[7] [Website G] (URL)
[8] [Website H] (URL)
[9] [Website I] (URL)
- Concerns about unauthorised content scraping by AI bots have risen notably among major publishers.
- AI companies such as Perplexity, Anthropic, and OpenAI are at the centre of these concerns; some have struck agreements with publishers for licensed access to their content.
- In response, publishers are adopting a multi-layered defence: legal measures such as prohibiting unauthorised scraping in their terms of service, alongside technical measures like user-agent blocking, proxy detection, and IP blocking.
- Network infrastructure companies such as Cloudflare are offering defensive technology to help publishers protect their sites and data against unauthorised AI scraping.