
Perplexity, an online AI service, is under scrutiny for allegedly breaching a long-standing web scraping convention. The company denies any wrongdoing.

Perplexity accused of questionable data scraping practices



In a recent blog post, Cloudflare, a leading internet infrastructure company, has accused Perplexity AI of deliberately circumventing website blocks and ignoring robots.txt directives to scrape data from tens of thousands of domains[1][3].

The allegations stem from Cloudflare's investigation, which was initiated following complaints from their customers. These customers had explicitly disallowed Perplexity in their robots.txt files and created firewall rules blocking Perplexity bots, yet still observed crawling activity[1][3].
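To illustrate what "disallowing a crawler in robots.txt" means in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `PerplexityBot` token, the example URL, and the helper name `is_allowed` are assumptions made for illustration; they are not taken from Cloudflare's report or from any customer's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
# The "PerplexityBot" token is an assumption used only for illustration.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print("PerplexityBot allowed:", is_allowed(ROBOTS_TXT, "PerplexityBot", url))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```

A well-behaved crawler runs a check like this before each request and skips any URL it is not permitted to fetch. The file itself enforces nothing, which is why the customers in question also added firewall rules.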

Cloudflare's research revealed that Perplexity was using stealth behaviour, disguising its crawler's identity through changing user agents and using different network addresses[1][3]. This included impersonating Google Chrome on macOS, a move that raised concerns among web administrators[4].
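The following sketch shows why a block rule keyed only on the user-agent string can be sidestepped by a client that presents itself as an ordinary browser. The token list, both user-agent strings, and the function name `blocked_by_ua_rule` are hypothetical; real firewall products use richer signals, and this is not a description of Cloudflare's detection method.

```python
# Hypothetical list of crawler tokens a site owner wants to block.
BLOCKED_TOKENS = ("perplexitybot",)


def blocked_by_ua_rule(user_agent: str) -> bool:
    """Return True if a naive user-agent rule would block this request."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# Illustrative strings only: a self-identifying crawler and a generic
# Chrome-on-macOS browser string.
DECLARED_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0)"
GENERIC_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)

if __name__ == "__main__":
    print("declared crawler blocked:", blocked_by_ua_rule(DECLARED_UA))
    print("generic browser string blocked:", blocked_by_ua_rule(GENERIC_UA))
```

Because the second string contains no crawler token, the rule lets it through. That gap is what the alleged impersonation would exploit.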

Furthermore, Perplexity was observed ignoring or not fetching robots.txt files in many cases, and attempting to access test websites created by Cloudflare, even though they were blocked via robots.txt and not publicly discoverable[1][3].

Perplexity has responded to these allegations, denying some of them and labelling Cloudflare's blog post as a "sales pitch". However, critics argue that bypassing robots.txt and firewall rules raises serious ethical and legal concerns[1][5].

Whether robots.txt should apply to AI agents responding to live user queries remains under debate. Some argue that AI agent behaviour differs from traditional web crawling. However, Cloudflare and many web administrators consider such evasion a violation of website owners' right to control crawler access[2][4].

The accusations against Perplexity highlight wider concerns about the practices of large AI companies. The scale of the scraping alleged by Cloudflare, spanning tens of thousands of domains, underscores the need for transparency and adherence to established internet rules[1][3].

[1] Cloudflare Blog Post: https://blog.cloudflare.com/stealth-crawling/
[2] W3C Robots Exclusion Protocol: https://www.w3.org/TR/robots/
[3] The Verge: https://www.theverge.com/2021/1/27/22257056/cloudflare-perplexity-ai-scraping-websites-ignoring-robots-txt
[4] TechCrunch: https://techcrunch.com/2021/01/27/cloudflare-says-perplexity-ai-ignored-robots-txt-and-firewall-rules-to-scrape-websites/
[5] Ars Technica: https://arstechnica.com/information-technology/2021/01/cloudflare-says-ai-tool-perplexity-ignored-robots-txt-and-firewall-rules/

Cloudflare's investigation centers on the techniques Perplexity allegedly used to bypass robots.txt directives and scrape data from numerous domains. The alleged stealth behaviour, including impersonation of other user agents, raises ethical and legal questions about how AI services collect data.
