The legal basis of legitimate interests: Focus sheet on the measures to implement when collecting data by web scraping
The collection of data accessible online by web scraping must be accompanied by measures to guarantee the rights of data subjects.
Reminder of the CNIL's doctrine
The practice of data scraping has grown significantly, especially with the rapid expansion of generative AI systems, which use vast amounts of freely accessible online data. However, such techniques pose inherent risks to the rights and freedoms of individuals, who have no control over the re-use of their data accessible online. The widespread use of web scraping has thus transformed the nature of internet use: third parties can now read, collect, and reuse any data that individuals publish online, which poses significant and unprecedented risks for data subjects.
The CNIL has regularly called for vigilance regarding these practices and has issued a series of recommendations to follow when implementing them. The CNIL has also advocated for the creation of an ad hoc legislative framework (see, in particular, the CNIL opinion of December 15, 2022 on the “Polygraphe” project, in French) aimed at providing legal certainty for entities that use such practices, regulating those practices, and protecting personal data freely accessible online.
In some cases, the CNIL has deemed such practices prohibited in the absence of a legal framework (in particular where the processing is carried out by competent authorities for the purpose of detecting infringements). Conversely, it has accepted them in other cases, provided that stringent safeguards were in place, for example when searching the internet for information leaks (RIFI).
For the time being, in the absence of a specific legal framework, this how-to sheet restates controllers’ obligations and specifies the conditions under which such processing may be carried out to develop an AI system.
The lawfulness of web scraping depends in particular on whether a valid legal basis can be relied upon. Collecting data accessible online to create a training dataset may be based on legitimate interests, provided that the conditions set out in the legitimate interests how-to sheet are met.