Informing data subjects
Organizations that process personal data to develop AI models or systems must inform the data subjects concerned. The CNIL specifies the obligations in this regard.
Ensuring the transparency of processing
The principle of transparency requires organizations that process personal data to inform data subjects so that they understand why and how their data will be used and can exercise their rights (rights of opposition, access, rectification, etc.).
This principle applies to any processing of personal data, regardless of whether the data is:
- directly collected from the data subjects: for example, in the context of a contract with voluntary actors to create a training dataset, when providing a service, in the context of a relationship between a citizen and a public body, etc.;
- or indirectly collected: for example, when data is collected on the Internet via file downloads, using web scraping tools or using application programming interfaces (APIs) made available by online platforms to re-users; when the information is obtained from institutional or business partners such as data brokers, or by reusing an existing dataset, etc.
Note: if the organization processing the data has not directly collected the personal data from the data subjects, it may be exempted from informing data subjects individually if such information is impossible in practice or would require disproportionate efforts.
What information should I provide?
In any case
Any organization that creates or uses a training dataset to develop an AI system based on personal data must inform data subjects of the following, regardless of whether the data was collected directly or indirectly:
- the organization’s identity and contact details (such as e-mail address, postal address or telephone number) and the means of contacting the data protection officer;
- the purpose and legal basis of the processing, including, where appropriate, details of the legitimate interest pursued if the processing is based on that legal basis;
- the recipients or at least the categories of recipients of the data, with, where appropriate, details of any planned transfers of those data to a country outside the European Union;
- the retention period of the data (or, if not possible, the criteria for determining it);
- the rights of data subjects (rights of access, rectification, erasure, restriction, right to portability, right to object or withdraw consent at any time);
- the right to lodge a complaint with the CNIL.
To note: While information on the retention period and the exercise of rights does not have to be provided systematically for all processing operations, it will almost always be required for the creation and use of training datasets, as it ensures that the processing is fair and transparent towards data subjects.
Furthermore, where appropriate, that information must state that the controller will not be able to identify individuals, including for the purpose of responding to their requests to exercise their rights. In this case, the CNIL recommends indicating what additional information individuals wishing to exercise their rights can provide to enable their identification.
In addition, in the case of indirect data collection
In the case of indirect collection, organizations must also provide:
- Categories of personal data (e.g. identities, contact details, images, social media posts, etc.);
- The source(s) of the data (including whether or not they are publicly available).
Organizations must provide sufficiently precise information on the source of the data in order to ensure fair and transparent processing. Such information must facilitate the exercise by data subjects of their rights with regard to the source processing. When individuals are not informed individually, the accessibility of that information is essential to enable them to determine whether they are concerned by the processing in question.
The CNIL recommends that the controller inform individuals of the precise identity of each of the data sources used, unless this is materially impossible or requires disproportionate technical measures.
In the case of re-using a dataset or an AI model subject to the GDPR
The CNIL recommends, at a minimum, providing the means to contact the controller from whom the dataset was obtained. Additionally, a good practice is to link directly to the original controller’s website via a hyperlink, and to accompany the information with a concise and clear explanation of the conditions for data collection and annotation.
In the case of web scraping websites
The CNIL recommends, at a minimum, providing the categories of source sites concerned, and if possible the domain names and URLs of the web pages concerned.
Individuals will be able to obtain more precise information by exercising their right of access (see the dedicated how-to sheet for more information).
Furthermore, Article 53 of the European AI Act requires providers of general-purpose AI models to draw up and make publicly available a sufficiently detailed summary of the content used for training the model, in accordance with a template to be provided by the AI Office, in order to facilitate the exercise and enforcement of rights by parties with legitimate interests. This includes, for example, listing the main datasets or collections used to train the model, such as large private or public datasets or private data archives, and giving details of the other data sources used (recital 107 of the AI Act). Controllers subject to this obligation may usefully refer to this summary or include it in their information notices.
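In practice, controllers documenting their data sources (for example with a view to the training-content summary mentioned above) may find it useful to maintain a structured manifest from which both the categories of source sites and, where known, the domain names can be derived. The sketch below is a minimal, hypothetical illustration in Python; the field names are assumptions, not a prescribed CNIL or AI Office format.

```python
import json

# Hypothetical manifest of training-data sources; the field names are
# illustrative only, not a prescribed CNIL or AI Office template.
sources = [
    {
        "category": "film review websites",           # category of source sites
        "domain": "example-reviews.org",              # domain name, where known
        "urls": ["https://example-reviews.org/top"],  # pages scraped, if available
        "publicly_available": True,
    },
    {
        "category": "licensed partner dataset",
        "domain": None,                               # not collected by web scraping
        "urls": [],
        "publicly_available": False,
    },
]

def sources_by_category(sources):
    """Group sources by category so that an information notice can list
    the categories of source sites and, where possible, the domains."""
    grouped = {}
    for s in sources:
        grouped.setdefault(s["category"], []).append(s["domain"])
    return grouped

print(json.dumps(sources_by_category(sources), indent=2))
```

Keeping such a manifest alongside the dataset makes it straightforward to publish the source information in notices and to answer access requests about where a given individual's data may have come from.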
On AI models whose processing is subject to the GDPR
Training an AI model can sometimes cause it to “memorize” part of the training data (see the ‘Questionnaire on the application of the GDPR to AI models’). Where the model is found to be subject to the GDPR, individuals should be informed.
The provider of the AI model or system should then specify the model-specific information, including:
- the recipients or at least the categories of recipients of the model (for example, its users for a system marketed under license or “as a service”, or the categories of persons likely to download the model published in open source), with, where appropriate, details of the intended transfers of that data to a country outside the European Union;
- the retention period of the model (or, if not possible, the criteria for determining it) if it differs from the retention period of the training data;
- the rights of data subjects on the model, in accordance with the conditions described above and in the rights management form, such as:
- the right of access;
- the right of rectification;
- the right to erasure;
- the right to restriction of processing;
- the right to object or withdraw consent at any time.
As a good practice, the CNIL also recommends that the provider specify:
- the nature of the risk associated with reconstructing data from the model, such as the risk of data regurgitation in the case of generative AI;
- the measures taken to limit those risks, and the existing redress mechanisms if such risks arise, such as the possibility of reporting an occurrence of regurgitation to the organization.
When should the information be provided?
Where the controller collects the training data directly from the data subjects, it must inform the data subjects at the time of such collection.
Conversely, in the case of indirect data collection, the organization must inform the data subjects as soon as possible, and at the latest at the time of the first contact with them or at the time of the first communication of the data to another recipient. In any case, the organization must inform the data subjects within a period not exceeding one month after the date on which it obtained their data.
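The timing rule for indirect collection can be expressed as a simple computation: the information must be provided by the earliest of the first contact with the data subject, the first communication of the data to another recipient, and one month after the data were obtained. The helper below is an illustrative sketch only; the function name is hypothetical and the one-month period is approximated as 30 days for simplicity, which is an assumption, not legal guidance.

```python
from datetime import date, timedelta

def information_deadline(obtained, first_contact=None, first_disclosure=None):
    """Latest date for informing a data subject after indirect collection:
    at the latest one month after obtaining the data, or earlier if the
    data subject is contacted or the data are disclosed to another
    recipient before then. One month is approximated as 30 days here."""
    candidates = [obtained + timedelta(days=30)]
    if first_contact is not None:
        candidates.append(first_contact)
    if first_disclosure is not None:
        candidates.append(first_disclosure)
    return min(candidates)

# Data obtained on 1 March; first contact planned for 10 March:
# the data subjects must be informed no later than 10 March.
print(information_deadline(date(2025, 3, 1), first_contact=date(2025, 3, 10)))
```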
As a good practice, and depending on the nature and extent of the risks associated with the memorization of personal data in the model, the CNIL invites organizations to allow a reasonable period of time between informing individuals that their data are contained in a training dataset and the training of a model on it (whether by the organization itself or following the dissemination of the dataset). This enables data subjects to exercise their rights during that period, given the technical difficulty of exercising these rights on the model itself afterwards and the risks this entails (in particular depending on the nature of the data that has been memorized).
How do I provide the information?
Ensuring the accessibility of the information
Data subjects must not encounter difficulties in accessing or understanding the information.
Information notices must be distinguished from other information unrelated to data protection (General terms and conditions, legal notice, etc.). In this regard, while the information published on the websites of controllers may relate to many processing operations and concern different categories of persons (e.g. users of the website in question, data subjects involved in the development phase of AI systems, data subjects involved in their deployment, etc.), it is recommended to clearly distinguish information on processing activities carried out for development purposes from the information on other processing activities.
In practice, there are several ways to provide it:
- where individual information is provided, it may appear on the online form used by the website to collect data, be mentioned in emails or letters sent by a data re-user during its first contact with data subjects, or be delivered via a pre-recorded voice message, etc.
- when providing general information (i.e. in the cases detailed below), it may for example take the form of information notices published on a freely accessible website or on a notice board.
Ensuring the intelligibility of the information
The information should be as clear and concise as possible (simple vocabulary, short sentences, direct style, etc.) and adapted to the conditions of interaction with people.
The complexity of artificial intelligence systems makes it difficult to draft information that everyone can understand. Nevertheless, data subjects need to understand the information provided in order to anticipate the potential consequences of the processing on their privacy. In this regard, it is recommended that controllers, in addition to providing the information required by Articles 13 and 14 of the GDPR, set out separately and clearly the main consequences of the processing: in other words, what the actual effect of the specific processing will be.
The information could thus detail, for example by means of diagrams, how the data is used during the training phase, during the functioning of the AI system developed, as well as the distinction that needs to be made between the training dataset, the AI model and the outputs of the model.
To achieve these objectives, the CNIL recommends providing information in several layers, prioritizing essential information (identity of the controller, purposes and rights of individuals) at the first level while making complete information available elsewhere.
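A layered notice can be modelled as a small data structure: a first layer carrying the essentials, and a second layer (or a link to it) carrying the full detail. The sketch below is purely illustrative; the keys, wording and URL are hypothetical, not a prescribed format.

```python
# Illustrative two-layer information notice. All keys, wording and the
# URL below are hypothetical examples, not a prescribed CNIL format.
notice = {
    "first_layer": {
        "controller": "Example SA",
        "purposes": "constitution of a training dataset for an AI model",
        "rights": "access, rectification, erasure, objection",
        "full_notice_url": "https://example.org/privacy/ai-training",
    },
    "second_layer_sections": [
        "legal basis and legitimate interest",
        "data sources and categories",
        "retention periods",
        "recipients and transfers outside the EU",
        "how to exercise your rights",
    ],
}

def first_layer_text(n):
    """Render the first layer as a short, plain-language sentence."""
    f = n["first_layer"]
    return (f"{f['controller']} processes your data for: {f['purposes']}. "
            f"Your rights: {f['rights']}. Full notice: {f['full_notice_url']}")

print(first_layer_text(notice))
```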
Special attention should be given to ensure that information regarding the processing of minors' data is sufficiently comprehensible.
Cases where the provision of individual information is not mandatory
The GDPR outlines several derogations from the obligation to inform individuals (e.g. where a provision of EU or national law permits exclusion under Article 23). The developments below focus on the derogations most relevant to organizations that have collected training data indirectly.
Situation 1: The data subject has already obtained the information (14.5.a GDPR)
When data subjects have already been informed of all the details of the processing, in particular the purpose and identity of the controller, new information is not necessary.
To be noted: where the data are collected from a third party, the controller must ensure that data subjects have already been fully informed about its own processing.
As a good practice, the CNIL encourages data re-users to rely on the organization disseminating the data to inform individuals, in particular when the latter is still in contact with the data subjects.
For example, the provider of an online education service could inform its customers that their data will be processed by a named third party in order to develop an AI system for teachers. It would then be necessary to provide all the relevant information about this processing.
Situation 2: Information would require disproportionate effort (Article 14.5.b GDPR)
Where informing data subjects individually would require disproportionate effort, the controller may instead simply make the information publicly available.
This provision is often invoked by organizations that are not or no longer connected with the data subjects (e.g. in the case of re-use of a dataset created by a third party). Indeed, in this case, they usually do not have their contact data.
A case-by-case analysis must be carried out, taking into account the specific context of each processing activity.
Assessing the disproportionate nature of individual information
The organization must assess and document the disproportionate nature of individual information by weighing, on the one hand, the interference with the privacy of the data subjects and, on the other hand, the efforts required to inform data subjects individually.
- To determine the extent of these efforts, factors such as the absence of contact details of the data subjects, the age of the data (uncertain accuracy, e.g. contact details older than 10 years), or the number of data subjects, should be considered.
For example: the controller who intends to reuse the data of its customers and still has their email address should always use it to inform them individually.
- In assessing the invasion of the privacy of data subjects and the intrusiveness of the processing, the risks associated with the processing should be considered (more or less directly identifying nature of the data, sensitivity of the data, etc.) as well as any safeguards put in place (such as pseudonymisation, measures resulting from the performance of a data protection impact assessment (DPIA), such as the reduction of the retention period or the implementation of various technical and organizational security measures).
Special case of online data collection via web scraping
- When collecting data published online in pseudonymised form, individual information will most often be disproportionate if it involves searching for or collecting more identifying data such as the actual identity of the individuals.
For example: in the case of re-use of a training dataset lawfully published in open source and containing only pseudonymised data.
- When collecting personal data online that has not been published in pseudonymised form, a case-by-case analysis should be carried out to assess whether it is necessary to seek to inform individuals individually through a means of contact (e.g. by searching for their contact details or by using a messaging system on the website in question).
For example: the provision of general information will be sufficient for the establishment or use of a dataset consisting of film reviews made manifestly public online by persons whose pseudonym is not collected or retained where it would be disproportionate to retrieve their contact data.
For more information, see the how-to sheet on web scraping for AI.
To be noted: this derogation will more easily apply to organizations creating training datasets of AI systems for scientific research purposes.
For example: the provision of general information will be sufficient for the use of a dataset of freely accessible profile photographs in the development of a deepfake algorithm for scientific research purposes.
Appropriate measures that may be taken by the organization in addition to general information
In addition to providing general information publicly available (e.g. by publishing the information on the organization’s website), other appropriate measures may be taken by the organization in this case, such as:
- conducting a data protection impact assessment (DPIA);
- applying data pseudonymisation techniques;
- reducing the number of data collected and the retention period;
- implementing technical and organizational measures to ensure a high level of security.
Best practices for more transparency on development processing
The CNIL also emphasizes the following good practices in terms of transparency:
- publishing any DPIA (this publication may be partial where certain sections are subject to protected secrets such as trade secret or business secrecy);
- publishing any documentation concerning the establishment of the dataset (for example on the basis of the model proposed by the CNIL), the development process, or the AI system and its operation;
- implementing recommended transparency practices in this area, such as:
- adopting practices related to open source development, such as the publication of model weights, source code, etc.
- providing transparency on matters beyond data protection, such as:
- key concepts of machine learning (e.g. learning, inference, memorization, or the different types of attacks on AI systems);
- the measures implemented to limit harmful or dangerous uses of the system;
The CNIL considers that public acceptance of AI technology requires a better understanding of how these tools operate. It therefore encourages stakeholders, including designers and users, to make their practices, as well as the functioning and risks associated with the use of AI methods, transparent and widely understood.
- in the case of models requiring large-scale data collection, such as Large Language Models (LLMs), the general information may be complemented by a media campaign across different media outlets to inform data subjects.
To note: adherence to these transparency practices may, in particular, be regarded as an important safeguard when assessing the proportionality of the processing.