Respect and facilitate the exercise of data subjects’ rights
Individuals whose data is collected, used or reused to develop an AI system have rights over their data that allow them to maintain control over it. Controllers are responsible for complying with these rights and for facilitating their exercise.
The exercise of rights is closely linked to the provision of information on processing operations, which is also an obligation.
General reminder on applicable rules
Data subjects have the following rights over their personal data:
- right of access (Article 15 GDPR);
- right of rectification (Article 16 of the GDPR);
- right to erasure, also known as the right to be forgotten (Article 17 of the GDPR);
- right to restriction of processing (Article 18 GDPR);
- right to data portability where the legal basis for processing is consent or contract (Article 20 GDPR);
- right to object where the processing is based on legitimate interest or public interest (Article 21 GDPR);
- right to withdraw consent at any time where the processing of personal data is based on consent (Article 7(3) GDPR).
They must be able to exercise their rights on both training datasets and AI models if the latter are not considered anonymous (see the questionnaire on the application of the GDPR to AI models).
It must be possible for data subjects to exercise their rights by a simple written or oral request.
The controller must inform the data subjects of the person to be contacted for this and implement an internal procedure laying down the conditions for the management and monitoring of the exercise of rights.
With regard to the development of AI models or systems: how to respond to data subject rights?
In practice, responding to such requests differs considerably depending on whether they relate to the training data or to the model itself.
For the exercise of rights on training datasets
Difficulties in identifying the data subjects
If the controller does not need or no longer needs to identify the data subject and can demonstrate that it is unable to do so, it does not have to store or collect additional information for the sole purpose of enabling the exercise of rights. Article 11 GDPR then requires it to inform individuals accordingly, if possible.
This will often be the case for the creation and use of training datasets, whether annotated or not. The provider of an AI system does not, in principle, need to identify the persons whose data are present in its training set. It does not therefore have to keep identifiers in a pseudonymised dataset to enable the re-identification of persons for the sole purpose of responding to requests for the exercise of rights.
For example:
- An organisation does not have to keep the faces of an image dataset just to allow the exercise of rights over a set of photographs if compliance with the principle of minimisation requires it to blur the faces.
- An organisation does not have to keep identification data in the dataset for the sole purpose of allowing the possible withdrawal of consent of individuals if they have been duly informed of this and have consented to it in an informed manner (e.g. by being informed when erasure of the data stored by the AI model will not be possible).
However, there are two limits to this logic:
- First, in this particular case, the GDPR permits individuals to provide additional information, which will sometimes allow them to be identified and to exercise their rights.
In this regard, the GDPR does not provide for any particular form. A data subject could therefore, for example, provide an image or a pseudonym (such as a username under which the data subject has an online activity).
It is recommended to anticipate these identification difficulties by indicating the information likely to allow the identification of the data subject (e.g. in the case of homonymy). This may include allowing them to provide a particular file (such as an image, audio or video recording), or other relevant information where providing such a file is not possible.
- Second, certain safeguards relating to the exercise of rights may be additional measures necessary to rely on legitimate interest (see how-to sheet on legitimate interest), for example by providing individuals with a prior and discretionary objection mechanism. If the controller intends to rely on such safeguards, it may then process data to allow the identification of individuals and thus guarantee them effective control over their data.
Technical and organisational measures should be considered to retain a certain amount of metadata or other information on the source of the data collection in order to facilitate the search for a person or data within the dataset. This will be particularly relevant for information that is already publicly available and the retention of which does not create additional risks for the data subjects.
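A minimal sketch of such a measure, using purely hypothetical field names, is to keep lightweight source metadata alongside each record so that a data subject's request can later be matched against the dataset:

```python
# Minimal sketch (hypothetical field names): keeping source metadata alongside each
# record so that a data subject's request can be matched against the dataset.
from dataclasses import dataclass
from datetime import date

@dataclass
class TrainingRecord:
    content: str                 # the collected text (or a path to an image/audio file)
    source_url: str              # page or dataset the record was collected from
    collected_on: date           # date of collection
    licence: str | None = None   # licence or terms under which the data was reused

def find_records(dataset: list[TrainingRecord], hint: str) -> list[TrainingRecord]:
    """Return records whose content or source matches the information supplied
    by the data subject (e.g. a username or a URL they provided)."""
    hint = hint.lower()
    return [r for r in dataset
            if hint in r.content.lower() or hint in r.source_url.lower()]
```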
Point of vigilance: the use of overly intrusive methods cannot be justified by the need to enable the exercise of rights.
If the controller can identify individuals, it must respond to their requests to exercise their rights, in particular the following.
With regard to information on data processing (under the right of access):
Allowing data subjects to know the precise recipients and sources of their data is essential to ensure effective control over their data. This information allows them to exercise their rights vis-à-vis the controllers holding their data. This is particularly important given the often complex chain of actors involved in the development of AI systems.
Information on recipients or categories of recipients
While information may be limited to categories of recipients, the right of access allows a person to obtain specific information about the processing of his or her data.
Where the data subject requests so, the right of access includes the right to obtain the identity of the recipients of the processed data and not only the categories of recipients (as recalled by the ECJ in Case C-154/21 of 12 January 2023).
An organisation can respond to such requests by only providing information on the categories of recipients if it is not possible to identify them precisely (e.g. in the case of an open source publication) or where it demonstrates that the request is manifestly unfounded or excessive.
In particular, if the dataset is intended to be shared with a large number of third parties (for example in the context of research work), the CNIL recommends setting up an authentication or API mechanism to record the identities of third parties and the data accessed.
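As an illustration only, and with hypothetical names and log format, such a mechanism can be reduced to an append-only log keyed by the authenticated recipient and the records it downloaded:

```python
# Minimal sketch (hypothetical names): logging which authenticated third party
# downloaded which part of the dataset, so that recipients can later be identified
# for right-of-access requests or Article 19 notifications.
import json, time

ACCESS_LOG = "dataset_access_log.jsonl"

def log_dataset_access(recipient_id: str, record_ids: list[str]) -> None:
    """Append one entry per download performed by an authenticated recipient."""
    entry = {"timestamp": time.time(), "recipient": recipient_id, "records": record_ids}
    with open(ACCESS_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

def recipients_of(record_id: str) -> set[str]:
    """List every recipient that downloaded a given record."""
    recipients = set()
    with open(ACCESS_LOG, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if record_id in entry["records"]:
                recipients.add(entry["recipient"])
    return recipients
```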
Information on the source of the data
Where the data have not been collected directly from the data subjects (for example, in the case of collection from a data broker), the right of access allows individuals to obtain any available information about the source of their data.
For example:
- When using web scraping techniques to collect data accessible online, the controller must retain the domain names, but also the URLs, of the web pages on which the data were collected in order to be able to transmit them to the data subjects who request them.
- When re-using a dataset accessible online, the controller must retain the identity of the source controller. The CNIL recommends providing persons exercising their right of access with the means to contact this source. A good practice is to provide an understandable description of the data the dataset contains as well as the conditions of its collection.
Where the training dataset has been compiled from more than one source (e.g. from more than one dataset), the controller must take traceability measures to be able to provide each data subject with accurate information on the sources of their data. Where this is not possible, they should be provided with all relevant information, i.e. on all sources used.
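One possible way to implement this traceability, sketched here with a hypothetical record structure, is to attach a source identifier (source controller, dataset name or collection URL) to every record at the time the sources are merged:

```python
# Minimal sketch (hypothetical structure): when compiling a training set from
# several sources, keep a per-record reference to the source so that the exact
# origin of each person's data can be reported on request.
def merge_with_provenance(sources: dict[str, list[dict]]) -> list[dict]:
    """`sources` maps a source identifier (e.g. dataset name or source controller)
    to its records; each merged record keeps that identifier."""
    merged = []
    for source_id, records in sources.items():
        for record in records:
            merged.append({**record, "source": source_id})
    return merged

# Example use: answering "where does my data come from?" for a given person_id.
# merged = merge_with_provenance({"dataset_A": [...], "web_scrape_2024": [...]})
# sources_for_person = {r["source"] for r in merged if r.get("person_id") == requested_id}
```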
With regard to the right to receive a copy of the learning data
The right of access allows any individual to obtain, free of charge, a copy of all the data processed concerning him or her. This presupposes the right to obtain extracts from training datasets where this is indispensable to enable the data subject to effectively exercise his or her other rights (ECJ, 4 May 2023, Case C-487/21). In this regard, the CNIL recommends providing the data itself, but also the associated annotations and metadata in an easily understandable format.
Such communication shall be without prejudice to the rights and freedoms of others, which shall include in particular the rights of other data subjects, or the intellectual property rights or trade secrets of the dataset holder.
With regard to the modification of the database
Data subjects have the right to supplement or rectify data concerning them. In the present case, this may concern incorrect or incomplete annotations which the data subjects would like to correct.
The right to erasure makes it possible to require the deletion of data in a number of cases, for example where the data subject reveals to the controller that it holds sensitive data concerning him or her (within the meaning of Article 9 GDPR) whose processing is not justified by any derogation.
Furthermore, where the processing is based on a legitimate interest or the performance of a task carried out in the public interest, data subjects may, at any time, object to it on grounds relating to their particular situation. Such a reason would result, for example, from the retention of compromising photographs in a dataset or statements made on the web when the person was a minor.
When web scraping data accessible online, a good practice is to set up a “black list”. This list, managed by the controller, would allow data subjects to object to the collection of their data on certain websites or online platforms by providing the information allowing their identification on those different sites, including upstream of the collection. This measure can also contribute to the proportionality of the processing operation.
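By way of illustration only, and with hypothetical site names and identifiers, such a list can be consulted before any collection takes place:

```python
# Minimal sketch (hypothetical names): an opt-out "black list" consulted before and
# during collection, so that data relating to persons who objected is never scraped.
OPT_OUT_LIST = {
    # site -> identifiers (usernames, profile URLs) provided by data subjects
    "example-forum.org": {"alice_1987"},
    "example-social.net": {"https://example-social.net/users/bob"},
}

def is_opted_out(site: str, identifier: str) -> bool:
    """True if this identifier was registered by a data subject objecting to collection."""
    return identifier in OPT_OUT_LIST.get(site, set())

def scrape_profile(site: str, identifier: str):
    if is_opted_out(site, identifier):
        return None  # skip collection upstream, as requested by the data subject
    # ... actual collection logic would go here ...
```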
With regard to the notification of the rights exercised (rectification, limitation and erasure)
Article 19 GDPR provides that a controller shall notify each recipient to whom the personal data have been disclosed of any rectification, erasure of personal data or restriction of processing carried out, unless such communication proves impossible or requires disproportionate effort.
The controller must take into account the available technology and the cost of implementation. For example, the CNIL recommends the use of application programming interfaces (APIs) (particularly in the riskiest cases), or at least techniques for logging data downloads.
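Building on a download log such as the one sketched earlier (the log format and the notify callback are assumptions), the Article 19 notification can then be driven by that log:

```python
# Minimal sketch (hypothetical names): notifying each recipient that previously
# downloaded a record which has since been rectified, erased or restricted
# (Article 19 GDPR), using a JSON-lines download log.
import json

def notify_recipients(log_path: str, record_id: str, action: str, notify) -> None:
    """Call `notify(recipient, record_id, action)` once per recipient of the record;
    `action` is "rectification", "erasure" or "restriction"."""
    seen = set()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            if record_id in entry["records"] and entry["recipient"] not in seen:
                seen.add(entry["recipient"])
                notify(entry["recipient"], record_id, action)
```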
Furthermore, when sharing a dataset, a good practice is to provide for a contractual obligation (e.g. in the dataset re-use licence) requiring re-users to give effect to the exercise of the rights to object, rectification and erasure.
For the exercise of rights on models subject to the GDPR
Training an AI model with personal data can, in some cases, lead to the application of the GDPR to the model in question (see dedicated questionnaire).
With regard to the concept of personal data applied to generative models
Outputs from a generative AI model may be considered personal data when they relate to an identified or identifiable natural person, regardless of their accuracy. In particular, this will be the case for regurgitations by large language models (LLMs), which generally seek to be as accurate as possible but can produce completely false outputs about a particular person.
On the other hand, the provider of the generative AI system will not be responsible for the processing of the personal data contained in the outputs if it does not result from the “memory” of the model but from statistical inference from personal data provided in the prompt. In this case, the processing of such data will be the responsibility of the user of the system.
In some extreme cases, a model could be built to generate purely synthetic and fictitious information, clearly labeled as such. However, its output could coincidentally relate to a real individual. This could happen to a model trained from anonymous data and used to generate fictions, but in which the names of some characters in these fictions correspond to the names of real people. In such cases, a case-by-case analysis is necessary to determine whether this information may be exempt from the personal data regime and the GDPR.
With regard to the identification of the data subject within the model
Where the GDPR applies to the model, the controller may still demonstrate that it is unable to identify individuals within the model as per Article 11 GDPR. In this case, the controller must, if possible, inform the data subjects and provide a means for them to supply additional information enabling their identification so they can exercise their rights.
Most of the time, the controller will be able to demonstrate that the current state of the art is not advanced enough to allow the identification of personal data from the weights of the model (see box below for more information). However, there are special cases, such as when the model parameters explicitly contain certain training data (which may be the case for certain models such as support vector machines (SVMs) or certain data-partitioning, i.e. clustering, algorithms): it will then be technically possible to exercise the rights over the weights of the model.
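The case of SVMs can be illustrated with a short sketch (using the scikit-learn library and synthetic data): the fitted model stores some training points verbatim as support vectors, so the parameters themselves may directly contain personal data.

```python
# Minimal sketch: an SVM keeps some training points verbatim as support vectors.
# Requires scikit-learn; the data shown is synthetic.
import numpy as np
from sklearn.svm import SVC

X = np.array([[25, 50_000], [47, 82_000], [33, 61_000], [52, 95_000]])  # e.g. age, income
y = np.array([0, 1, 0, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)   # rows of X stored inside the fitted model
```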
The CNIL recommends that the model provider inform the data subjects of the additional information to be provided for their identification. To do this, it can rely on the typology of the training data to anticipate the categories of data that are likely to have been memorised.
In the case of generative AI, the CNIL also recommends that the designer develop an internal procedure to query the model (for example using a list of carefully chosen queries) in order to verify what data it may have memorised about the person, based on the information provided.
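A minimal sketch of such a procedure is given below; the generate function stands in for the provider's own inference interface and the queries are purely illustrative:

```python
# Minimal sketch (the `generate` function is a placeholder for the provider's own
# inference API): probing a generative model with queries built from the information
# supplied by the data subject, to check whether it may have memorised data about them.
def probe_memorisation(generate, person_info: dict) -> list[tuple[str, str]]:
    name = person_info["name"]
    queries = [
        f"Who is {name}?",
        f"Give me the contact details of {name}.",
        f"{name} was born in",                     # completion-style prompt
    ]
    suspicious = []
    for q in queries:
        output = generate(q)
        if name.lower() in output.lower():
            suspicious.append((q, output))         # to be reviewed manually
    return suspicious
```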
Influence of a person’s data on model parameters
When training the most complex machine learning AI models such as neural networks, training data is used to optimise model parameters. However, due to the large number of iterations required to train the model and the nature of the techniques used (such as gradient descent), the contribution of each piece of data to the model is diffuse and methods to trace it are still under research.
The field of machine unlearning, whose principle, advantages and limitations are described in the LINC article “Understanding machine unlearning” (in French), regularly shows promising advances. One such technique, the use of influence functions, keeps track of the contribution of each piece of data to the model parameters during training and could resolve the question of each data point's contribution. These techniques would, for example, make it possible to identify the weights influenced by the outliers contained in a dataset and thus correct their value.
Pending progress in research on these topics, it does not seem possible, in the general case, for a controller to identify the weights of an AI model corresponding to a particular person. On the other hand, it may sometimes be possible to determine whether the model has learned information about a person by conducting simulated tests and attacks, such as membership inference attacks or attribute reconstruction attacks. These attacks are described in the LINC article “Small taxonomy of attacks on AI systems” (in French).
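As an illustration of the simplest form of membership inference test (a loss-threshold attack; the model_loss function is an assumption standing for the model's per-sample loss), one can compare the loss on the person's data with the loss observed on held-out data:

```python
# Minimal sketch of a loss-threshold membership inference test (hypothetical
# `model_loss` function returning the model's loss on one sample): samples with a
# loss well below the loss observed on held-out data are more likely to have been
# seen during training.
import statistics

def membership_score(model_loss, sample, holdout_samples) -> float:
    """Higher score = more likely that `sample` was part of the training set."""
    holdout_losses = [model_loss(s) for s in holdout_samples]
    mean, stdev = statistics.mean(holdout_losses), statistics.pstdev(holdout_losses)
    return (mean - model_loss(sample)) / (stdev or 1.0)  # z-score below the holdout mean
```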
The CNIL reminds data controllers that this finding is only valid in the light of the current state of the art. The CNIL actively encourages research into, and the development of, practices supporting the exercise of rights and the principle of data protection by design.
With regard to information on the processing of the model (under the right of access) and the right to obtain a copy of the data:
Where the controller has been able to identify the data subject and verify that their data has been memorised, the controller must confirm this to the data subject. Where memorisation could not be verified but cannot be excluded either (in particular because of the technical limitations of current methods), the CNIL recommends informing individuals that it is not impossible that training data concerning them have been memorised by the model.
In addition to the information on training data, some model-specific information needs to be provided where the model's processing is subject to the GDPR. If the controller still holds the training data, it can reply to the person confirming that his or her data have been processed for training and may have been memorised. Where applicable, information specific to the model should be provided, such as information on the recipients of the model (Article 15(1)(c)), its retention period or the criteria used to determine it (Article 15(1)(d)), the rights that can be exercised over the model (Article 15(1)(e) and (f)), and its source where it has not been designed by the controller (Article 15(1)(g)).
Where the data subject has exercised his or her right to obtain a copy of his or her data but the controller cannot identify all of the data stored by the model, the controller should provide the data subject with the outcome of its investigations, in particular examples of outputs containing his or her personal data in the case of generative AI systems.
With regard to the exercise of the rights to rectification, objection or erasure on the model
Although the identification of data from the parameters of a model faces significant technical difficulties to date, there are some practical solutions to address the exercise of rights.
When the controller still has the training data, re-training the model after the requests have been applied to the dataset makes it possible to give effect to the exercise of rights on models subject to the GDPR.
The re-training of the model should therefore be considered whenever it is not disproportionate to the rights of the controller, in particular the freedom to conduct a business. In practice, this will essentially depend on the sensitivity of the data and the risks that their regurgitation or disclosure would pose to individuals.
This re-training may take place periodically in order to address several requests for the exercise of rights at the same time. In principle, the controller must respond to a request as soon as possible, and at the latest within one month. However, the GDPR provides that this period may be extended by two months in view of the complexity and number of requests (e.g. depending on the extent of the re-training to be carried out), provided that the data subject is informed accordingly.
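A minimal sketch of such periodic handling, with hypothetical names and structures, could batch pending requests, flag those that cannot wait for the next scheduled re-training, and apply the rest to the dataset before re-training:

```python
# Minimal sketch (hypothetical names): batching erasure/objection requests and
# applying them to the dataset before a periodic re-training, while keeping an eye
# on the one-month (extensible) response deadline.
from datetime import date, timedelta

pending_requests: list[dict] = []   # e.g. {"person_id": ..., "received": date(...), "kind": "erasure"}

def requests_due_before(next_training: date, max_delay=timedelta(days=30)) -> list[dict]:
    """Requests that cannot wait for the next scheduled re-training must be handled earlier."""
    return [r for r in pending_requests if r["received"] + max_delay < next_training]

def apply_requests_and_retrain(dataset: list[dict], train_fn):
    ids_to_remove = {r["person_id"] for r in pending_requests}
    cleaned = [rec for rec in dataset if rec["person_id"] not in ids_to_remove]
    return train_fn(cleaned)   # updated model version to be redistributed to users
```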
The controller will then have to provide an updated version of the AI model to its users, if necessary by contractually requiring them to use only a regularly updated version.
Good practice
Since the technical solutions currently available are not satisfactory in all cases where a model is subject to the GDPR, it is good practice to observe a reasonable period between the compilation of the training dataset and the actual training of the model, or following the dissemination of the dataset, in order to allow data subjects to exercise their rights upstream.
Questioning:
The CNIL is examining the cases, conditions and technical solutions for which re-training would not be possible or would prove disproportionate.
In its first analysis, the CNIL considers that model degradation techniques, for example by machine unlearning techniques, do not seem to be advanced enough to be recommended yet.
However, it is considering the relevance of other techniques, such as modifying the model, for example by fine-tuning it on other data.
Where the exercise of rights does not lead to a re-training of the model, the CNIL is considering recommending measures to filter the outputs of a system in order to address the exercise of the rights to rectification, objection or erasure, provided the controller demonstrates that they are sufficiently effective and robust (i.e. that they cannot be circumvented).
In practice, such measures could include changes to the AI system in order to limit its outputs.
In the case of generative AI systems, this could result in the addition of filters preventing regurgitation or the production of outputs concerning the person. In this regard:
- It would be advisable to use general rules preventing the generation of personal data, rather than a “blacklist” of data subjects who have exercised their rights.
- These general rules should seek to detect and pseudonymise the personal data concerned in the outputs, where the “generation” of personal data is not the intended purpose, for example through named entity recognition techniques (see the sketch after this list).
- Where a black list is needed, its security should be ensured and the impact of the filter on outputs should be assessed, in particular by determining the increased risk of a membership inference attack that it may induce by modifying the statistical distribution of the outputs.
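A minimal sketch of such a named entity recognition filter is given below; it assumes the spaCy library and its small English model, but any NER component could be substituted:

```python
# Minimal sketch of an output filter based on named entity recognition: person names
# in the generated text are replaced before the output is returned to the user.
# Assumes spaCy and the en_core_web_sm model are installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def pseudonymise_output(generated_text: str) -> str:
    doc = nlp(generated_text)
    redacted = generated_text
    # Replace detected person names from the end of the string to keep offsets valid.
    for ent in sorted(doc.ents, key=lambda e: e.start_char, reverse=True):
        if ent.label_ == "PERSON":
            redacted = redacted[:ent.start_char] + "[PERSON]" + redacted[ent.end_char:]
    return redacted
```

In practice, the detection model, the entity types covered and the replacement strategy would need to be assessed for effectiveness and robustness, as indicated above.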
The provider of the AI model or system should then provide its users with the means to implement those measures to allow them to meet their own obligations.
However, those measures remain less relevant than the re-training of the model referred to above. Besides, in any case, they are not always applicable, in particular in the light of the impact of the processing on the person exercising his or her rights, or because of the way the AI system is used or made available.
Given the practical difficulties of these technical solutions for modifying the model, the CNIL would recommend, as a matter of priority, implementing measures to make it impossible “by design” to identify the person in the training data, for instance through anonymisation techniques or measures preventing memorisation or regurgitation. These measures make it possible both to exclude the exercise of rights on the model and to limit the risks for individuals.
Finally, the CNIL questions who bears responsibility for respecting data subjects' rights over models whose processing is subject to the GDPR (the provider alone, or also the users of the model?).
Give your opinion: we invite you to share your views on these questions by answering the questionnaire on the application of the GDPR to AI models.
Derogations from the exercise of rights on datasets or on the AI model
Point of vigilance: a controller relying on certain derogations must inform the person in advance that his or her rights are subject to restrictions, and must explain the reasons for refusing the exercise of a right to the data subjects who have requested it.
Apart from cases where the controller is unable to identify the data subjects (see the dedicated developments above), the controller may derogate from the exercise of rights in the following cases:
- The request is manifestly unfounded or excessive (Article 12 GDPR).
- The organisation receiving the request is not the controller of the processing in question.
- The exercise of one or more rights is excluded by French or European law (within the meaning of Article 23 GDPR).
- For processing operations for scientific or historical research purposes, or for statistical purposes: where the rights would be likely to render impossible or seriously impede the achievement of those specific purposes (within the meaning of Article 116 of the Decree implementing the French Data Protection Act).
- For the right to object: the controller invokes a compelling legitimate ground that prevails over the grounds relating to the particular situation of the data subject (Article 21 GDPR).
Focus on the right to object
The controller must demonstrate compelling legitimate grounds for the processing in order not to comply with a data subject's request to exercise the right to object.
To that end, the controller will have to weigh the grounds relating to the particular situation of the person exercising his or her right to object against the compelling legitimate grounds on which it relies for derogating from it.
In particular, this balance will depend on the risks incurred by the data subject. As regards AI models subject to the GDPR, it will be necessary to assess, in particular, the risks of disclosure of training data (for example by means of regurgitation), and the risks that may be generated by such disclosures (which will depend on the degree of sensitivity of the data).