Relying on the legal basis of legitimate interests to develop an AI system

02 July 2024

Controllers will most commonly rely on their legitimate interests for the development of AI systems. However, this legal basis cannot be used without respecting its conditions and implementing sufficient safeguards.

This content is a courtesy translation of the original publication in French. In the event of any inconsistencies between the French version and this English translation, please note that the French version shall prevail.

Legitimate interest is one of the six legal bases provided for in Article 6 GDPR.

It is often adapted for the development of AI systems by private bodies, especially when the database used is not based on the consent of individuals (it is often difficult to obtain the consent of individuals at a large scale or when personal data are collected indirectly).

With regard to public bodies, they may rely on their legitimate interests when a public authority wishes to develop an AI system only when the activities concerned are not strictly necessary for the performance of its specific tasks but for other activities legally implemented (such as, for example, human resources management processing).

For more information: on the use of legitimate interest by a public body, see in particular the use case illustrated in “How to choose the legal basis for processing? Practical cases with certain treatments implemented by the CNIL” (French version)

Reliance on legitimate interests is, however, subject to three conditions:

The interest pursued by the body must be “legitimate”;
The processing must fulfill the condition of “necessity”;
The processing must not disproportionately affect the rights and interests of the data subjects, taking into account their reasonable expectations. It is therefore necessary to “balance” the rights and interests at stake in the light of the specific conditions for its implementation.

The controller is bound to examine the compliance of its processing with these three conditions. To this end, it is recommended, as a good practice, to document it. In any event, where a DPIA is necessary, the safeguards provided to limit the possible impacts on the rights of individuals must be described by the controller (see the how-to sheet “Carrying out a data protection impact assessment when necessary”).

Other legal bases may also be considered for the development of AI systems (see the how-to sheet “Ensuring the lawfulness of the data processing - Defining a legal basis”).

First condition: the interests pursued must be “legitimate”

Legitimate interests can be broadly defined. There is no exhaustive list of interests considered legitimate but the interests pursued may be presumed as legitimate if they are cumulatively :

manifestly lawful under the law;
determined in a sufficiently clear and precise manner;
real and present (i.e. non-hypothetical or proven) for the organism concerned.

The more precisely the interests pursued are defined, the narrower the scope of the risks to be taken into account. It is also essential that a clear and precise interest is brought to the attention of data subjects with regards to the controller’s transparency obligations.

Thus, the following interests could be considered a priori legitimate for the development of AI systems:

carry out scientific research (in particular for bodies which cannot rely on the public interest mission);
facilitate public access to certain information;
develop new systems and functionalities for users of a service;
offer the service of a conversational agent to assist users;
improve a product or service to increase its performance;
develop an AI system to detect fraudulent content or behaviour.

The commercial purpose of the development of an AI system is not in itself contradictory to the use of the legal basis of legitimate interests.

Conversely, certain interests cannot be regarded as legitimate, in particular where the AI system envisaged has no connection with the mission and activity of the body or where it cannot be lawfully deployed.

Example: targeted advertising based on profiling to minors is prohibited by Article 28.2 of the Digital Services Act (DSA). An AI system intended to carry out automatic profiling of minors with a view to targeting them with advertising cannot therefore be legally deployed. The interest in developing such a system cannot therefore be regarded as lawful.

Please note: more generally, the development of systems which are categorically prohibited by regulations other than the GDPR cannot be regarded as legitimate. In this regard, particular attention should be paid to the AI-specific categorisation used in the European AI Act. It provides for a ban on the placing on the market, the putting into service or the use of certain AI systems. The development of systems which are exclusively intended for such uses cannot, therefore, be considered lawful and it will not be possible to mobilise the legitimate interest, or any other legal basis, to carry out the processing. It will be up to the controller to comply with future regulations and to keep up to date with future developments.

Second condition: the processing must be "necessary"

Third condition: ensure that the objective pursued does not threaten the rights and freedoms of individuals

In order to rely on legitimate interests, the legitimate interests pursued shall not disproportionately impact the interests, rights and freedoms of data subjects.

The controller must therefore balance their legitimate interests against the data subject’s interests, rights, and freedoms. To do this, the controller must measure the benefits of its treatment (anticipated benefits, including those presented below) but also the impacts on individuals. If necessary, additional measures must be put in place to limit these risks and protect the rights and freedoms of individuals.

The benefits provided by the AI system help justify the processing of personal data

The greater the anticipated benefits of the processing, the more likely the legitimate interest of the controller is to prevail over the rights and freedoms of individuals.

The following factors make it possible to measure the positive impact of the interests pursued:

The extent and nature of the expected benefits of the processing, for the controller but also for third parties, such as the end-users of the AI system or the interest of the public or society. The diversity of applications implementing AI systems shows that there can be many benefits, such as improved healthcare, better accessibility of certain essential services, facilitation of the exercise of fundamental rights such as access to information, freedom of expression, access to education, etc.

Example: A voice recognition system that allows users to automatically transcribe their words and, for example, helps with filling in administrative forms, can have significant benefits in making certain services accessible to people with disabilities. The importance of those benefits may be taken into account in the balancing of interests when developing such a system.

An interest of an exclusively commercial nature, for the sole benefit of the controller, has fewer benefits than a scientific research objective, which can benefit the scientific community as a whole: it will generally be necessary to provide for more measures to protect data subjects’ rights for the former than for the latter.

However, a commercial interest may converge, to a certain extent, with a public interest. In general, the fact that a controller acts not only in his own interest but also in the interest of the community, may give more “weight” to that interest.

Example: A private company wants to develop an AI system to combat online real estate fraud. The commercial interest it pursues is reinforced by the convergence with the interest of users and the interest of the community to reduce fraudulent activities.
The usefulness of the processing carried out in order to comply with other legislation.

Example: the provider of a very large online platform or search engine that develops an AI system to better meet the provisions of Article 35.1 of the Digital Services Act (DSA) on the adaptation of online content moderation processes, may take this objective into account when assessing its interest.
The development of open-source model, which, provided sufficient safeguards are in place (see the dedicated focus sheet and article) may have significant benefits for the scientific community, research, education and the adoption of these tools by the public. It may present benefits in terms of transparency, reduction of bias, accountability of the AI system provider or peer review. This may reflect the controller’s objective of sharing the benefits of its processing to participate in the development of scientific research.
The specification of the interests pursued: the more precisely an interest is defined, the more weight it will be have in the balancing exercise, since it makes it possible to specifically apprehend the reality of the benefits to be taken into account. Conversely, an interest defined too broadly (e.g. “providing new services to its users”) is less likely to prevail over the interests of individuals.

Negative impacts on people need to be identified

Those benefits must be balanced against the impact of the processing on data subjects. Specifically, the controller must identify and assess the consequences of all kinds, potential or actual, that the development of the AI system and its subsequent use could have on the persons concerned: their privacy, data protection and other fundamental rights (freedom of expression, freedom of information, freedom of conscience, etc.) as well as other concrete impacts of the processing on their situation.

The actual impacts of the processing on people, as listed below, are to be assessed according to the likelihood that the risks materialise and the severity of the consequences, which depend on the particular conditions of the processing, as well as the AI system developed.

To this end, it is necessary to take into account the nature of the data ( sensitive data, highly personal data), the status of the data subjects (vulnerable persons, minors, etc.), the status of the company or administration developing and/or deploying the AI system (the effects may be multiplied in the event of very wide use of AI), the way in which the data are processed (data crossing, etc.) or the nature of the AI system and the intended operational use. In some cases, the impact on individuals will therefore be limited, either because the risks are low or because the consequences present little severity with regard to the data used, the processing carried out and the interest pursued (for example, the development of an AI system used for the personalisation of an auto-completion feature of a text editor software presents little risk for data subjects).

The following impacts on people should therefore be considered and the level of associated risks should be assessed in the case at hand. Three types of risks can be distinguished:

1. Impacts on individuals related to the collection of data used to develop the system, in particular where data have been scraped online

Risks of infringement of privacy and rights guaranteed by the GDPR: the use of scraping can lead to significant impacts on individuals, due to the large volume of data collected, the large number of data subjects, the risk of collecting data relating to the privacy of individuals (e.g. use of social networks) or even sensitive or highly personal data, in the absence of sufficient safeguards. These risks are all the more important as they may also concern the data of vulnerable persons, such as minors, who need to be given particular attention and informed in a sufficiently appropriate manner.
Risks of illegal collection: certain data may be protected by specific rights, in particular intellectual property rights, or their re-use subject to the consent of individuals.
Risks of undermining freedom of expression: an indiscriminate and massive collection of data and their absorption in AI systems which could potentially regurgitate them is likely to impact the freedom of expression of data subjects (a feeling of surveillance that could lead internet users to self-censor, especially given the difficulties in preventing publicly available data from being scraped), whereas the use of certain platforms and communication tools is necessary on a daily basis.

For more information: see the focus sheet on the measures to implement in case of data collection by web scraping.

2. Impacts on individuals related to model training and data retention

Risks of loss of confidentiality of the data contained in the dataset or in the model: the risks related to the security of training datasets are likely to increase the risks for data subjects linked to misuse, in particular in the event of a data breach, or the risks related to attacks specific to AI systems (attack by poisoning, backdoor insertion or model inversion).

For more information: see the LINC article “Small taxonomy of attacks on AI systems” (in French).

Risks related to the difficulty of ensuring the effectiveness of the data subject rights, in particular due to technical obstacles to the identification of data subjects or difficulties in transmitting requests for the exercise of rights when the dataset or model is shared or available in open source. It is also complex, if not technically impossible, to guarantee data subject rights on certain objects such as trained models.
Risks associated with the difficulty of ensuring transparency towards data subjects: these risks may result from the technicity inherent to these topics, rapid technological developments, and the structural opacity of the development of certain AI systems (e.g. deep learning). This complicates the availability of intelligible and accessible information on the processing.

3. Impacts on persons related to the use of the AI system

Certain risks, which may arise when using the AI system, must be taken into account at the training stage because of their systemic nature. It is indeed necessary to anticipate, as soon as the design phase, the guarantees necessary to effectively limit these risks. These risks depend on the uses of the AI system. As a general matter, the following risks can be mentioned in particular:

The risks of memorization, regurgitation or generation of personal data when using certain AI systems, which may invade privacy. It is possible in some cases to infer, accidentally or by attacks (membership inference, extraction or model inversion), personal data contained in the learning database from the use of AI systems (see in particular the LINC article “Small taxonomy of attacks on AI systems” - in French). This presents a risk for privacy of data subjects whose data might appear when using the AI system (reputational risk, security risk depending on the nature of the data, etc.).
Risks of reputational damage, spread of false information or identity theft, where the AI system (particularly generative AI) produces content on an identified or identifiable natural person (e.g. a generative image AI system may be used to generate false pornographic pictures of real persons whose images are contained in the dataset). Note that this risk can also occur with AI systems that have not been trained with personal data.

Example: a news article generated by an AI system may present defamatory information about a real person, although the dataset does not contain information about that person, in particular where the text was generated at the request of a user who specifies the identity of the data subject in the prompt.
Risks of infringement of certain rights or secrets provided for by law (e.g. intellectual property rights, such as copyright, business or medical secrecy) in the event of memorisation or regurgitation of protected data.

Example: a text-generative AI system trained on copyright-protected literary works may generate content that constitutes infringement, in particular where that content results from the regurgitation of the content that would have been memorised by the AI system.
Serious ethical risks, which may impact certain general legal principles or the proper functioning of society as a whole, related to the development of certain AI systems. These risks must be taken into account in the assessment (e.g. discrimination, the safety of people in case of malicious use, incitement to hatred or violence, disinformation, which may undermine the rights and freedoms of individuals or democracy and the rule of law). The development of AI systems, especially for general use, can thus harm certain fundamental rights and freedoms during the deployment phase if safeguards are not anticipated by design (e.g. amplification of discriminatory biases in the training database, lack of transparency or explainability, lack of robustness or automation biases, etc.).

Reasonable expectations of individuals are a key factor in assessing the legitimacy of treatment.

The controller must take into account the reasonable expectations of the data subjects when assessing the impact of the processing on individuals. Relying on legtimate interests requires that individuals are not surprised by the modalities of implementation as well as the consequences of the processing.

Reasonable expectations are a contextual aspect that the controller must consider when balancing the rights and interests at stake. To this end, information to individuals may be taken into account to assess whether data subjects can reasonably expect their data to be processed. However, it will only be an indicator.

With regards to the development of AI systems, certain processing are likely to exceed the reasonable expectations of individuals, including the following processing:

In case of re-use of data published on the internet: given the technological developments of the last decade (big data, new AI tools, etc.), people may be aware that some of the data they publish online may be collected and reused by third parties. However, it is not possible to consider that they can expect such processing to take place in all situations and for all types of data accessible online concerning them. In particular, it is necessary to take into account:
- the nature of the websites where the data has been collected (social networks, online forums, dataset dissemination websites, etc.)
- the restrictions imposed by these websites, for example in the general terms of use;
- the type of publication (for example, an article published on a freely accessible blog is not private, whereas a post on a social network published with access restrictions may be considered private and for which the internet user is less aware of being exposed to collection and reuse by third parties).
It may be difficult to understand the variety of potential uses of a dataset or a model, especially when it is distributed or shared. However, some of those uses may fall outside people’s reasonable expectations, in particular in the case of unlawful reuse, in so far as data subjects could not expect their data to allow the development of AI systems reused for certain purposes.

Example: data subjects could not expect their data to be used to develop an open source image classification model, which is then used to classify people according to their sexual orientation.

Introduce additional measures to limit the impact of processing

The organisation may provide for compensatory or additional measures in order to limit the impact of the processing on the data subjects. These measures will often be necessary to reach a balance between the rights and interests at stake and allow the controller to rely on this legal basis.

These measures are in addition to those necessary to comply with other obligations laid down by the GDPR, and should not be confused with them: compliance with these provisions is imperative, regardless of the legal basis for the processing (data minimisation, data protection by design and by default, data security, etc., see the dedicated how-to sheets). Compensatory measures consist of obligations fulfilled in a “premium” manner, as thorough as possible, or additional safeguards to the requirements of the GDPR.

They may be of a technical, organisational or legal nature and must be able to limit the risk of harm to the interests, rights and freedoms previously identified.

The following measures have been identified as relevant to limit the impact on data subject rights and freedoms. They must be adapted to the risks posed by the different processing of the development phase.

1. In response to the risks associated with the collection and compilation of the dataset:

Anonymise at short notice or, failing that, pseudonymise the data collected. In some cases, anonymisation will be necessary, where anonymised data is sufficient to achieve the purposes defined by the controller.

Example: If a company wants to build a dataset from comments accessible online to develop an AI system to assess the satisfaction of customers who have purchased its products, pseudonymisation of data collected at short notice after collection may be an additional measure to limit the risks associated with data collection that may reveal a lot of information about the person making the comments.
Where it does not adversely affect the performance of the model developed, synthetic data should be used. This may also have a number of advantages, such as making certain data available or accessible and modelling certain specific situations, avoiding the use of real data, particularly sensitive data, increasing the volume of data for training or minimising the risks associated with data confidentiality, etc. It should be borne in mind that synthetic data are not systematically anonymous.

Example: if a supplier wishes to develop an image classification system that automatically detects the carrying or use of a weapon, the use of computer-generated images allows, for example, to avoid the collection of data likely to presume the commission of an offence, to help varying the possible configurations or to improve the representativeness of the dataset, in particular because of the possibility of describing the characteristics of the synthetic image of the person (size, weight, skin colour, etc.) and of the weapon to be detected (shape, colour, etc.).

For more details on the measures to take in case of web scraping, see the dedicated focus sheet.

2. In response to risks related to model training and data retention

Implement technical, legal and organisational measures in addition to the obligations laid down in the GDPR in order to facilitate the exercise of rights:
- Provide a discretionary right to object and, when collecting the data directly from data subjects, before the processing takes place, in order to strengthen the control of individuals over their data. This is particularly relevant when an organisation deploys an AI system and intends to reuse usage data for the purpose of improving its system. To ensure that this guarantee is fully effective, individuals must be able to object to such processing without impacting their use of the service. Where data have been collected indirectly, technical and organisational measures should be considered to retain metadata or other information on the source of the collection in order to facilitate the search for a person or data within the dataset.
- Implement measures that ensure and facilitate the exercise of individuals’ rights when the model is subject to the GDPR. These measures might include observing a reasonable period of time between the dissemination or collection of a training dataset and its use (especially when the exercise of rights on the model is difficult) and/or providing for periodic re-training of the model to take into account a data subject’s rights request when the controller still has the training data.
- Where the model is shared or disseminated in open source, identify and implement measures to ensure the transmission of the exercise of rights through the chain of stakeholders, in particular by providing in the general conditions for the obligation to pass on the effects of the exercise of rights of opposition, rectification or erasure to systems developed subsequently. In the most risky cases, ensure traceability of downloads of models available in open source (e.g. by keeping contact information of the persons or organisations downloading them) to allow the exercise of rights to be reflected along the chain of actors.
Ensure greater transparency regarding the processing carried out for the development or improvement of the AI system. Beyond the information required by Articles 13 and 14 of the GDPR, implementing measures that provide increased transparency for AI systems can help mitigate some of the risks associated with their technological specificity, lack of knowledge about their functioning, etc.
Implement measures and procedures to ensure a transparent development of the AI system, in particular to enable auditability during the deployment phase (documenting the entire development process, logging activities, managing and monitoring the different versions of the model, recording parameters used, and conducting and documenting evaluations and tests). This may also be necessary to prevent automation or confirmation bias during deployment.
Ensure effective peer review of the model’s development, for example by consulting or including expert researchers in the development process or by developing the model in open source. Depending on how it is opened and provided that sufficient guarantees are in place, open-source development can offer various advantages (see the focus sheet on the CNIL website).
Considering the severity and likelihood of the risks identified, establish an ethics committee, or, depending on the size and resources of the organization, an ethics referent, to work with AI systems providers, in order to take into account ethical issues and the protection of the rights and freedoms of data subjects, upstream and throughout the development of those systems (for more information, see the how-to sheet “Taking into account data protection when designing the system”).

3. In response to the risks associated with the use of the AI system

Implement measures to prevent the storage, regurgitation or generation of personal data, especially in the context of generative AI systems.

For more information: see the dedicated questionnaire and the how-to sheet “Respecting and facilitating the exercise of individuals’ rights”.
For general purpose AI systems, mitigate the risk of unlawful reuse by implementing technical measures (e.g. digital watermarking of AI-generated outputs to prevent the use of the system for misleading purposes or restricting functionalities by excluding by design those that could lead to unlawful uses) and/or legal measures (e.g. including contractual prohibitions against certain unlawful or unethical uses of the dataset or AI system, which data subjects could not reasonably anticipate).
Implement measures to address significant serious ethical risks.

For instance, ensure the quality of the training dataset to reduce the risk of discriminatory bias during usage, in particular by ensuring data representativeness and identifying and correcting biases in the dataset or resulting from the annotations (see the “Annotating data” how-to sheet).

For more information: a how-to sheet on detecting and reducing bias will be published at a later date.

Steps	Risks	Measures
Data collection	In particular for web scraping, in cases where this does not fall within the reasonable expectations of persons: - Invasion of privacy - Impact on freedom of expression	Dedicated measures (see focus sheet on web scraping)
Model training and data retention	Loss of confidentiality of training data	- Anonymisation / pseudonymisation of data during collection - Use of synthetic data
	Invasion of privacy and loss of confidentiality related to data memorization/regurgitation in the model	Limiting the risks of memorization, regurgitation and generation of personal data (see questionnaire on the application of the GDPR to AI models)
	Lack of transparency and opacity of processing	- Increased transparency - Transparent development of the AI system and its auditability - Effective peer review of model development (see the focus sheet on open source)
	- Difficulty in guaranteeing the exercise of rights	- Facilitation of the exercise of rights - Discretionary right to object - Reasonable time between the creation of the training dataset and its use; - Transmission of the exercise of rights
Use of the AI system	- Invasion of privacy and loss of confidentiality related to data memorisation/regurgitation in the model - Damage to reputation - Regurgitation of protected data	- Limiting the risks of memorization, regurgitation and generation of personal data (see the questionnaire on the application of the GDPR to AI models)
	- Discriminatory biases	- Ensuring the quality of the dataset - Annotation as soon as the dataset is created - Application of filters in the deployment phase
	- Unlawful reuse	- Reuse licenses - Digital watermarking