The principle
The controller is the natural or legal person who determines the purposes and means of the processing, i.e. who decides on the “why” and “how” of the use of personal data.
The essential means of processing are those which are closely related to the purpose and scope of the processing, such as the type of personal data collected, the hardware and software used for the processing as well as their security, the duration of the processing, the categories of recipients and the categories of data subjects.
In practice
Several indicators can help identify, on a case-by-case basis, who is the controller of the processing.
A provider that initiates the development of an AI system and builds the training dataset from data it has selected on its own account may qualify as a controller.
The same applies to a provider that entrusts the creation of such a dataset to a service provider through sufficiently detailed, documented instructions (see the role of processor below).
In some cases, a provider will rely on a third party that has already created a dataset as a controller, on its own initiative. It is then necessary to identify the processing for which the provider itself is responsible, such as the re-use, on its own account, of a dataset that already exists.
Examples of controllers:
- A video streaming platform wants to develop a recommendation AI system. For this purpose, it reuses a dataset about its customers that was originally collected to provide the service.
The streaming platform that creates the training dataset is responsible for this new processing, since it has decided on the purpose (training a recommendation AI system) and the essential means of processing (i.e. the dataset it has already collected for another purpose).
- The provider of a conversational agent that trains its large language model (LLM) on data publicly available on the Internet is the controller of the reuse of that publicly available personal data. Indeed, the provider decides both the purpose (offering a conversational agent) and the essential means of processing (selecting the data to be re-used).
- A provider develops an AI system based on a model pre-trained on personal data. The provider intends to retrain or adjust the model (through fine-tuning or transfer learning) with a dataset that it has built on its own initiative. In such a case, that provider must be classified as a controller, provided that it pursues a purpose of its own and itself determines the essential means of that processing.
Reuse of data collected by another organisation
When the provider trains its AI system with data collected by another entity, it is necessary to distinguish:
- the data diffuser: the natural or legal person, public or private, who puts personal data, or a dataset that contains personal data, online;
- the re-user of the data: the natural or legal person, public or private, who processes such data or datasets with the intention of using them on its own account.
The diffuser and the re-user of the data are, in principle, responsible for separate processing, since each determines the objectives and the essential means of its own processing.
The data diffuser is, in principle, responsible for the public dissemination, while the provider of the AI system that re-uses the data is responsible for its own use of that data. The diffuser is not, in principle, responsible for the re-use of its data. It may, however, set conditions of use for the data it disseminates, in order to limit re-use or to impose certain requirements.
Example:
An administration makes real estate data publicly available and freely reusable (open data). A company wants to reuse this data to create a training dataset in order to develop an AI system able to predict certain real estate developments in a given area. The diffuser and the re-user are then responsible for separate processing operations, provided that the two operations are independent of each other.
Find out more: Sheet 1 of the guide on the opening and reuse of publicly accessible data.