Annotating data
The data annotation phase is crucial to ensuring the quality of the AI model. Meeting this challenge requires a rigorous methodology that guarantees both the performance of the system and the protection of personal data.
The data annotation phase is a decisive step in the development of a quality AI model, both for performance and for the respect of people’s rights. This step is central in supervised learning, but it also serves to provide a validation dataset in unsupervised learning. It consists of assigning a description, called a label, to each data item; this label serves as the “ground truth” for the model, which must learn to process, classify or discriminate data based on that information. The annotation may relate to all types of data, personal or not, and contain all types of information, personal or not. Annotation can be human, semi-automatic or automatic. It can be an independent process, or result from existing processes in which data characterisation has already been performed for a certain need and is then reused for training AI models (as in the medical diagnosis use case described below). In some cases, AI training will be based on existing data and annotations. This sheet, as well as those on data protection during the design of the system and the collection of data, will then have to be applied. The scope of this sheet covers all the cases mentioned above where the annotation relates to or contains personal data.
Examples of annotations:
- In order to train a speaker recognition AI model integrated into a voice assistant, voice recordings are annotated with the identity of several speakers;
- In order to train a fall detection AI model integrated into the video surveillance system of a nursing home, images are annotated with the position of the persons represented according to several labels such as “standing” or “lying down”;
- In order to train an AI model for license plate recognition embedded in an access gate to a private space, images are annotated with the position of the pixels containing license plates;
- In order to train an AI model for predicting the risk of a certain pathology, intended to be used as a diagnostic aid by healthcare staff in a hospital, the blood results of patients are annotated with the diagnosis made by a doctor on the pathology in question.
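To make this concrete, here is a minimal sketch, in Python, of what annotated records could look like for the fall detection example above. The field names and the closed set of labels are hypothetical illustrations, not a prescribed format; real annotation formats (COCO, Pascal VOC, etc.) differ.

```python
# Illustrative sketch only: possible structure of annotated records for the
# fall-detection example. Field names and label set are hypothetical.
from dataclasses import dataclass

ALLOWED_LABELS = {"standing", "lying_down"}  # closed set of labels chosen in the protocol


@dataclass
class AnnotatedImage:
    image_path: str    # reference to the training datum
    person_box: tuple  # (x, y, width, height) locating the person in the image
    label: str         # the "ground truth" the model must learn to predict

    def __post_init__(self):
        if self.label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {self.label}")


dataset = [
    AnnotatedImage("frames/cam1_0001.png", (120, 80, 60, 150), "standing"),
    AnnotatedImage("frames/cam1_0002.png", (100, 200, 150, 60), "lying_down"),
]
```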
The stakes of annotation for people’s rights and freedoms
When it relates to personal data, the annotation must be done in compliance with the GDPR. It generally pursues the same purpose as the one defined for the processing and must comply with the principles laid down by the GDPR. In view of the risks for individuals in both the development and deployment phases, the CNIL wishes to draw the attention of stakeholders using annotation to the principles of minimisation, accuracy and fairness.
The principle of minimisation
Minimisation consists of processing only data “adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed” (Article 5.1.c GDPR). In practice, this means that the annotations, like the training data to be annotated, must be limited to what is necessary for the training of the model, as described in the sheet “Taking data protection into account in data collection and management”.
Annotations containing information that is not relevant to the intended purpose do not comply with the principle of minimisation. In some cases, information indirectly related to the purpose may be useful for improving the performance of the model (such as images of billboards for training a traffic sign identification model, as this addition should help avoid certain false positives). Information is relevant if its link to model performance is proven (theoretically or empirically, notably in scientific publications) or sufficiently plausible.
Annotated datasets used by the developer of an AI system following a previous collection, purchase or download (from an open source or otherwise) should contain only annotations relevant to the functionalities of the system. Where this is not technically possible, the controller must be able to justify it and try to use the most relevant annotated dataset. A triage of the data must then be carried out to limit the retained annotations to those that are relevant.
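As an illustration of such a triage, here is a minimal sketch, in Python, that restricts a reused annotated dataset to the annotation fields deemed relevant to the system’s purpose. The field names and the example record are hypothetical.

```python
# Illustrative sketch only: triage of a reused annotated dataset, keeping only
# the annotation fields relevant to the intended purpose. Field names are hypothetical.
RELEVANT_FIELDS = {"image_path", "person_box", "posture"}  # defined from the system's purpose


def triage(record: dict) -> dict:
    """Drop annotation fields that are not necessary for training the model."""
    return {key: value for key, value in record.items() if key in RELEVANT_FIELDS}


raw_record = {
    "image_path": "frames/cam1_0001.png",
    "person_box": (120, 80, 60, 150),
    "posture": "standing",
    "resident_name": "…",  # irrelevant to fall detection: removed by the triage
    "religion": "…",       # sensitive and irrelevant: removed by the triage
}
minimised_record = triage(raw_record)  # only the three relevant fields remain
```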
Where the activity of an organisation consists of building training datasets on behalf of third parties, two situations must be distinguished:
- either, as recommended, the training set is created or configured specifically for the customer’s needs; the provider of the training dataset then acts as a processor for its customer for the application of the GDPR; the processor must ensure that the dataset contains only relevant annotations;
- or the supplier makes available already-constituted training sets; it must then have designed its product in such a way as to enable compliance with the principle of minimisation; one solution may be to provide several categories of annotations, separable or cumulative, or to offer several distinct sets according to the types of annotation.
In any case, the training datasets must have been constituted and made available in compliance with the GDPR.
The CNIL’s recommendations on data minimisation made in the sheets “Taking data protection into account in the design choices of the system” and “Taking data protection into account in data collection and management” are applicable.
Annotations may include contextual elements that are useful for measuring and correcting errors and biases. In the case of probabilistic systems such as the majority of AI systems, performance management relies on the ability to measure and correct errors and biases most likely to impact system efficiency. It may therefore be relevant to annotate training, testing or validation data with context elements (date and time, weather, etc.), in particular to measure possible performance deviations depending on the situation. On the other hand, particular attention must be paid to annotation with personal data such as a person’s name or sensitive data (religion, skin colour, etc. – see below).
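To illustrate how such contextual annotations can be used, here is a minimal sketch, in Python, that measures accuracy separately for each value of a hypothetical context field (here, “lighting”) in order to detect performance deviations depending on the situation. The field names and example values are illustrative assumptions.

```python
# Illustrative sketch only: measuring accuracy per context value (here a
# hypothetical "lighting" annotation) to detect performance deviations.
from collections import defaultdict


def accuracy_by_context(examples, context_key="lighting"):
    """examples: dicts containing 'label', 'prediction' and a context field."""
    correct, total = defaultdict(int), defaultdict(int)
    for example in examples:
        context = example[context_key]
        total[context] += 1
        correct[context] += int(example["prediction"] == example["label"])
    return {context: correct[context] / total[context] for context in total}


validation_results = [
    {"label": "lying_down", "prediction": "lying_down", "lighting": "day"},
    {"label": "lying_down", "prediction": "standing", "lighting": "night"},
    {"label": "standing", "prediction": "standing", "lighting": "night"},
]
print(accuracy_by_context(validation_results))  # e.g. {'day': 1.0, 'night': 0.5}
```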
The principle of accuracy
Accuracy requires the data to be accurate and, where necessary, kept up to date. This principle implies that the annotation must contain only accurate information about the person to whom the data correspond. An incorrect annotation, or one based on inappropriate or arbitrary criteria, will not respect the principle of accuracy. In practice, this means that the developer will have to take appropriate steps to ensure that the annotation criteria are objective. This issue is all the more important because the annotation is usually a single word, or a short expression, which is not enough to fully describe a person. The risk that this annotation is perceived as degrading by the persons concerned should not be overlooked, especially since the system could later reproduce the inaccuracies of the annotation during deployment and lead to inaccurate, or even degrading or discriminatory, outputs.
Other principles that are less specific to annotation, such as fairness, transparency, confidentiality or integrity, also apply.
Ensure the quality of the annotation
The CNIL invites stakeholders to implement the following measures:
- Defining an annotation protocol, in accordance with the principles of accuracy and minimisation. The CNIL recommends the following steps:
- The choice of annotation labels. They should be adapted to the intended purpose for the deployment of the system, and limited to information relevant for training. Although this purpose is not always known precisely when designing the model, in particular for foundation models, the labels chosen should correspond to the expected functionalities at the end of the learning process. For more information on the purpose of foundation models and general-purpose systems, please refer to the “Defining a purpose” sheet. In addition, they must allow an objective and unambiguous annotation. Since these labels serve to characterise a person’s data, their choice must be made in a manner that is fair towards the persons whose data are annotated and, in particular, exclude any degrading or depreciative term and any value judgement that may damage the reputation of persons. Since annotations may serve as proxy values for other information about the person, such as special categories of data, particular attention must be paid to ensuring that they do not lead to the unintended introduction of bias, and possible discrimination, into the system. Where the annotation results from an existing business process, a phase of triage or requalification of labels may be recommended in order to limit the annotations to what is necessary and relevant for the training of the AI model.
- The definition of an annotation procedure. It should:
- be documented;
- provide for clear assignment of tasks, thus limiting access to data only to authorised persons;
- allow persons performing the annotation to provide feedback on the annotation protocol, and in particular on labels and data, in order to identify when the protocol can be improved or is inappropriate;
- include a validation phase to confirm the choice of labels and the functioning of the procedure, during which, for example, the inter-annotator agreement will be evaluated when several persons perform the annotation (a minimal sketch is given after this list of measures). An in-depth analysis of a random sample of annotated data will help identify errors, ambiguities or recurrent inaccuracies during this phase;
- be tracked by logging changes, or using a versioning tool;
- rely on a reliable, robust and controlled annotation tool. Many annotation tools exist, often specific to certain types of data (images, text, sound, tabular data); it is recommended to check their security and their relevance to the intended purpose, in particular when they incorporate a semi-automatic annotation feature. Etalab’s guide to preparing and conducting an annotation campaign proposes several criteria for selecting the text annotation software most appropriate to one’s situation (some of these criteria remain relevant for the annotation of other types of data).
In the event that the annotation results from a business process, the latter must incorporate the previous recommendations and make the annotation of the data a fully-fledged objective.
- The definition of a continuous verification procedure. This procedure to control the quality of the annotation should be implemented shortly after the start of the annotation phase and then continue through regular or continuous checks. It should be documented and may, for example, be based on:
- discussion panels including the annotation team, the system development team, and system users when known;
- the analysis of random samples of annotated data (see the sketch after this list of measures);
- an internal or external audit;
- an analysis of the relevance of annotations for each new use case requiring training data (a set of images created to train a vehicle recognition algorithm should be reviewed before being used for pedestrian detection, for example);
- a procedure for taking into account feedback from users of the dataset or of the trained model on the quality of the annotation and on the corrections to be made;
- the quality control procedure provided for in business processes, which will have to be adapted in order to include the quality of the annotation as an objective in its own right.
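As a minimal sketch of two of the quality checks mentioned above, the following Python code computes the inter-annotator agreement (Cohen’s kappa, via scikit-learn) on doubly-annotated items and draws a random sample of annotated records for manual review. The threshold mentioned and the sample size are arbitrary examples, not recommendations.

```python
# Illustrative sketch only: two quality checks on annotations.
# Requires scikit-learn for Cohen's kappa; thresholds and sample size are arbitrary.
import random

from sklearn.metrics import cohen_kappa_score


def inter_annotator_agreement(labels_a, labels_b):
    """Cohen's kappa between two annotators on the same items (1.0 = perfect agreement)."""
    return cohen_kappa_score(labels_a, labels_b)


def review_sample(annotated_records, k=50, seed=0):
    """Draw a reproducible random sample of annotated records for manual review."""
    rng = random.Random(seed)
    return rng.sample(annotated_records, min(k, len(annotated_records)))


# A low kappa (e.g. below ~0.6) is often read as a sign that labels or
# instructions are ambiguous and that the protocol should be revised.
kappa = inter_annotator_agreement(
    ["standing", "lying_down", "standing", "lying_down", "standing"],
    ["standing", "lying_down", "standing", "standing", "standing"],
)
```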
- Involve an ethics referent or committee, as a good practice, upstream of and then throughout the annotation phase. The multidisciplinary and objective nature of this committee will make it possible to:
- Choose the best option for data annotation, be it through in-house processing, subcontracting (and the choice of the subcontractor), or using a solution that does not require annotation (use of an existing set or synthetic data for instance);
- Establish an annotation protocol, in particular to select and define the labels used for the annotation;
- Verify the application of the annotation protocol;
- Monitor the quality of the annotations and their suitability for the intended task in the deployment phase.
In each of these tasks, the detailed recommendations of this sheet should be taken into account by the ethics committee. Best practices concerning the objectives and composition of this committee can be found in the sheet “Taking data protection into account in the design choices of the system”. It should be noted that the setting up of an ethics committee must be adapted to the structural constraints of the organisation: for less resourced structures, an ethics referent will be able to play the role of the committee.
In the case of an annotation resulting from a business process, these measures must be integrated into that process (for example, if information collected as part of the business procedure, such as a medical diagnosis, is subsequently reused for training).
Information and the exercise of rights
Persons must be informed of the annotation operations
The information provided to individuals whose data are collected, whether provided individually or collectively, must mention the annotation phase of the data. In addition to the information to be provided in accordance with the GDPR, it is recommended, as a good practice, to increase transparency by providing the following information:
- the purpose of annotation, such as identifying people in an image, or matching a patient’s medical diagnosis with their consultation notes.
- the organisation in charge of the annotation, whether it is a team formed by the controller, a processor, or a community of collaborators. Where a processor whose teams are located outside the European Union is used, the information must specify the existence of transfers outside the EU. The use of a processor must also be subject to contractual clauses such as those proposed on the CNIL's “Standard contractual clauses between controller and processor” web page.
- the criteria of corporate social responsibility met under the contract linking the persons responsible for the annotation to the controller, such as guarantees concerning working conditions, remuneration, or psychological support where the annotation relates to data that may contain shocking content.
- the security measures taken, and in particular those concerning the annotation phase.
Once the annotation has been completed, and where it is possible to inform the persons concerned a posteriori, they may, with a view to transparency, be informed of the results of the annotation and in particular of the label assigned to their data. This may be a good practice in some rare cases, in particular when:
- the annotation is likely to have consequences for people, which may be the case when their data represent the whole or a significant part of the training dataset. This may be the case when a person’s data are used to fine-tune the model for their particular use on the basis of an annotated sample.
Example: For the configuration of a tool helping a household reduce its gas consumption through automated analysis, allowing the users of the tool to confirm that the installations detected (such as heating, water heater or gas oven) are the right ones can enable a better analysis.
- the annotation is likely to have consequences for individuals, for example where unintentional disclosure of the data could damage the reputation of individuals.
Individuals must be able to exercise their rights over annotations
Rights may be exercised over the labels associated with a person’s data when the derogations provided for in the texts (the GDPR and the French law “Informatique et Libertés”) do not apply, as described in the sheet dedicated to the exercise of rights. Indeed, the annotation attributed to personal data may in many cases also be considered personal data. It follows that:
- the right of access applies to the annotation: the information provided following a request for a right of access must contain the annotations attributed to the person’s data;
- the rights to rectification, erasure (in particular following withdrawal of consent), objection and restriction apply to annotations; where these rights are exercised, the same processing must be applied to the data concerned and to their annotation;
- the right to portability applies to the annotation only where it has been provided by the person and the processing is based on consent or on a contract.
In the case of medical results obtained after a blood test and collected following the patient’s consent, the measurements (such as blood sugar) may be the training data and the results (such as a diabetes diagnosis) the annotations. In this scenario, the right to portability applies to measurements and results, if the set was specifically provided by the patient for research purposes.
In the case of sound recordings obtained during the use of a tool such as a voice assistant, where people have consented to share their data with the controller for the improvement of its language comprehension algorithm, the annotations (containing for example the analysis of the content of the recording) are provided by the controller’s teams. In this scenario, the right to portability does not apply.
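As an illustration of how the exercise of rights can be propagated to annotations, here is a minimal sketch in Python in which an access request returns both a person’s data and the labels attributed to them, and an erasure request removes both. The storage layout and the function names are hypothetical.

```python
# Illustrative sketch only: propagating a rights request to annotations.
# The storage layout ('subject_id', 'data', 'annotation') and names are hypothetical.
def handle_rights_request(dataset, subject_id, request):
    """dataset: list of records of the form {'subject_id', 'data', 'annotation'}."""
    if request == "access":
        # Right of access: return the person's data together with the labels attributed to them.
        return [record for record in dataset if record["subject_id"] == subject_id]
    if request == "erasure":
        # Erasure (e.g. after withdrawal of consent): delete the data AND its annotation.
        dataset[:] = [record for record in dataset if record["subject_id"] != subject_id]
        return dataset
    raise ValueError(f"unsupported request: {request}")
```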
Distinguish annotation, profiling and automated decision-making
Although annotation consists of assigning one or more characteristics to a person’s data, which may thus constitute a profile, it is generally neither profiling as defined in Article 4 GDPR nor an automated decision within the meaning of Article 22 GDPR.
Indeed, the profiling referred to in the definition in Article 4(4) must result from automated processing the purpose of which is to evaluate personal aspects, in particular to analyse or make predictions about the data subjects. The use of the word “evaluate” suggests that profiling involves some form of appreciation or judgement of a person. In the vast majority of cases, annotation consists of a classification intended to serve as a “ground truth” for the model, which will have to learn to process, classify or discriminate data based on this information. The aim is generally not to evaluate individual characteristics through a judgement, and annotation therefore does not constitute profiling.
Furthermore, automated decisions referred to in Article 22 of the GDPR, which may include profiling, must have legal effects concerning the person or significantly affect the person. The annotation of a person’s data for training will generally not have an impact on the data subject at this stage of data processing.
Thus, annotation will only rarely be considered profiling, and it will generally not fall within the scope of Article 22 GDPR, unlike the outputs of the AI system in the deployment phase, which can often be considered as exclusively automated decision-making.
Annotation from sensitive data
The annotation can sometimes reveal special categories of data (ethnic origin, data on the health of the data subjects, political or trade union opinion, etc.) without the source data being itself a special category of data. The processing of this data is prohibited in principle by Article 9 GDPR; however, some exceptions exist. The organisation responsible for the annotation processing will have to identify one of these exceptions in order to be able to implement it legally.
This kind of annotation is therefore not impossible but is subject to special provisions which must be complied with. In the case of health research projects on data collected during care, for example, the exceptions provided for in Article 44(3) of the French Data Protection Act and Article 9(2)(j) of the GDPR may apply. Relying on these exceptions and completing one of the formalities provided for in Article 66 of the French Data Protection Act, such as a commitment to comply with a reference methodology or an application for authorisation granted by the CNIL, will allow the processing of annotated health data for the development of an AI system.
In view of the risks that the processing of these data entails for individuals, such as the risk of discrimination, the CNIL recommends using, as far as possible, other categories of data, such as synthetic data.
Where the applicable provisions are fulfilled and the processing is lawful, special measures must nevertheless be taken in view of the increased risk to persons. In particular, the CNIL recommends the following measures:
- Annotate according to objective and factual criteria (such as the measurement of skin colour according to the RGB system rather than annotation of the ethnic origin of the person represented in a picture, as sketched after this list), which can be facilitated by the use of technical annotation tools leaving no room for the annotator’s interpretation;
- Limit the annotation to the context of the data by avoiding drawing conclusions beyond the information present in the data;
- Strengthen the verification stage of annotations, in particular with regard to their regularity (e.g. by a higher frequency), their completeness (e.g. by analysing a larger volume of data), or checking their effectiveness (e.g. by logging the results of the checks, or by an external audit of the procedure). This verification step seems particularly crucial when automatic or semi-automatic annotation tools are used;
- Increase the security of annotated data by performing in-house annotation processing, processing data locally, and ensuring their security through encryption, logging, and stronger access restrictions;
- Study the risk of regurgitation and inference of special categories of data on models trained from them. Where the controller responsible for annotating the dataset does not develop a model but merely makes the data available to other bodies, it should encourage them to conduct this analysis on the models they develop. The CNIL wishes to consult the organisations concerned by this matter, through a dedicated questionnaire, on the cases where these risks are the most significant and on the measures that help reduce them (whether special categories of data are processed or not). The resulting recommendations will be published after the consultation phase.
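As a minimal sketch of the objective, factual criterion mentioned in the first measure above, the following Python code records the mean RGB value measured over a region of an image instead of a subjective label. The region coordinates, the stand-in image and the storage format are hypothetical illustrations.

```python
# Illustrative sketch only: recording an objective measurement (mean RGB over an
# image region) instead of a subjective label. Coordinates and format are hypothetical.
import numpy as np


def mean_rgb(image: np.ndarray, box: tuple) -> tuple:
    """Mean RGB value over a (x, y, width, height) region of an H x W x 3 image."""
    x, y, w, h = box
    region = image[y:y + h, x:x + w, :3]
    return tuple(float(channel) for channel in region.reshape(-1, 3).mean(axis=0))


image = np.random.randint(0, 256, size=(480, 640, 3), dtype=np.uint8)  # stand-in image
annotation = {"region": (100, 50, 20, 20), "mean_rgb": mean_rgb(image, (100, 50, 20, 20))}
```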
The topic of using sensitive data for the management of discriminatory biases is a crucial issue in artificial intelligence and will be the subject of a dedicated practice sheet.