CNIL's Q&A on the Use of Generative AI Systems
Many organisations are considering deploying or using generative AI systems and are seeking guidance on the necessary measures to put in place. This Frequently Asked Questions (FAQ) offers initial answers to their queries.
“Generative” artificial intelligence refers to the class of systems capable of creating content (text, computer code, images, music, audio, videos, etc.). These systems can be classified as general purpose AI systems when they can perform a whole range of tasks. This is especially true for systems based on large language models (LLMs). Designing these systems requires vast amounts of data from diverse sources (the Internet, licensed third-party content, conversations generated by human trainers, user interactions, synthetic data, etc.).
To learn more about the compliance requirements for developing these systems when they involve personal data, the CNIL provides recommendations on how to design them in compliance with the GDPR.
1. What are the benefits of generative AI?
Generative AI systems automatically generate content whose quality has recently seen significant improvements. These systems are capable of producing diverse, personalised, and realistic outputs. Essentially, generative AI can be used to perform three categories of tasks:
- Generating content from general instructions (such as images, text or computer code),
- Reprocessing existing content (such as correcting or translating texts),
- Analysing data (such as sorting or summarising documents).
Generative AI is primarily used to increase the creativity and productivity of the people using it, or as a building block for other systems.
2. What are the limitations and risks of generative AI systems?
Generative models are not knowledge bases: they operate on probabilistic logic, which means that they only generate results that are statistically the most likely based on the data they were trained on. Consequently, these systems may produce inaccurate results which may, however, appear plausible (often referred to as hallucinations). This can occur when they are queried about information not included in their training data, such as events occurring after their development, since these systems are not always connected to up-to-date knowledge bases.
Overreliance on the results produced by a generative AI system without proper verification can therefore lead to erroneous decisions or incorrect conclusions.
This characteristic contributes to the so-called “black box” effect, making it difficult to understand and explain these systems. For the providers of such systems, this complicates the detection and prevention of potential biases. For users, it raises significant trust issues regarding the results provided.
Finally, the capabilities of these systems can lead to misuse that should be anticipated and prevented, such as disinformation through deepfakes, the generation of malicious code, or the provision of information enabling illegal, dangerous or malicious activities (such as instructions for making weapons).
3. What approaches are available today to use generative AI (off-the-shelf models, fine-tuning, RAG, etc.)?
The first consideration is the type of generative AI model or system to use:
- Using an off-the-shelf system or model
Whether proprietary (marketed by its supplier) or open source, an off-the-shelf system or model can be more or less generalist, and therefore more or less suited to specific tasks. In some cases, it can be directed towards a specific task or context, e.g. through pre-prompt instructions given before the main response is generated (a short illustrative sketch is provided below, after the two options). These instructions can be integrated into the system or provided by the user, and do not require specific resources.
or
- Developing its own generative AI model
Alternatively, an organisation may choose not to use a pre-trained AI model but to train its own model before integrating it into the system. Currently, only a very limited number of entities can undertake this due to the substantial resources and skills required.
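To make the pre-prompting option mentioned above more concrete, the following sketch (in Python) shows how a fixed instruction can be prepended to an end-user's request so that an off-the-shelf model is directed towards a single, well-identified task. The instruction text and the message structure are illustrative assumptions; the actual format depends on the system or API chosen.

```python
# Minimal sketch of "pre-prompting": a fixed instruction is prepended to the
# user's request so that an off-the-shelf model is directed towards a specific
# task. The instruction text and message structure are illustrative assumptions.

SYSTEM_INSTRUCTION = (
    "You are an assistant that only corrects spelling and grammar in internal "
    "notes. Do not add, remove or reinterpret any content."
)

def build_messages(user_request: str) -> list[dict]:
    """Prepend the fixed pre-prompt instruction to the end-user's request."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTION},
        {"role": "user", "content": user_request},
    ]

# The resulting messages would then be sent to the chosen model or API.
print(build_messages("Please correct this note: ..."))
```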
Then, regardless of the type of model or system chosen, its performance can be improved using the following methods:
- Connecting the system to a knowledge base (RAG)
Retrieval-Augmented Generation (RAG) couples the model with an information retrieval mechanism that searches a vectorised knowledge base (embeddings). This makes it possible to produce outputs enriched by external data that are potentially more specific and easier to update than the model itself (a minimal sketch is given after this list). While this method requires more resources and skills than simply using an off-the-shelf system or model, it offers better adaptability and traceability of the information contained in the answers.
and/or
- Fine-tuning a pre-trained model on specific data
Fine-tuning involves changing the parameters (weights) of a model while maintaining its core architecture (e.g. that of a large language model pre-trained in certain languages). This method enhances performance for specific domains or queries. However, it demands significant resources, particularly in terms of computing capacity.
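To illustrate the RAG method described above, the sketch below shows its core mechanism: documents from a knowledge base are vectorised, the passages most similar to the question are retrieved, and they are injected into the prompt before generation. The embed() and generate() functions are hypothetical placeholders standing in for whichever embedding and generative models the organisation actually uses.

```python
# Minimal sketch of Retrieval-Augmented Generation (RAG).
# embed() and generate() are hypothetical placeholders for the embedding model
# and the generative model chosen by the organisation.
import math

def embed(text: str) -> list[float]:
    """Hypothetical: returns the vector representation of a text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Hypothetical: calls the generative model with the enriched prompt."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def answer_with_rag(question: str, documents: list[str], top_k: int = 3) -> str:
    # 1. Vectorise the knowledge base (in practice done once and stored in a
    #    vector database rather than recomputed for every question).
    index = [(doc, embed(doc)) for doc in documents]
    # 2. Retrieve the documents most similar to the question.
    question_vector = embed(question)
    best = sorted(index, key=lambda item: cosine(question_vector, item[1]), reverse=True)[:top_k]
    # 3. Inject the retrieved passages into the prompt, so that the answer is
    #    grounded in traceable, up-to-date sources rather than in the model's
    #    parameters alone.
    context = "\n\n".join(doc for doc, _ in best)
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return generate(prompt)
```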
4. How to choose your generative AI system?
It is recommended to start from concrete needs in order to choose the most suitable system, and to consider the risks involved based on the intended uses as well as on the limitations of the envisaged system. For example, facilitating the writing of certain content by making a conversational agent available to employees will be less risky than using a system to assist decision-making about customers, candidates or citizens. The system, the chosen method and the kind of deployment should therefore be subject to a higher level of requirements in the latter case.
Depending on the intended use and its sensitivity, it is necessary to check:
- Its general safety and security, e.g. ensuring that the tool is developed to deny responses to malicious, toxic or unlawful queries;
- Its relevance to the specifically identified need, for example by favouring systems that limit hallucinations, cite their sources, filter outputs, or are trained or fine-tuned on a specific, high-quality dataset;
- Its robustness, for example by relying on systems already tested and proven for the intended tasks;
- The absence of potential biases, e.g. discriminatory biases that could result in particular from a lack of diversity of training data;
- Its compliance with the applicable rules, taking note of its terms of use or of the available information on its training data, whether it is proprietary or open source.
The easiest way to do this is usually to consult the system documentation and, if required, to request external evaluations. Where necessary and feasible, it may also be appropriate to conduct or commission a tailor-made evaluation for the intended use.
5. Which deployment method should be preferred (on premise, API, cloud)?
Generative AI systems can be delivered on-premise, on a hosted cloud infrastructure or on-demand via Application Programming Interfaces (APIs). The choice of a deployment mode depends on the use case and the input data.
For non-confidential uses, relying on a consumer service to which access is permitted through dedicated business e-mail addresses may be considered with appropriate safeguards (provided that account creation with personal e-mail addresses is avoided and, where appropriate, the possibility for the system provider to reuse usage data is disabled).
If the intended use consists in providing personal data (e.g. customer or employee data) or sensitive or strategic documentation (e.g. for RAG), it generally seems more appropriate and secure to favour the deployment of on-premise solutions, which limit the risks of data extraction by a third party (in particular as detailed in ANSSI’s security recommendations for a generative AI system).
However, given the cost of setting up and operating an on-premise system, it will often be easier to use a system hosted in a remote infrastructure (cloud or off-premise). In this case, it will be necessary to secure the use of external infrastructures through a contract with the system host and, where applicable, with the AI system provider. This contract must clearly specify the scope of responsibility and the authorised access to the data processed. In this regard, there is nothing to prevent small companies or local authorities with limited resources from using a shared infrastructure, provided that data transfers are supervised and their security is ensured.
In addition, if the personal data may be transferred outside the European Union (for example because the hosting infrastructure is located outside the EU or is operated by a non-European provider), the user entity must also supervise the processing carried out by the recipient.
Finally, if using a system through APIs is considered, it must be pointed out that the control of the system is then almost exclusively in the hands of its supplier. Therefore, particular attention should be paid to the data submitted to the AI system, avoiding as far as possible inputs containing personal data. Particular attention should also be paid to contractual conditions, especially to issues of data transfers outside the EU.
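As an illustration of limiting the personal data submitted to an external API, the following sketch masks obvious identifiers (e-mail addresses and phone numbers) before a prompt leaves the organisation's infrastructure. The patterns and the redact_personal_data() helper are simplified assumptions, not an exhaustive filtering or anonymisation technique.

```python
# Minimal sketch: masking obvious personal data (e-mail addresses, phone numbers)
# before a prompt is sent to an external API. These patterns are simplified
# illustrative assumptions, not an exhaustive anonymisation technique.
import re

PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\+?\b\d[\d .-]{7,}\d\b"),
}

def redact_personal_data(prompt: str) -> str:
    """Replace obvious identifiers with placeholders before the prompt leaves the organisation."""
    for placeholder, pattern in PATTERNS.items():
        prompt = pattern.sub(placeholder, prompt)
    return prompt

print(redact_personal_data(
    "Write to jean.dupont@example.org (phone +33 6 12 34 56 78) about his complaint."
))
# e.g. -> "Write to [EMAIL] (phone [PHONE]) about his complaint."
```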
6. How should a generative AI system be implemented and managed?
A prior risk analysis and a clear governance strategy must be put in place before implementing this type of system. In particular, the deployer must ensure its compliance with the GDPR.
Regardless of the intended usage, the deployer must particularly:
- Question its role and the role of the provider with regard to the processing of personal data, by entering into a contract or a joint liability agreement with the latter if necessary, or even by supervising the transfer of data outside the European Union as mentioned above;
- Ensure the security of the data it provides for the development of the system (e.g. for a tailor-made or fine-tuned model) or during the deployment of the system (e.g. by deciding to object to the reuse of usage data by the system provider, or even to the retention of conversation history).
For more information: see ANSSI’s security recommendations for a generative AI system.
More generally, it is recommended to regulate the use of a generative AI system through internal policies or charters, clearly defining the authorised and prohibited uses.
Depending on the deployment method, it will be necessary to prohibit the provision of certain confidential (e.g. covered by industrial and commercial secrecy) or personal data.
For example, where the data is likely to be reused by the provider (in accordance with its terms and conditions), the organisation deploying the system will have to conduct a case-by-case analysis to determine whether or not it should prohibit the provision of any personal data, or only certain categories.
Conversely, such prohibitions are unnecessary for on-premise deployment where data re-use by the provider is not feasible.
An intermediate case may be to allow users to provide freely accessible data to such systems. However, the re-use of such freely accessible data by the AI system provider will not always be possible; the provider will have to conduct a dedicated analysis.
In addition to the need for intelligibility and accessibility of these documents, it is recommended to train users appropriately and to carry out regular checks.
7. How can end-users of these systems be trained and made aware of the risks?
In the general case, the organisation deploying the system will bear the legal liability in case of misuse of AI by its staff. It is therefore recommended to familiarise end-users with the functioning and limitations of these systems (how outputs are produced, possible data transfers, etc.), as well as authorised and prohibited uses.
End-users should be made accountable for their use of these tools by encouraging them to verify input data and output quality (e.g. through mandatory training prior to the use of these systems).
End-users should only submit information that they are allowed to share in the prompt or input data. For example, they should never share confidential information, such as personal data or company or administrative data (especially when covered by secrecy obligations such as business secrecy or professional ethics rules), when using a consumer service. The organisation could even provide input templates or pre-prompts directly accessible in the tool to encourage end-users to perform well-identified and proven tasks.
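As an illustration of the input templates mentioned above, the sketch below shows how a tool made available to staff can constrain requests to a well-identified task and remind end-users not to paste personal or confidential data. The task and wording are purely illustrative assumptions.

```python
# Minimal sketch of an input template made directly accessible in the tool,
# so that end-users perform a well-identified task without pasting personal
# or confidential data. The task and wording are illustrative assumptions.

SUMMARY_TEMPLATE = (
    "Summarise the following publicly available meeting minutes in five bullet points.\n"
    "Reminder: the pasted text must not contain names, contact details or any "
    "other personal or confidential data.\n\n"
    "Minutes:\n{minutes}"
)

def build_prompt(minutes: str) -> str:
    """Fill in the template; the resulting prompt is then sent to the chosen system."""
    return SUMMARY_TEMPLATE.format(minutes=minutes)
```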
With regard to outputs generated by the system, end-users should always approach them critically and verify that:
- they are accurate or of good quality (especially in the absence of sources, e.g. through counterfactual research);
- they do not constitute plagiarism (e.g. in case of strong suspicion of regurgitation of a protected work, by trying to find the original source);
- they do not introduce biases that could lead to discrimination (for example, by reproducing a stereotype).
Finally, they should be alerted to the risk of loss of confidence or competence associated with excessive use of the system (also known as automation bias). In order to avoid any loss of human control, it is good practice never to reproduce the outputs of these systems as they are, including when the systems incorporate filters that are supposed to prevent the generation of inappropriate content (since these systems are never flawless).
In this regard, the integration of generative AI components into an information system (IS) should be designed to clearly inform users of the origin of a proposal, reminding them that it is up to them not to take the proposal “as is”. For example, the system can provide at least two answers for each question, allowing the human operator to select the relevant elements of each answer to consolidate their own.
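As a sketch of this pattern, the code below requests two distinct answers for the same question and labels them as AI-generated, so that the human operator has to compare them and consolidate the final content rather than taking a single output as is. The generate() function is a hypothetical placeholder for the model or API actually chosen.

```python
# Minimal sketch: the information system requests two answers per question and
# clearly labels them as AI-generated, so the human operator must compare and
# consolidate them rather than reproducing a single output "as is".
# generate() is a hypothetical placeholder for the chosen model or API.

def generate(prompt: str, temperature: float) -> str:
    """Hypothetical call to the generative model; the temperature varies the sampling."""
    raise NotImplementedError

def propose_two_answers(question: str) -> list[str]:
    proposals = [generate(question, temperature=t) for t in (0.3, 0.9)]
    # Label each proposal so that the user is clearly informed of its origin.
    return [f"[AI-generated proposal {i + 1}] {text}" for i, text in enumerate(proposals)]
```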
8. What governance should be implemented for these systems?
Data Protection Officers (DPOs) can play a useful role in this regard, as they are already tasked with data protection issues related to these systems, if necessary in collaboration with Chief Information Security Officers (CISOs).
They can also serve as a point of contact for end-users to raise ethical concerns or difficulties. Where such concerns are justified, the designated individual should be able to alert the organisation about risks that may necessitate system adjustments, or to request changes from the provider, especially if only the provider can implement them.
Depending on the sensitivity of the usage, it may also be appropriate to consider regular checks to ensure that the recommended rules and best practices are complied with. In any case, in the first months and years of use of the generative AI system, it is recommended to regularly gather the opinion of end-users on its usefulness and limitations. This investigation may also be extended to anyone who may be affected by its use.
Finally, in cases of particularly sensitive uses, it is recommended to consider setting up an ethics committee or appointing a delegated point of contact to ensure compliance with the rules and best practices identified. Indeed, a perspective external to the operational teams can help identify risks that might otherwise be overlooked or minimised (such as bias or adverse effects). This committee or delegated point of contact should give a formal opinion before the deployment of the AI system or of a new use, and be regularly informed, for example through an annual review.
9. How to ensure that the use of a generative AI system complies with the GDPR?
As regards GDPR compliance, the CNIL has issued recommendations on the creation of datasets and the training of AI systems. Since most generative AI systems have been developed using personal data, it is your responsibility, as a user or integrator, to ask the provider about the compliance of its system with the GDPR and these recommendations.
These recommendations also apply to any AI system deployer that would process personal data in the context of its development, maintenance or improvement.
For example, they apply to:
- the deployer fine-tuning the system with its own data, who will then become data controller, as well as a provider under the AI Regulation, but only for that fine-tuning stage;
- the deployer choosing to connect the system to its own knowledge base (RAG) who will also be responsible for this processing when it contains personal data.
Further recommendations will be published in the coming months by the CNIL. Pending their publication, it is recommended to involve the Data Protection Officer (DPO) and, where appropriate, to carry out a Data Protection Impact Assessment (DPIA).
10. How to ensure compliance of the use of a generative AI system with the European AI Act?
The EU AI Act enters into force on 1 August 2024. It focuses on the risks that AI systems pose to health, safety and fundamental rights. The AI Act classifies the systems into four levels of risk by establishing different rules for each of them (ranging from outright prohibition to transparency requirements).
The use of generative AI systems, when not integrated into a high-risk system (which would otherwise be subject to additional requirements), falls under the rules applicable to general purpose AI systems, which entail increased transparency requirements (Chapter IV); the underlying models fall under the rules applicable to general purpose AI models (Chapter V).
The AI Act provides, inter alia, that organisations deploying these systems must be transparent with users about such use. Systems designed to interact directly with people must therefore be developed in such a way that those people are informed that they are interacting with an AI system. Users of these systems must inform the public that the images, video, text, etc. they publish have been generated by this means.