What Counts as Personal Data in LLM Systems
Under EU GDPR, personal data is any information relating to an identified or identifiable natural person, including indirect identifiers and “online identifiers” (GDPR Art. 4(1)). gdpr-info.eu
In an LLM (“GDPR AI”) setup, personal data can appear in more places than teams expect:
Prompts, system messages, and user inputs
Free-text prompts often include names, emails, order IDs, ticket transcripts, employee HR context, or customer complaint details. Even if you didn’t intend to send personal data, a user can paste it into an automation (e.g., n8n) or customer-facing chat.
Conversation history and chat/session logs
“Chat history” is simply a record of personal data if the conversation contains it. Providers may keep logs for defined purposes (e.g., abuse monitoring), and some products allow configurable retention. For example, OpenAI describes retention controls for enterprise workspaces and indicates deleted conversations are removed within 30 days (unless legally required). openai.com
Vector embeddings and re-identification risk
Embeddings are typically derived from text. GDPR risk hinges on identifiability: if an embedding can be linked back to a person (directly or via additional information), you should treat it as personal data. This aligns with GDPR’s broader logic that “pseudonymised” data can remain personal data when re-attribution is possible, and that pseudonymisation is a risk-reduction measure—not an exit from GDPR. edpb.europa.eu
Generated outputs that include or infer personal data
Outputs can reproduce personal data supplied in the prompt, personal data pulled in via RAG, or personal data inferred from context (“the customer who complained yesterday…”). Outputs are part of the same processing activity and must be covered by purpose, lawful basis, security, and retention decisions.
The misconception: “LLMs don’t store data”
“Stateless model” does not mean “no data is stored anywhere.” Even where a model is described as stateless, providers may still process and store certain data in service layers (e.g., stored conversation state, files, threads, safety logs). Microsoft notes Azure Direct Models are “stateless” and that prompts/completions are not stored in the model—but the same documentation also describes service features that can store data and an abuse monitoring data store (especially for Global/DataZone deployments). learn.microsoft.com
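Because prompts are the entry point for most of this personal data, a practical first control is to redact obvious identifiers before the prompt leaves your boundary. A minimal sketch — the patterns and the `ORD-` order-ID format are illustrative assumptions, and regexes will miss many identifiers, so treat this as a guardrail rather than a compliance control:

```python
import re

# Illustrative patterns only; production systems need a proper PII
# detection step and human review -- regexes under-detect identifiers.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ORDER_ID": re.compile(r"\bORD-\d{6,}\b"),  # hypothetical order-ID format
}

def redact(prompt: str) -> str:
    """Replace likely identifiers with placeholders before the prompt
    is sent to an LLM API or stored in chat history."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

print(redact("Customer jane.doe@example.com asked about ORD-123456."))
# -> Customer [EMAIL] asked about [ORDER_ID].
```

Redaction reduces what you send, but remember the point above: data the provider stores in service layers (threads, safety logs) still needs its own retention analysis.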
Lawful Basis for Processing Personal Data
GDPR requires at least one lawful basis under Article 6 for processing personal data. gdpr-info.eu For LLM use in internal tools, automations, and customer-facing products, teams usually evaluate:
Contractual necessity (Art. 6(1)(b))
Works when processing is objectively necessary to perform a contract with the user (e.g., drafting contractual text at the user’s request inside a paid SaaS feature). Overreach risk: “nice-to-have” analytics or broad prompt retention rarely qualifies as “necessary.” gdpr-info.eu
Legitimate interests (Art. 6(1)(f))
Common for internal productivity, fraud prevention, or support automation—but requires a balancing test and clear documentation of necessity and impact. UK/EU regulator guidance consistently emphasises choosing the basis that fits the reality of processing, not convenience. ico.org.uk
Consent (Art. 6(1)(a))
Often high-risk in product contexts because consent must be freely given, specific, informed, and revocable. In employment/internal tools, power imbalance frequently undermines “freely given.” (This is a legal interpretation of GDPR consent standards; validate against your counsel and local regulator guidance.)
Special category data (Art. 9)
If prompts or retrieved documents include health data, biometrics, political opinions, etc., Article 9 restrictions apply and you need both an Art. 6 basis and an Art. 9 condition. gdpr-info.eu+1
Controller vs processor role separation
Your organisation often acts as controller for user/customer data in prompts and in your product. LLM vendors commonly position themselves as processors (under a DPA) for customer content in business offerings—but you must confirm this for the exact product tier and integration.
Example: OpenAI’s DPA explicitly states OpenAI acts as a Data Processor on the customer’s behalf for “Customer Data” processed to provide the services. openai.com
Google’s Gemini API terms distinguish “Paid Services” and state Google processes prompts/responses under a “Data Processing Addendum for Products Where Google is a Data Processor.” Google AI for Developers
Data Minimisation and Purpose Limitation
GDPR requires you to define purpose up front and limit processing to what is necessary for that purpose (purpose limitation and data minimisation principles).
Where LLM deployments fail in practice:
Free-form prompts encourage over-collection
If staff paste entire emails, CRM records, or HR notes “to get a better answer,” you may collect more personal data than needed for the task.
Context length, memory, and persistence must be intentionally designed
If you enable conversation memory, store threads, or keep response state for debugging, you’re expanding the amount of personal data processed and retained. That can be legitimate—but only if it’s necessary and documented.
Secondary use risk: training, analytics, monitoring
Even when providers state they do not train on business data by default, there may still be logging for abuse monitoring or operations.
For example, the Gemini API “abuse monitoring” policy states Google retains prompts/context/output for 55 days for abuse monitoring and that this abuse-monitoring data is not used to train/fine-tune models. Google AI for Developers
Purpose drift in general-purpose AI deployments
A support summarisation tool becomes a “customer insights engine,” and then a source of “model fine-tuning data,” unless your governance prevents that drift.
Concrete minimisation failures are highly fact-specific. If you want real examples in the article, they should be sourced from incident reports, regulator cases, or your internal postmortems—otherwise they risk becoming speculative.
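One way to make minimisation enforceable rather than aspirational is to build the LLM context from an explicit field allowlist instead of letting whole records through. A minimal sketch under assumed field names (the CRM schema here is hypothetical):

```python
# Only fields approved for the stated purpose may reach the LLM;
# everything else in the record is dropped by construction.
ALLOWED_FIELDS = {"ticket_subject", "ticket_body", "product"}

def build_context(record: dict) -> dict:
    """Return the allowlisted subset of a record for prompt assembly.
    Dropping by default enforces data minimisation at the code level."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

ticket = {
    "ticket_subject": "Refund delay",
    "ticket_body": "Order arrived damaged.",
    "product": "Widget",
    "customer_email": "jane@example.com",  # not needed to summarise
    "hr_notes": "internal only",           # never needed
}
context = build_context(ticket)
print(sorted(context))  # only the three approved fields survive
```

The design choice is deliberate: an allowlist fails closed, so a new CRM field never leaks into prompts until someone approves it.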
Model Training, Prompt Retention, and Data Reuse
A GDPR-first evaluation separates three questions that are often blurred:
A) Is customer content used for training or improvement?
OpenAI (business products / API): OpenAI states that by default it does not train on inputs/outputs from business users (including ChatGPT Business, Enterprise, and the API), unless customers explicitly opt in. OpenAI Help Center+1
Microsoft Azure Direct Models: Microsoft states prompts/completions are not used to train, retrain, or improve the base models. learn.microsoft.com
Amazon Bedrock: AWS documentation states Bedrock doesn’t use prompts/completions to train AWS models and does not distribute them to third parties. AWS Documentation
Gemini API: For “Paid Services,” Google states it doesn’t use prompts/responses to improve products and will process them under the relevant DPA; abuse-monitoring data is “solely for policy enforcement” and not for training/fine-tuning. Google AI for Developers+1
B) How long are prompts/outputs retained?
Retention must be mapped across:
provider logs (abuse monitoring / ops),
application state you enable (threads, stored responses),
your own storage (chat history, analytics, RAG caches).
Examples of explicit statements:
Gemini API: retains prompts/context/output for 55 days for abuse monitoring. Google AI for Developers
OpenAI API (application state): platform docs describe a 30-day application state retention for the Responses API by default (or when store=true), and note that with Zero Data Retention enabled, store is treated as false. OpenAI Platform
OpenAI (logs): OpenAI has stated that after 30 days, API inputs/outputs are removed from OpenAI logs (unless legally required to retain them). openai.com
Azure Direct Models: Microsoft documentation describes an abuse monitoring data store (notably for Global/DataZone) but the public documentation excerpted here does not state a specific retention duration for that store. learn.microsoft.com
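The retention layers listed above can be tracked in a simple retention map. In this sketch the figures mirror the provider statements cited in this section, `None` flags a layer with no public figure in the cited docs, and the internal chat-history value is a hypothetical policy of your own:

```python
# Retention per storage layer, in days. None = no public figure in the
# cited documentation -- resolve these before launch.
RETENTION_DAYS = {
    "gemini_abuse_monitoring": 55,    # per cited Gemini API policy
    "openai_api_logs": 30,            # per cited OpenAI statement
    "openai_responses_state": 30,     # default application state
    "azure_abuse_monitoring": None,   # duration not stated in cited docs
    "our_chat_history": 90,           # hypothetical internal policy
}

def unresolved_layers(retention: dict) -> list:
    """Layers whose retention is undocumented and needs vendor follow-up."""
    return sorted(k for k, v in retention.items() if v is None)

print(unresolved_layers(RETENTION_DAYS))
# -> ['azure_abuse_monitoring']
```

A map like this doubles as DPIA evidence: it shows which retention periods are contractual facts and which are open questions.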
C) Right to erasure vs trained models
If personal data is used to train a model, erasure becomes complex because training transforms and disperses influence across model weights. This is a legal/technical tension teams should surface early in DPIAs—especially if you are considering fine-tuning with real personal data.
D) Fine-tuning with real personal data
Microsoft states training data uploaded for fine-tuning is not used to train foundation models without permission/instruction, and that fine-tuned models are exclusive to the customer and deletable. learn.microsoft.com
Gemini API terms mention tuning content retention associated with tuned models and deletion when the tuned model is deleted. Google AI for Developers
Even with these statements, fine-tuning on personal data remains high-risk and usually demands a DPIA and strict minimisation.
Data Residency, EU Hosting, and Cross-Border Transfers
“EU hosting” is not automatically “EU-only processing.” GDPR transfer risk depends on where personal data is processed and accessed, including by subprocessors.
Typical LLM data flows
user prompt → your app → LLM API endpoint
provider may route within regions/geographies for capacity and resilience
provider may log prompts/outputs for abuse monitoring or security
provider may use subprocessors to deliver parts of the service
your app may store chat history, embeddings, or analytics elsewhere
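The flow above is easier to audit if each hop is recorded as data rather than prose. A minimal inventory sketch — hop names, regions, and the storage flags are illustrative assumptions, not claims about any specific provider:

```python
from dataclasses import dataclass

@dataclass
class Hop:
    """One hop in the prompt's journey through the system."""
    name: str
    region: str         # where processing occurs (illustrative values)
    stores_data: bool   # does this hop persist prompts/outputs?

FLOW = [
    Hop("your_app", "eu-west-1", stores_data=True),          # chat history
    Hop("llm_api_endpoint", "eu", stores_data=False),
    Hop("abuse_monitoring_log", "unknown", stores_data=True),
    Hop("analytics_store", "us-east-1", stores_data=True),
]

def transfer_review_needed(flow: list) -> list:
    """Hops processing data outside a confirmed-EU region: each one
    needs a Schrems II / SCC transfer analysis."""
    return [h.name for h in flow if not h.region.startswith("eu")]

print(transfer_review_needed(FLOW))
# -> ['abuse_monitoring_log', 'analytics_store']
```

Note that "unknown" is treated the same as non-EU: until routing is documented, you cannot claim EU-only processing.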
Schrems II implications and SCCs
Schrems II validated SCCs but requires assessing third-country law and adding supplementary measures where needed. EDPB recommendations provide a methodology for exporters to identify and implement supplementary measures. edpb.europa.eu+1
The European Commission also provides Q&A on SCCs as a transfer tool. European Commission
AWS’s GDPR Center explicitly references reliance on SCCs post–Schrems II for transfers outside the EEA. Amazon Web Services, Inc.
Why encryption alone is insufficient
Encryption reduces risk, but Schrems II/EDPB framing is about essentially equivalent protection considering access risks in third countries. Encryption can be a supplementary measure, but it does not automatically solve transfer legality—particularly if keys or access paths undermine protection. edpb.europa.eu
Provider-specific residency and routing considerations (high-level)
Microsoft Foundry Models (Global vs DataZone): Microsoft documentation says Global deployments may process prompts/responses “in any geography” where the model is deployed; DataZone deployments may process prompts/responses “within the specified data zone” (e.g., within EU member nations if the resource is in an EU member nation). Data stored at rest remains in the designated geography. learn.microsoft.com+1
OpenAI (Europe data residency): OpenAI states eligible API customers can create a Europe region project; requests are handled in-region “with zero data retention” (requests/responses not stored at rest). openai.com
Amazon Bedrock (cross-Region inference): AWS documentation describes “Geographic cross-Region inference” as keeping data processing within specified geographic boundaries (including EU) while routing across Regions for throughput. AWS Documentation+1
Gemini API (Paid Services): Google states Paid Services are processed under a DPA and are not used to improve products, but also says prompts/responses may be stored transiently or cached “in any country in which Google or its agents maintain facilities” for limited-time logging related to prohibited use policy and required disclosures. Google AI for Developers
From a GDPR-first stance: you must document the actual routing/processing geography, not only the marketing label.
AI Act vs GDPR: What Changes and What Doesn’t
The EU AI Act and GDPR regulate different things:
GDPR: governs processing of personal data.
AI Act: establishes a risk-based framework for AI systems placed on the market/put into service/used in the EU, with obligations that vary by system category (including high-risk systems and specific rules for general-purpose AI). eur-lex.europa.eu
Key GDPR-first takeaways:
AI Act does not replace GDPR. If your LLM use involves personal data, GDPR obligations remain.
Overlap exists (documentation, governance, risk assessment), but the triggers differ: GDPR triggers are about personal data processing; AI Act triggers are about the AI system category and use context. eur-lex.europa.eu
Provider vs deployer responsibilities split: AI Act assigns duties to providers and deployers depending on role; GDPR assigns controller/processor duties based on who determines purposes/means and who processes on behalf of whom.
Any combined compliance plan should avoid a common misconception: “AI Act compliance = privacy compliance.” It does not.
Most Popular LLM Providers — GDPR Comparison (Compliance-only, Source-backed)
Scope note (to avoid category errors): “Azure AI Foundry,” “AWS,” and “Gemini” each cover multiple products. The comparison below is limited to what is stated in the cited sources for:
Microsoft Foundry / Azure Direct Models deployment types and data privacy docs
OpenAI API + ChatGPT Business/Enterprise statements in OpenAI docs
Amazon Bedrock data protection docs
Gemini API terms/policies for Paid Services and abuse monitoring
Comparison table (facts + “insufficient public info” flags)
| Provider | Controller/Processor role clarity | Training use of customer content | Prompt/output retention | EU data residency / EU-only processing option | DPIA & audit support (as stated) | Sub-processor transparency |
|---|---|---|---|---|---|---|
| Azure AI Foundry (Azure Direct Models) | Microsoft points to its Data Protection Addendum as governing processing for Azure services; role specifics depend on that DPA. learn.microsoft.com | Prompts/completions “not used to train, retrain, or improve the base models.” learn.microsoft.com | Insufficient public info in the cited Foundry privacy docs for a fixed retention period for abuse monitoring store (docs describe it but don’t give a duration). learn.microsoft.com | Standard: within customer-specified geography (may process between regions within geography). Global: may process in any geography where model is deployed. DataZone: may process within the Microsoft-defined data zone (e.g., EU member nations). At rest remains in designated geography. learn.microsoft.com+1 | Not confirmed in the cited Foundry docs for DPIA/audit assistance terms (would require DPA excerpts). | Microsoft maintains subprocessor information for Microsoft Online Services. microsoft.com |
| OpenAI (API + ChatGPT Business) | OpenAI DPA states OpenAI acts as a Data Processor and describes assistance, audit, and subprocessor mechanisms. openai.com | By default, OpenAI states it does not train on business inputs/outputs (ChatGPT Business/Enterprise/API) unless opt-in. OpenAI Help Center+1 | OpenAI: API logs removed after 30 days (unless legally required). Responses API application state retained 30 days by default when stored; ZDR forces store=false. openai.com+1 | Europe residency: eligible API customers can choose Europe region; requests handled in-region with “zero data retention” for eligible endpoints. openai.com | DPA includes assistance for DPIAs and audit/inspection terms (with constraints). openai.com | OpenAI publishes a sub-processor list. openai.com+1 |
| AWS (Amazon Bedrock) | AWS DPA governs processor terms; AWS publishes subprocessor details under the AWS DPA framework. Amazon Web Services, Inc.+1 | Bedrock “doesn’t use your prompts and completions to train any AWS models” and doesn’t distribute them to third parties. AWS Documentation | Bedrock “doesn’t store or log your prompts and completions” (per cited Bedrock data protection doc). AWS Documentation | Cross-Region inference can be configured; “Geographic cross-Region inference” keeps processing within boundaries (including EU). AWS Documentation+1 | Not confirmed in the cited Bedrock pages for DPIA/audit assistance wording (would require DPA excerpts beyond what’s cited here). | AWS publishes a sub-processor page and DPA references. Amazon Web Services, Inc.+1 |
| Gemini API (Paid Services via Google Cloud billing) | Terms say Paid Services process prompts/responses under a “Data Processing Addendum for Products Where Google is a Data Processor.” Google AI for Developers | Paid Services: Google “doesn’t use your prompts or responses to improve our products.” Abuse-monitoring data is not used to train/fine-tune models. Google AI for Developers+1 | Abuse monitoring retains prompts/context/output for 55 days. Paid Services also log prompts/responses for a limited time for prohibited use policy/legal disclosures; terms state this data may be stored transiently/cached in any country where Google or its agents have facilities. Google AI for Developers+1 | Insufficient public info in the cited Gemini API docs for an “EU-only processing” guarantee; terms explicitly allow transient storage/caching in any country for certain logging. Google AI for Developers | Not confirmed in the cited Gemini API docs for DPIA/audit assistance wording (would require DPA excerpts). | Google Cloud offers a Cloud DPA framework; subprocessor transparency exists in Google Cloud documentation, but exact subprocessor list references for Gemini API specifically are not established in the cited sources here. Google Cloud |
Important: this table is deliberately conservative. Where sources are silent or ambiguous, it says so. That is the safest posture for privacy and data protection decisions.
GDPR-First Deployment Checklist + Summary Table
Deployment checklist (practical, controller-led)
Purpose definition (purpose limitation)
Write a one-sentence purpose per LLM feature (“summarise support tickets to speed resolution”).
Ban secondary use by default unless explicitly approved.
Lawful basis selection (Art. 6)
Document the chosen basis (contract or legitimate interests are common).
If legitimate interests: complete a balancing test (keep it on file). gdpr-info.eu+1
Data classification
Define what can be sent to the LLM (no special category data unless explicitly authorised).
Put guardrails into UI/automation templates (n8n nodes, internal copilots).
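A classification guardrail can be as simple as a keyword screen wired into those templates. This sketch is a crude illustration — the term list is an assumption, will badly under-detect Art. 9 content, and should gate to human review rather than serve as the control itself:

```python
# Crude screen for likely special category (Art. 9) content.
# Keyword lists are illustrative and incomplete by design; route hits
# to a human reviewer instead of silently blocking or allowing.
SPECIAL_CATEGORY_TERMS = {
    "diagnosis", "medication", "religion",
    "trade union", "biometric", "ethnicity",
}

def flag_special_category(text: str) -> set:
    """Return the special-category terms found in the text, if any."""
    lowered = text.lower()
    return {t for t in SPECIAL_CATEGORY_TERMS if t in lowered}

hits = flag_special_category("Customer mentioned a diagnosis of asthma.")
print(sorted(hits))  # non-empty -> hold the prompt for human review
```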
Region selection + data flow validation
Confirm where inference is processed (standard vs global routing; cross-region inference profiles; data zones).
Treat “EU hosting” as a hypothesis until proven by documentation and contract. learn.microsoft.com+2AWS Documentation+2
DPA + sub-processor review (Art. 28)
Ensure your contract includes required Art. 28 clauses and subprocessor rules. gdpr-info.eu
Subscribe to subprocessor change notifications where available. openai.com+2Amazon Web Services, Inc.+2
Retention & deletion design
Decide what you store vs what the provider stores.
Prefer configurations that minimise provider-side retention where business requirements allow (e.g., ZDR / no stored state), and validate against documented behaviour. OpenAI Platform+1
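Where the provider exposes a storage lever — such as the Responses API `store` parameter discussed in the retention section — default it off in code, not in documentation. A sketch under assumptions (the model name is a placeholder, and the effect of `store` must be verified against current provider docs):

```python
# Build request parameters with provider-side state disabled by default.
# "store" is the Responses API parameter discussed above; verify its
# current behaviour against provider documentation before relying on it.
def responses_kwargs(prompt: str, allow_storage: bool = False) -> dict:
    return {
        "model": "example-model",  # placeholder model name
        "input": prompt,
        "store": allow_storage,    # False => no stored application state
    }

kwargs = responses_kwargs("Summarise ticket #123")
# then, with the official SDK: client.responses.create(**kwargs)
print(kwargs["store"])
# -> False
```

Making storage opt-in per call means a later debugging feature has to argue for retention explicitly, which is exactly the documentation trail GDPR expects.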
Access control, logging, and security
Apply least privilege, rotate keys, isolate tenants/projects, monitor data exfiltration paths.
Remember Schrems II: technical controls help but don’t eliminate transfer analysis. edpb.europa.eu
Data subject rights handling
Build a process for access, deletion, and objection requests that covers:
your stored chat history/embeddings,
provider logs where controllable,
third-party systems connected via RAG.
If your provider DPA describes DSAR handling support, map it. (Example: OpenAI DPA includes DSAR assistance language.) openai.com
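An erasure sweep across those three layers can be scaffolded as code that returns an auditable report. The in-memory stores below are stand-ins — a real implementation would call your chat database, vector store, and the vendor's DPA process:

```python
# In-memory stand-ins for the stores listed above; real systems would
# call the chat DB, the vector store, and provider/RAG deletion APIs.
chat_history = {"user-42": ["hi", "my order is late"], "user-7": ["hello"]}
embeddings   = {"user-42": [[0.1, 0.2]], "user-7": [[0.3, 0.4]]}

def erase_subject(user_id: str) -> dict:
    """Best-effort erasure sweep for one data subject. Returns a report
    suitable for attaching to the DSAR record."""
    report = {}
    for name, store in (("chat_history", chat_history),
                        ("embeddings", embeddings)):
        report[name] = store.pop(user_id, None) is not None
    # Provider logs cannot be deleted directly: record the DPA step taken.
    report["provider_logs"] = "manual: follow vendor DPA process"
    return report

print(erase_subject("user-42"))
```

The report pattern matters: a DSAR response must state what was deleted, what was already absent, and what required the vendor process, and the function's return value captures all three.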
Final summary table (GDPR risk areas)
| GDPR risk area | What to verify | Why it matters |
|---|---|---|
| Training/data reuse | Is customer content used to improve models? Opt-in/opt-out? | Purpose limitation, transparency, erasure practicality |
| Retention | Provider log retention + your app retention + stored state | Minimisation, storage limitation, DSAR feasibility |
| Transfers | Processing geography, cross-region routing, transient caching, subprocessors | Schrems II compliance, SCC + supplementary measures |
| Roles & contracts | Controller/processor, Art. 28 clauses, audit, assistance | Accountability and enforceability |
| Security | Encryption, access controls, isolation, monitoring | GDPR security (Art. 32) risk reduction |
| High-risk processing | DPIA triggers (new tech, scale, sensitive data, monitoring) | Art. 35 DPIA obligations gdpr-info.eu+1 |
