Where Does Your Data Go When You Use AI?


I'm moving to a new company. The department I'm joining needs to keep data security airtight. So I did what anyone would do - I sat down and actually read the privacy policies of every major AI provider to understand what happens to the data we put in.

What I found wasn't great.

Most major AI providers train on your data by default

That's the headline. If you're using the free or paid consumer version of most AI tools - ChatGPT, Gemini, Copilot - your inputs are being used to train future models. Not maybe. Not sometimes. By default.

Here's the breakdown:

Provider | Consumer tier | API tier | Enterprise | Opt-out
OpenAI (ChatGPT) | Trains by default | No training | No training | Settings toggle, but only for future chats
Google Gemini | Trains by default | No training (Vertex AI) | No training (Workspace) | Must disable Gemini Apps Activity - loses chat history
Microsoft Copilot | Trains by default | No training (Azure OpenAI) | No training (M365) | Settings, except in EEA
Meta AI | Trains by default | N/A (open-weight Llama) | N/A | Objection form, incomplete
Mistral | Trains by default (free tier) | No training (paid API) | No training | Account settings
Perplexity | Trains by default | No training (zero retention) | No training | Account toggle
Anthropic (Claude) | Does NOT train (opt-in since Aug 2025) | No training | No training | N/A - default is already private
xAI (Grok) | Does NOT train (opt-in) | Not documented | Not documented | N/A standalone, but Grok on X trains by default

Two things stand out.

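First, every provider that documents its API and enterprise tiers exempts them from training. The only gaps in the table are Meta, where those tiers don't apply, and xAI, which doesn't document them. The difference isn't the AI - it's how you access it. Consumer chat equals training data. API access equals private.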

Second, the opt-out mechanisms are not equal. Some only apply going forward, some cost you features, and some are incomplete.
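If you want to check your own tool list against the table, here's a minimal sketch in Python. It only encodes the table from this post as a lookup, so it's a snapshot, not an authoritative source. The names TRAINING_DEFAULTS, trains_by_default and needs_review are my own, made up for illustration. I treat "not documented" and "N/A" the same way as "trains": all of them mean someone should look before sensitive data goes in.

```python
from typing import Optional

# Snapshot of the comparison table in this post (policies as of April 2026).
# True  = trains on your data by default
# False = does not train by default
# None  = not documented or not applicable
# Caveats from the table: Mistral's consumer entry refers to the free tier,
# and Grok on X trains by default even though standalone Grok does not.
TRAINING_DEFAULTS = {
    "OpenAI (ChatGPT)":   {"consumer": True,  "api": False, "enterprise": False},
    "Google Gemini":      {"consumer": True,  "api": False, "enterprise": False},
    "Microsoft Copilot":  {"consumer": True,  "api": False, "enterprise": False},
    "Meta AI":            {"consumer": True,  "api": None,  "enterprise": None},
    "Mistral":            {"consumer": True,  "api": False, "enterprise": False},
    "Perplexity":         {"consumer": True,  "api": False, "enterprise": False},
    "Anthropic (Claude)": {"consumer": False, "api": False, "enterprise": False},
    "xAI (Grok)":         {"consumer": False, "api": None,  "enterprise": None},
}


def trains_by_default(provider: str, tier: str) -> Optional[bool]:
    """Look up the table entry; raises KeyError for unknown providers or tiers."""
    return TRAINING_DEFAULTS[provider][tier]


def needs_review(provider: str, tier: str) -> bool:
    """Flag anything that trains by default OR has no documented policy."""
    return trains_by_default(provider, tier) is not False


if __name__ == "__main__":
    # Hypothetical inventory of tools a team might be using.
    inventory = [("OpenAI (ChatGPT)", "consumer"), ("Google Gemini", "api")]
    for provider, tier in inventory:
        status = "REVIEW" if needs_review(provider, tier) else "ok"
        print(f"{provider} / {tier}: {status}")
```

Policies change, so treat the dictionary as something to re-verify, not something to trust.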

The opt-out trap

OpenAI lets you toggle off training in settings. Fair enough. But here's what most people miss - it only applies to future conversations. Everything you've already sent is out of your hands. It may have already been used.

Google is worse. If you want to opt out of training on Gemini, you have to disable "Gemini Apps Activity." That also kills your entire chat history. You can't keep your history and opt out of training. There's no middle ground.

And then there's a detail that surprised me. Google's privacy policy says human reviewers can read your Gemini conversations. Those reviewed conversations are stored for up to 3 years.

Meta doesn't even pretend to give you a full opt-out. Since December 2025, they use AI chat interactions for ad personalization. You can object to model training through a form, but the ad personalization piece has no complete off switch.

Why this matters beyond privacy

This creates a centralisation problem.

When big AI companies train on billions of user interactions, their models get smarter. That's obvious. What's less obvious is the second-order effect - smaller, open-source models don't have access to that data. They're training on public datasets while the big players are training on real-world business conversations, legal documents, financial reports, internal strategies.

Over time, this could widen the gap. The rich get richer. The open-source alternatives that many companies prefer for control and transparency might fall further behind - not because they're technically worse, but because they're data-starved.

What you can actually do

Check your settings today. In ChatGPT, go to Settings, Data Controls, and turn off "Improve the model for everyone." In Gemini, decide if losing history is worth the opt-out. In Copilot, check your data sharing preferences.

Ask your IT team one question: Are we on a consumer tier or an enterprise/API tier? If nobody knows the answer, that's your answer.

Use private or incognito modes for anything sensitive. ChatGPT has Temporary Chat. Claude has Incognito mode. Neither is used for training, regardless of your settings.

If you're building a product that uses AI - and this is what I did with LexBox - use API access exclusively. Legal documents have no business ending up in anyone's training pipeline.

The numbers that should worry compliance teams

The EU AI Act's transparency obligations for general-purpose AI models became legally binding in August 2025. Providers must now publish detailed summaries of their training data. Penalties go up to 15 million EUR or 3% of global annual turnover, whichever is higher.
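To make that penalty ceiling concrete, here's a rough sketch of the arithmetic in Python. It assumes the cap is the higher of the two figures, which is my reading of the Act, and the function name max_gpai_fine_eur is made up for illustration. This is the upper bound only, not a prediction of any actual fine.

```python
# Fixed ceiling quoted above, in whole euros.
FIXED_CAP_EUR = 15_000_000


def max_gpai_fine_eur(global_turnover_eur: int) -> int:
    """Upper bound on the fine: the higher of 15M EUR or 3% of turnover.

    Assumption: the two figures combine as "whichever is higher".
    Integer arithmetic avoids floating-point rounding on large amounts.
    """
    turnover_cap = global_turnover_eur * 3 // 100
    return max(FIXED_CAP_EUR, turnover_cap)


if __name__ == "__main__":
    # Hypothetical turnovers: 100 million EUR and 2 billion EUR.
    for turnover in (100_000_000, 2_000_000_000):
        print(turnover, "->", max_gpai_fine_eur(turnover))
```

For a company with 100 million EUR in turnover, 3% is only 3 million, so the 15 million figure is the ceiling. At 2 billion EUR in turnover, the 3% figure takes over and the ceiling is 60 million.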

The lawsuits are piling up. The New York Times forced OpenAI to preserve 20 million chat logs as evidence in its ongoing case. Anthropic settled for $1.5 billion over training data sources. Google paid $1.4 billion in Texas over biometric and location data. Over 70 AI-related infringement cases had been filed by the end of 2025.

The bottom line

Your data is your responsibility. AI tools are useful - I use them every day. But the default settings are not on your side. The gap between "I use ChatGPT" and "my company uses AI responsibly" is exactly one settings page and one conversation with your IT department.

Have that conversation.


Research based on privacy policies of OpenAI, Anthropic, Google, Microsoft, Meta, Mistral, Perplexity, and xAI as of April 2026. Policies change frequently - verify before making compliance decisions.

Max Mucko

Entrepreneur and builder based in Croatia. Writing about health, tech, and building in public.