AI Chatbot Data Collection 2025: What They Store

The Default Setting: Your Data Trains Their Models

ChatGPT, Claude, and Gemini all use your conversations for AI training by default. Unless you dig through settings and opt out, every question you ask, every document you upload, every secret you share becomes training data.

That medical question? Training data. That legal document you uploaded? Training data. That conversation about your relationship problems? Training data.

A Stanford study warns: "If you share sensitive information in a dialogue with ChatGPT, Gemini, or other frontier models, it may be collected and used for training, even if it's in a separate file that you uploaded during the conversation." ^[1]

The Privacy Scorecard: Who's Worst?

ChatGPT (OpenAI)

Training by default: Yes

Data retention: Indefinite

Can opt out: Yes (buried in settings)

Human review: Yes

GDPR compliant: No ^[2]

Claude (Anthropic)

Training by default: Yes (changed 2025)

Data retention: Up to 5 years

Can opt out: Yes

Human review: Yes (de-identified)

GDPR compliant: Partial

Gemini (Google)

Training by default: Yes

Data retention: 18 months default

Can opt out: Yes (complex)

Human review: Yes

GDPR compliant: No ^[3]

Copilot (Microsoft)

Training by default: No (opt-in for consumers)

Data retention: Varies

Can opt out: Yes

Human review: Limited

GDPR compliant: Yes (enterprise)

ChatGPT: The Data Vacuum

OpenAI's ChatGPT is the worst offender for casual users.

What They Collect

Every conversation, stored indefinitely
Uploaded documents, images, spreadsheets
Your IP address and approximate location
Device information and browser type
How you use the interface (clicks, time spent)

The 2024 Policy Change

OpenAI pulled a fast one. In 2024, they removed the option for free and Plus users to disable chat history. Now all your prompts are retained indefinitely unless you manually delete them. ^[4]

Enterprise and Team subscribers can still opt out, with data purged after 30 days. Everyone else? Your data lives forever.

The Operator Problem

Using ChatGPT's "Operator" feature to browse the web? Screenshots and browsing activity persist for 90 days after deletion for "abuse monitoring." ^[4] Delete all you want; they're still holding on to it.

November 2025: The Mixpanel Breach

On November 9, 2025, analytics company Mixpanel discovered an attacker had accessed systems containing OpenAI user data. Names, emails, and analytics information were exposed. OpenAI shut down its Mixpanel integration while investigating. ^[5]

This wasn't the first breach. Earlier in 2025, over 225,000 OpenAI credentials appeared for sale on the dark web, stolen by infostealer malware. ^[6] A threat actor claimed to have 20 million more.

How to Opt Out of ChatGPT Training

Open ChatGPT
Click your profile icon → Settings
Go to Data Controls
Toggle off "Improve the model for everyone"

Warning: This only applies to new conversations. Everything you said before opting out is already in their training pipeline.

Claude: The Privacy Retreat

Anthropic marketed Claude as the privacy-conscious choice. That's no longer true.

The Quiet Policy Change

In 2025, Anthropic changed its terms of service: conversations with Claude are now used for training by default unless you opt out. ^[7]

This is a retreat from Claude's earlier stance as the privacy-first AI. The company still claims to be more cautious than OpenAI (the opt-out is clearer, and flagged content is de-identified), but it no longer refuses training by default.

Data Retention: 5 Years

If you don't opt out, your data can be kept for up to five years. Deleted chats aren't used, but anything from before you changed settings might still be in training datasets. ^[7]

The Enterprise Exception

Like every AI company, Anthropic treats paying enterprise customers differently. API users and business accounts are shielded from training use. Only consumers get exploited by default.

How to Opt Out of Claude Training

Go to claude.ai
Click your name → Settings
Find Privacy section
Toggle off training data usage

Gemini: Google's Data Integration Machine

Google's Gemini has the most complex, and invasive, data practices.

The 18-Month Default

Google stores your Gemini conversations for 18 months by default. You can change this to 3 or 36 months, or turn it off entirely in Activity controls. ^[8]

But here's the catch: even with activity turned off, conversations are still stored for 72 hours. Reviewed chats are retained for up to three years. ^[8]
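To make those windows concrete, here is a minimal Python sketch that turns the retention periods quoted above into calendar dates. The function names and the `RETENTION_MONTHS` table are illustrative, not anything Google publishes, and the results are only as accurate as the figures cited above.

```python
import calendar
from datetime import date, datetime, timedelta

# Auto-delete options quoted in the text, in months. The labels are illustrative.
RETENTION_MONTHS = {"short": 3, "default": 18, "long": 36}
ACTIVITY_OFF_HOURS = 72  # retention with activity turned off, per the text


def add_months(start: date, months: int) -> date:
    """Return the date `months` later, clamped to the last day of that month."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def expiry_with_activity_on(chat_date: date, setting: str = "default") -> date:
    """Earliest date a chat would age out under the chosen auto-delete setting."""
    return add_months(chat_date, RETENTION_MONTHS[setting])


def expiry_with_activity_off(chat_time: datetime) -> datetime:
    """Earliest time a chat would age out when activity is turned off."""
    return chat_time + timedelta(hours=ACTIVITY_OFF_HOURS)
```

This sketch does not model chats picked for human review, which the text says can be kept for up to three years regardless of your setting.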

Human Reviewers See Your Chats

A subset of your conversations gets reviewed by actual humans at Google. They're supposed to assess if responses were "low-quality, inaccurate, or harmful." ^[3] In practice, this means Google employees reading your private questions.

July 2025: The App Access Expansion

Starting July 7, 2025, Gemini gained access to Phone and Messages apps, even if you have "Gemini Apps Activity" turned off. ^[9]

Google claims turning off activity still prevents training use, and data is deleted after 72 hours. But an AI with access to your call logs and private messages? That's a lot of trust to place in Google's "internal practices."

November 2025: The Gmail Panic

Rumors spread that Google was using Gmail data to train Gemini. Google called the reports "misleading"; the confusion came from a January 2025 update that split one settings toggle into two. Some users found settings had flipped back on. ^[10]

Whether that was a bug or a dark pattern, users learned their email might be feeding the machine.

How to Limit Gemini Data Collection

Go to myactivity.google.com
Click Gemini Apps Activity
Toggle it off (or set shorter retention)
Delete existing activity

Note: This doesn't stop 72-hour retention or human review of flagged chats.

Microsoft Copilot: The Complicated One

Of the four, Microsoft's Copilot has the most privacy-friendly defaults for consumers.

Consumer Copilot

Training is opt-in, not opt-out. By default, chats are only used for "essential purposes" like bug fixes and abuse prevention. If you consent to training, personal identifiers are removed first. ^[11]

Your uploaded files are never used for training, regardless of settings. ^[11]

Microsoft 365 Copilot (Enterprise)

Enterprise users get the strongest protections:

Data is encrypted and never used to train foundation models
Prompts and responses aren't used for third-party training
Complies with GDPR, EU Data Boundary, and ISO/IEC 27018 ^[12]

The catch: you're paying Microsoft a premium for privacy that should be the default for everyone.

The Breach Record

AI chatbot security isn't theoretical. Here's what's already gone wrong:

OmniGPT (2025)

A hacker claimed to have breached the AI platform, exposing:

30,000 users' personal data
34 million lines of conversation logs
Uploaded files with credentials and API keys ^[6]

DeepSeek (January 2025)

The Chinese AI chatbot suffered several security incidents:

A DDoS attack halted new registrations
An internal database was left exposed to the public internet
An open ClickHouse instance caused a massive leak ^[13]

ChatGPT Indexed by Google (2025)

Thousands of ChatGPT conversations became searchable on Google due to misconfigured noindex tags on share-link pages. ^[6]

Your "private" conversation might be one Google search away.
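If you publish shareable pages yourself, you can at least check whether a page asks search engines to stay away. The sketch below uses only Python's standard library and looks for a `noindex` directive in a robots meta tag. The class and function names are made up for illustration, and this is not how OpenAI or Google check anything.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            for part in attr_map.get("content", "").split(","):
                self.directives.add(part.strip().lower())


def asks_not_to_be_indexed(html: str) -> bool:
    """True if the page carries a robots meta tag containing 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

Search engines also honor an `X-Robots-Tag` HTTP response header, which this sketch does not inspect, so a `False` result here does not prove a page is indexable.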

Samsung Leak (2023)

Samsung employees accidentally leaked confidential code and documents by pasting them into ChatGPT. Samsung banned generative AI tools company-wide. ^[6]

The Consumer vs. Enterprise Divide

Notice a pattern? Every AI company offers privacy, if you pay enterprise rates.

Consumer services operate under non-negotiable terms of service. Your data is the product. Enterprise platforms have legally binding Data Processing Addendums (DPAs). They sell privacy itself as the product. ^[7]

Same technology. Same company. Different rules based on how much you pay.

What They Won't Tell You

Training Data Is Forever

Even if you delete a conversation, model weights already trained on it persist. Your words become part of the AI itself. There is currently no practical way to make a deployed model "unlearn" your data once it's baked in.

Third-Party Sharing

OpenAI allows "authorized vendors" access to user data. User data can be shared with law enforcement or government agencies if required. ^[4] That medical question you asked might end up in a legal discovery request.

Inference Attacks

Stanford researchers warn about "inference attacks": techniques for extracting private information from AI models trained on user data. Even if your specific conversation isn't stored, the model might reveal patterns from your data when queried by others. ^[1]

How to Protect Yourself

Immediate Actions

Opt Out of Training (All Platforms)

ChatGPT: Settings → Data Controls → Disable "Improve the model for everyone"
Claude: Settings → Privacy → Disable training
Gemini: myactivity.google.com → Gemini Apps Activity → Off
Copilot: Already opt-in (but verify in settings)

Delete Existing Data

ChatGPT: Settings → Data Controls → Delete all chats
Claude: Delete conversation history in settings
Gemini: myactivity.google.com → Delete activity

Use Anonymous Access When Possible

DuckDuckGo's AI Chat (no account required, no logging)
Perplexity with incognito mode
Self-hosted models (Ollama, LM Studio)

Behavioral Changes

Never share sensitive information: Medical records, legal documents, passwords, personal identifiers
Assume everything is logged: Even "deleted" data may persist
Anonymize before sharing: Remove names, dates, identifying details
Use separate accounts: Don't link AI accounts to your primary email
Check settings regularly: Companies change policies without notice
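The "anonymize before sharing" habit is easier to keep if you automate part of it. Here is a rough Python sketch that scrubs a prompt before you paste it into a chatbot. The patterns and the `redact` function are illustrative assumptions, not a complete PII detector: they will miss plenty of identifiers and occasionally flag harmless numbers, so treat the output as a first pass and still read it yourself.

```python
import re
from typing import Sequence

# Illustrative patterns only; real-world identifiers are far more varied.
# Order matters: dates are replaced before the looser phone pattern runs.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact(text: str, names: Sequence[str] = ()) -> str:
    """Replace known names, emails, dates, and phone-like numbers with tags."""
    # Names cannot be guessed by a regex, so the caller supplies them.
    for name in names:
        text = re.sub(r"\b" + re.escape(name) + r"\b", "[NAME]", text,
                      flags=re.IGNORECASE)
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Call it as `redact(prompt, names=["Jane Doe"])` and paste the result instead of the original text.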

For Sensitive Work

Run local models: With Ollama, LM Studio, or PrivateGPT, your data never leaves your machine
Enterprise accounts: If your employer pays, use their protected instance
Avoid uploads: Type information manually instead of uploading documents
VPN + fresh account: For maximum privacy, use a VPN and an account not linked to your identity
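For the local-model route, here is a minimal Python sketch of what talking to Ollama looks like. It assumes Ollama is installed and running on its default local port (11434) and that you have already pulled a model. The model name `llama3.2` is only a placeholder for whatever you have installed, and the helper names are made up for this example.

```python
import json
import urllib.request

# Ollama's default local endpoint; the request goes to your own machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(prompt: str, model: str = "llama3.2") -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Example (requires a running Ollama server with the model pulled):
# print(ask_local_model("Explain this lab result in plain language: ..."))
```

Because the URL points at `localhost`, the prompt is sent to a process on your own computer rather than to a cloud service.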

The Uncomfortable Truth

Convenience vs. Privacy

Every AI assistant that "knows you" does so by collecting data about you. Every "personalized" response comes from surveillance. Every "improvement" to the model might include your private conversations.

The business model is simple: you get a free or cheap AI assistant. They get your data to train the next generation of AI, which they sell to enterprise customers who pay for privacy.

You're not the customer. You're the training data.