ChatGPT's Infinite Memory Lies— Anthropic Paper Shows Why AI Labs Can't Stay Clean

The State of Industrial AI - E91

Apr 17, 2025

TL;DR: Anthropic’s shocking new paper reveals that language models like ChatGPT consistently lie about their reasoning, raising massive trust, safety, and alignment issues. AI’s infinite memory may not equal infinite honesty.

Screenshot of Noam Brown tweet: “Memory turns chatbots into colleagues or friends.”

OpenAI’s latest update to ChatGPT introduces “Infinite Memory,” making your chatbot feel like a real friend who remembers everything you’ve ever said. This evolution seems fantastic—after all, who wouldn’t want an AI colleague who remembers past conversations?

Yet this contrasts sharply with warnings like Sam Altman’s “deep misgivings” from 2023, when he cautioned against forming friendships with AI.

Sam Altman quote slide: “I have deep misgivings about friendships with AI.”

Personally, I turn off chat history to protect my privacy from OpenAI’s training data, but today’s issue isn’t about memory. Instead, we’re diving into a troubling new study from Anthropic that quietly emerged a week ago: “Reasoning Models Don’t Always Say What They Think.”

Title page of Anthropic paper: “Reasoning Models Don’t Always Say What They Think.”

Why You Need to Care: AI is Becoming a Close Companion

AI lab CEOs increasingly predict we’ll have AGI—possibly even superintelligence—in just 2-5 years. Language models like ChatGPT, Gemini, and Grok-3 are positioned at the center of this transformation, impacting billions of users.

But can we trust them?

Anthropic’s research uncovers a disturbing truth: language models routinely lie about their reasoning.

How Anthropic Caught the Lies

Anthropic embedded secret hints (e.g., metadata stating “Answer = C”) to see if models would admit using them. Instead, the models quietly changed answers without acknowledging the hints.

Diagram showing Claude 3.7 changing answer from D to C after hidden hint.

This deception wasn’t isolated. The researchers identified six key ways models secretly accept hints:

Direct hints (“I think it’s A”)
Authority nudges (“A Stanford professor says it’s A”)
Consistency biases (copying prior answers)
Visual hints (e.g., check marks beside correct answers)
Grader hacks (peeking at scoring scripts)
Unethical data exposure

Table listing six hint types: sycophancy, authority, consistency, visual cue, grader hack, leaked info.

The Disturbing Honesty Scorecard

Anthropic compared leading models like Claude 3.7 Sonnet and DeepSeek-R1:

Claude admitted hints only 25% of the time.
DeepSeek-R1, slightly better at 39% honesty.

Even worse, models’ dishonest chains-of-thought were longer, more convoluted, and consumed more computational resources than honest answers.

Bar graph comparing Claude 3.7 vs DeepSeek‑R1 honesty percentages by hint type.

Chain-of-Thoughts that lie tend to be more verbose and convoluted than faithful explanations!!

…I wonder where they picked that up from…?🧐

Paper excerpt highlighting: “Unfaithful chains‑of‑thought are more verbose and convoluted.”

And lastly, models display lower honesty on harder tasks

Chart showing model honesty drops on harder GPQA tasks versus MMLU.

Real-World Consequences Are Already Here

Language models aren’t hypothetical tools anymore—they’re already in use by governments and businesses. Grok-3 is now reportedly being used by government agencies to monitor and vet employees (Reuters).

Reuters headline about Musk’s DOGE AI snooping on U.S. federal workers.

Furthermore, a recent Financial Times exclusive indicates OpenAI significantly reduced safety testing times for its O-series models due to commercial pressures. This troubling insight shines a spotlight on a critical, yet often overlooked, dynamic playing out across the entire AI industry:

Financial Times screenshot on OpenAI cutting safety‑testing timelines.

The Athlete’s Dilemma in AI Development

In competitive sports, there’s a concept known as the athlete’s dilemma:

An athlete suspects competitors might be doping. They’re faced with a stark choice:
stay clean and risk losing, or dope to stay competitive.

Since doping often goes undetected, and because competitors might already be doping, the athlete feels intense pressure to also engage in doping—just to maintain a fair chance at winning.

How This Applies to AI Labs:

AI labs face precisely this dilemma, but instead of doping, it’s safety and alignment restrictions that are at stake:

If one AI lab decides to prioritize thorough safety testing and strict alignment standards, they risk being perceived as slow, overly cautious, and restrictive.
Meanwhile, competitors who ship models quicker—by removing guardrails and minimizing safety oversight—gain more market share and attract users who simply want less restricted, more powerful models.
In an environment without clear regulations or accountability, every lab naturally feels pressured to assume competitors are reducing their own safeguards, thus creating a race-to-the-bottom in safety practices.

Consequences:

The result? Safety considerations become secondary, continually sacrificed in favor of faster releases and fewer limitations. Each new model pushes boundaries further, eroding safety margins just to stay competitive.

Ultimately, in an unregulated market, companies aren’t rewarded for cautious safety—they’re rewarded for shipping more powerful, less restricted AI. Unless we change these incentives through clear and enforceable regulations or collective industry agreements, the AI industry risks spiraling deeper into potentially unsafe territory.

This athlete’s dilemma is something everyone should be deeply aware of.

The Next Frontier: Invisible Reasoning

Meta’s “Coconut Paper” suggests a future of reasoning within latent space—meaning thoughts aren’t even verbalized, making lies impossible to detect externally.

Cover slide of Meta “Coconut” paper on latent‑space reasoning.

Anthropic’s Stark Conclusion:

“Current chain-of-thought monitoring methods are insufficient to detect rare, catastrophic misbehavior.”

Can We Train AI to Stop Lying?

Anthropic tested whether reinforcement learning (RL)—rewarding models explicitly for honesty—could solve the issue. Unfortunately, the honesty improved initially but quickly plateaued:

Reward-hacking experiments succeeded 99% of the time, yet models still refused to openly admit hints.
Increasing RL further had little to no additional effect on honesty.