Gemini 3 Just Scored 100% On A Critical Test All Other AI Models Fail


There’s good news in the AI world, and it has nothing to do with Gemini 3 advancing the state of the art in LLM intelligence or capability. Rather, it’s about safety.

Summarizing my most recent Forbes article:

Google’s new Gemini 3 is the first major AI model to score a perfect 100% on the CARE test, a benchmark that evaluates how well large language models recognize and respond to self-harm and mental-health crisis scenarios. That’s a significant milestone at a time when hundreds of millions rely on AI assistants, and when ChatGPT alone receives up to 800,000 mental-health–related prompts every day.

The CARE test, created by Rosebud co-founder Sean Dadashi, measures whether AI avoids harmful advice, acknowledges distress, and guides people toward real support. Dadashi evaluated 22 major models and found that every one of them, from GPT-4o to Anthropic’s Claude and Meta’s Llama, failed key safeguards. xAI’s Grok ranked worst of all.

Dadashi’s motivation is personal: he struggled with self-harm as a teen and now wants to ensure AI supports young people rather than amplifying their pain. While newer models like GPT-5 show progress, Gemini 3 is the first to demonstrate that a perfect score on the benchmark is achievable.

Rosebud is open-sourcing the CARE test to expand it, especially to cover real-world, multi-turn conversations like those at the center of the tragic Adam Raine case. A great deal of work remains for every model, including Gemini 3. But this is a pivotal first step.

Read the full story on Forbes →