GPT-5.5 dominates $1,500 LLM hacking test while Gemini refuses to even try

Image: Anthropic, OpenAI, DeepSeek, Google (edited)

Alongside other AI models, Claude, Gemini, GPT, and DeepSeek produced some of the most interesting findings.

A security researcher spent $1,500 running 13+ AI models against a deliberately vulnerable app. GPT-5.5 led with a 70% solve rate, DeepSeek V4 Pro solved it for $0.62 per solve, and Gemini refused to engage almost entirely.

Anubhav Sharma, Published 06/04/2026

AI Security

A security researcher just published one of the more revealing AI capability tests of the year. The results say a lot about where different models actually stand.

Kasra Rahjerdi, who does app security research professionally, built a deliberately vulnerable book review app containing a real-world class of vulnerability: exposed Firebase credentials inside the APK that allow direct database access, bypassing an otherwise hardened API entirely. He then fed the challenge to more than a dozen AI models, allotting each run a $10 budget and two hours, and spent $1,500 in total.
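For readers who want a feel for this class of issue from the defensive side, here is a minimal sketch of how one might check their own app's unpacked resources for Firebase configuration strings. The article does not describe how the models or the researcher located the credentials, so everything here is an assumption for illustration: the function name `find_firebase_indicators`, the two regular expressions (a `firebaseio.com` database URL and the commonly cited `AIza` API key prefix), and the sample XML are hypothetical and far from exhaustive.

```python
import re

# Assumed, illustrative patterns; real apps may use other domains or formats.
FIREBASE_DB_URL = re.compile(r"https://[a-z0-9-]+\.firebaseio\.com")
GOOGLE_API_KEY = re.compile(r"AIza[0-9A-Za-z_-]{35}")


def find_firebase_indicators(text):
    """Return Firebase-looking database URLs and API keys found in text."""
    return {
        "database_urls": sorted(set(FIREBASE_DB_URL.findall(text))),
        "api_keys": sorted(set(GOOGLE_API_KEY.findall(text))),
    }


# Placeholder key and project name; neither refers to a real project.
placeholder_key = "AIza" + "X" * 35
sample = f"""<resources>
    <string name="firebase_database_url">https://example-app.firebaseio.com</string>
    <string name="google_api_key">{placeholder_key}</string>
</resources>"""

if __name__ == "__main__":
    print(find_firebase_indicators(sample))
```

Finding such strings in an app is not a vulnerability by itself; the exposure described in the article comes from the database accepting direct access with them, which is governed by the backend's access rules rather than by the APK.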

GPT-5.5 was the clear winner. It solved the challenge in 7 out of 10 runs at a cost of $9.46 per solve. Almost every successful run zeroed in on Firebase immediately after unpacking the APK, without getting distracted by the API or the app itself.

Screenshots of the intentionally vulnerable book review app.

DeepSeek V4 Pro was the cost-efficiency champ, solving 3 out of 10 runs at just $0.62 per solve. That makes it roughly 15x cheaper per success than GPT-5.5, despite a lower solve rate. For anyone running security tooling at scale, that gap could make a substantial difference.
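The "roughly 15x" figure follows directly from the two per-solve costs quoted above. As a quick sanity check, the sketch below recomputes the solve rates and the cost ratio; the dictionary layout and function names are just illustrative choices, and only the numbers come from the article.

```python
# Figures as reported in the article; the structure is an illustrative choice.
results = {
    "GPT-5.5": {"solved": 7, "runs": 10, "cost_per_solve": 9.46},
    "DeepSeek V4 Pro": {"solved": 3, "runs": 10, "cost_per_solve": 0.62},
}


def solve_rate(entry):
    """Fraction of runs that ended in a successful solve."""
    return entry["solved"] / entry["runs"]


def cost_ratio(expensive, cheap):
    """How many times more one model costs per solve than another."""
    return expensive["cost_per_solve"] / cheap["cost_per_solve"]


ratio = cost_ratio(results["GPT-5.5"], results["DeepSeek V4 Pro"])

if __name__ == "__main__":
    for name, entry in results.items():
        print(f"{name}: {solve_rate(entry):.0%} solve rate")
    print(f"GPT-5.5 costs about {ratio:.1f}x more per solve")
```

The ratio says nothing about reliability: a cheaper model with a 30% solve rate may still need several attempts before it succeeds once.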

Claude Sonnet 4.6 and Claude Opus 4.8 each solved 2 out of 10 runs, though Opus in particular got close multiple times before safety guardrails ended the session. Gemini sits at the bottom. Gemini 3.1 Pro Preview refused immediately in nearly every run, which shows in its median token count of just 9k versus 100k+ for every other model tested. Gemini 3.5 Flash was not much better, with frequent early refusals and only two runs that attempted the problem at all.

Rahjerdi observed that Chinese models were far more willing to interact directly with live databases, while Western models showed more hesitation mid-task, even when they had identified the right approach. He also stresses that this is not a scientific evaluation, just a well-documented experiment.

Source(s)

Kasra Rahjerdi


