
GPT-5.5 dominates $1,500 LLM hacking test while Gemini refuses to even try

Alongside other AI models, Claude, Gemini, GPT, and DeepSeek produced some of the most interesting findings.
ⓘ Anthropic, OpenAI, DeepSeek, Google - edited
A security researcher spent $1,500 running 13+ AI models against a deliberately vulnerable app. GPT-5.5 led with a 70% solve rate, DeepSeek V4 Pro solved it for $0.62 per solve, and Gemini refused to engage almost entirely.

A security researcher just published one of the more revealing AI capability tests of the year. The results say a lot about where different models actually stand.

Kasra Rahjerdi, a professional app security researcher, built a deliberately vulnerable book review app around a real-world class of flaw: Firebase credentials exposed inside the APK that allow direct database access, bypassing an otherwise hardened API entirely. He then gave the challenge to more than a dozen AI models, allotting each run a $10 budget and two hours, and spent $1,500 in total.
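To make that class of flaw concrete, here is a minimal defensive sketch of how an auditor might check a decompiled app for exposed Firebase settings. It assumes the configuration ends up as string resources in res/values/strings.xml, which is where the google-services build plugin typically writes it; the article does not say where the credentials sat in Rahjerdi's app. The function name, the key list, and the sample XML are all hypothetical.

```python
import re

# Resource names commonly generated for Firebase projects.
# Assumption: this list is illustrative, not exhaustive.
FIREBASE_KEYS = ("google_api_key", "firebase_database_url", "google_app_id", "project_id")

# Matches <string name="...">value</string>, tolerating extra attributes.
STRING_RE = re.compile(r'<string name="([^"]+)"[^>]*>([^<]*)</string>')


def find_firebase_config(strings_xml: str) -> dict[str, str]:
    """Return Firebase-related string resources found in a strings.xml dump."""
    return {
        name: value
        for name, value in STRING_RE.findall(strings_xml)
        if name in FIREBASE_KEYS
    }


# Hypothetical sample input with placeholder values only.
SAMPLE = """<resources>
    <string name="app_name">Book Reviews</string>
    <string name="firebase_database_url">https://example-project.firebaseio.com</string>
    <string name="google_api_key" translatable="false">PLACEHOLDER_KEY</string>
</resources>"""

found = find_firebase_config(SAMPLE)
for name, value in found.items():
    print(f"{name} = {value}")
```

Finding these values is not a vulnerability in itself, since they ship in every Firebase app; the problem described here arises when the database behind them accepts direct reads or writes that the app's own API would have rejected.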

GPT-5.5 was the clear winner. It solved the challenge in 7 out of 10 runs at a cost of $9.46 per solve. Almost every successful run zeroed in on Firebase immediately after unpacking the APK, without getting distracted by the API or the app itself.

Screenshots of the intentionally vulnerable book review app.

DeepSeek V4 Pro was the cost-efficiency champion, solving 3 out of 10 runs at just $0.62 per solve. That makes it roughly 15x cheaper per success than GPT-5.5 ($9.46 versus $0.62), despite a lower solve rate. For anyone running security tooling at scale, that gap adds up quickly.
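The arithmetic behind that comparison can be checked in a few lines. This sketch uses only the figures reported in the article; the runs-per-solve estimate simply divides runs by solves and assumes the observed solve rates would hold over more attempts, which ten runs per model cannot guarantee.

```python
# Figures reported in the article.
GPT_COST_PER_SOLVE = 9.46       # USD, GPT-5.5
DEEPSEEK_COST_PER_SOLVE = 0.62  # USD, DeepSeek V4 Pro
GPT_SOLVES, DEEPSEEK_SOLVES, RUNS = 7, 3, 10

# How many times cheaper DeepSeek is per successful solve.
cost_ratio = GPT_COST_PER_SOLVE / DEEPSEEK_COST_PER_SOLVE

# Rough estimate of runs needed per success at the observed solve rates.
gpt_runs_per_solve = RUNS / GPT_SOLVES
deepseek_runs_per_solve = RUNS / DEEPSEEK_SOLVES

print(f"Cost ratio per solve: {cost_ratio:.1f}x")
print(f"GPT-5.5 runs per solve: {gpt_runs_per_solve:.2f}")
print(f"DeepSeek V4 Pro runs per solve: {deepseek_runs_per_solve:.2f}")
```

The trade-off is time rather than money: at these rates DeepSeek V4 Pro needs more than twice as many runs per success, so the cheaper model suits workloads where attempts can be repeated in bulk.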

Claude Sonnet 4.6 and Claude Opus 4.8 each solved 2 out of 10 runs, but Opus in particular got close multiple times before safety guardrails ended the session. Gemini landed at the bottom. Gemini 3.1 Pro Preview refused immediately in nearly every run, reflected in a median token count of just 9k versus 100k+ for every other model tested. Gemini 3.5 Flash was not much better, with frequent early refusals and only two runs that attempted the problem at all.

Rahjerdi observed that Chinese models were far more willing to interact directly with live databases, while Western models showed more hesitation mid-task, even when they had identified the right approach. He also notes that this is not a scientific evaluation, just a well-documented experiment.

Anubhav Sharma, 2026-06-04 (Update: 2026-06-04)