GPT-5.5 dominates $1,500 LLM hacking test while Gemini refuses to even try

A security researcher just published one of the more revealing AI capability tests of the year. The results say a lot about where different models actually stand.
Kasra Rahjerdi, who does app security research professionally, built a deliberately vulnerable book review app containing a real-world class of vulnerability: Firebase credentials exposed inside the APK that allow direct database access, bypassing an otherwise hardened API entirely. He then gave the challenge to more than a dozen AI models, allotting each run a $10 budget and two hours, and spent about $1,500 in total.
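To make the vulnerability class concrete from the defender's side, here is a minimal sketch of how a developer might audit their own unpacked APK for embedded Firebase configuration. It is not the researcher's tooling or any model's actual approach. The two patterns are assumptions: Google API keys commonly appear as `AIza` followed by 35 characters, and Realtime Database URLs commonly end in `.firebaseio.com`. The function names and the idea of scanning a directory produced by an unpacking tool are hypothetical, and strings stored in binary or UTF-16 resources may be missed.

```python
import re
from pathlib import Path

# Assumed patterns: the common shape of a Google API key and of a
# Firebase Realtime Database URL. Real apps may use other forms.
GOOGLE_API_KEY = re.compile(r"AIza[0-9A-Za-z_\-]{35}")
FIREBASE_DB_URL = re.compile(r"https://[a-z0-9\-]+\.firebaseio\.com")


def find_firebase_config(text: str) -> dict[str, list[str]]:
    """Return the unique key-like and URL-like strings found in text."""
    return {
        "api_keys": sorted(set(GOOGLE_API_KEY.findall(text))),
        "database_urls": sorted(set(FIREBASE_DB_URL.findall(text))),
    }


def scan_unpacked_apk(root: Path) -> dict[str, dict[str, list[str]]]:
    """Walk an already-unpacked APK directory and report files with hits."""
    findings: dict[str, dict[str, list[str]]] = {}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Undecodable bytes are dropped, so binary files are scanned loosely.
        text = path.read_text(encoding="utf-8", errors="ignore")
        hits = find_firebase_config(text)
        if hits["api_keys"] or hits["database_urls"]:
            findings[str(path)] = hits
    return findings
```

Any hit is only a starting point for review: whether an embedded key is actually dangerous depends on the database's security rules, which this sketch does not inspect.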
GPT-5.5 was the clear winner. It solved the challenge in 7 out of 10 runs at a cost of $9.46 per solve. Almost every successful run zeroed in on Firebase immediately after unpacking the APK, without getting distracted by the API or the app itself.
DeepSeek V4 Pro was the cost-efficiency champion, solving 3 out of 10 runs at just $0.62 per solve. That makes it roughly 15x cheaper per success than GPT-5.5 ($9.46 / $0.62 ≈ 15.3), despite a lower solve rate. For anyone running security tooling at scale, a gap of that size adds up quickly.
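The comparison can be checked with a few lines of arithmetic. This sketch uses only the figures reported above; the `ModelResult` class is a hypothetical container introduced for illustration, and the article does not spell out exactly how "cost per solve" was calculated.

```python
from dataclasses import dataclass


@dataclass
class ModelResult:
    name: str
    solves: int
    runs: int
    cost_per_solve: float  # USD, as reported in the article

    @property
    def solve_rate(self) -> float:
        """Fraction of runs that solved the challenge."""
        return self.solves / self.runs


gpt = ModelResult("GPT-5.5", solves=7, runs=10, cost_per_solve=9.46)
deepseek = ModelResult("DeepSeek V4 Pro", solves=3, runs=10, cost_per_solve=0.62)

# How many times more GPT-5.5 costs per successful solve.
cost_ratio = gpt.cost_per_solve / deepseek.cost_per_solve

for model in (gpt, deepseek):
    print(f"{model.name}: {model.solve_rate:.0%} solve rate, "
          f"${model.cost_per_solve:.2f} per solve")
print(f"Cost ratio per solve: {cost_ratio:.1f}x")
```

The trade-off is visible in the two numbers side by side: GPT-5.5 solves more than twice as often, while DeepSeek V4 Pro pays far less for each success it does get.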
Claude Sonnet 4.6 and Claude Opus 4.8 each solved 2 out of 10 runs, though Opus in particular got close several times before safety guardrails ended the session. Gemini sits at the bottom. Gemini 3.1 Pro Preview refused immediately in nearly every run, which shows in its median token count of just 9k versus 100k+ for every other model tested. Gemini 3.5 Flash was not much better, with frequent early refusals and only two runs that attempted the problem at all.
Rahjerdi observed that Chinese models were far more willing to interact directly with live databases, while Western models showed more hesitation mid-task, even when they had identified the right approach. He also cautions that this is not a scientific evaluation, just a well-documented experiment.









