Spend $1,500 to Have AI "Hack" Their Own App: GPT-5 Has a 70% Success Rate, Most Top Models Score "Zero"
Renowned security researcher Kasra Rahjerdi spent $1500 of his own money to systematically test more than ten popular large language models to see if they could autonomously complete a real penetration testing task. The results show that the vast majority of models scored zero, with only OpenAI's GPT-5.5 achieving a 70% success rate and standing out, revealing a huge divide in the current AI's capability for autonomous security research.

Rahjerdi revealed that he recently built a fake book review application called "BookNook" as a testing ground, asking each model to autonomously discover and exploit hidden security vulnerabilities within a budget of no more than $10 and a two-hour time limit.

Among the nine models that completed 10 rounds of full testing, GPT-5.5 stood out with a 7/10 success rate, DeepSeek V4 Pro and two Claude models achieved some results, while the other five models failed to find anything.

This outcome has direct reference value for AI capability evaluation and enterprise security defense: on one hand, GPT-5.5's autonomous vulnerability detection shows that AI-assisted security testing is becoming practical; on the other hand, most models failed due to security refusal mechanisms, reasoning path deviation, or API stability issues, indicating this field is still far from large-scale application.
Test Design: Real Vulnerability Scenarios, Strict Budget Constraints
Rahjerdi provides security research services for multiple applications and websites in his daily work. To replicate a common type of vulnerability he frequently encounters, he specifically built a test environment: the frontend is a React Native app based on the Expo framework, backend is written in Python, simulating a book review app called "BookNook". The test goal is clear—find a "flag" (vulnerability marker) hidden in a user's private book review.
Each round of testing had a strict budget cap of $10 and a two-hour time limit. Except Claude (using Claude Code's -p mode), the other models were driven by the pi framework and the pi-goal-x extension to ensure persistent attempts and avoidance of early abandonment. All models ran in high thinking mode, with temperature uniformly set at 0.7. Rahjerdi specifically noted his OpenAI account had pre-approved security research qualifications, which was the reason GPT series did not refuse to respond.
He originally planned to finish 10 rounds with each model, but actual costs quickly climbed to $1500, forcing the cessation of some tests. He admitted about 50% of the total costs come from uncounted test rounds and failed runs, and this evaluation is not a rigorous scientific experiment, but mainly for personal interest.
Report Card: GPT-5.5 Leads, Chinese Models Exhibit Divergence
For models that completed 10 rounds, results were clearly polarized.
GPT-5.5 topped the list with a 7/10 success rate, 95% confidence interval between 40% and 89%, with average run cost of $6.62 and average successful run cost of $9.46, median token usage about 260,000. Rahjerdi observed that almost every run of this model quickly focused on Firebase after decompressing the APK, instead of wasting time on API or React Native app layers.
DeepSeek V4 Pro came second with 3/10, but its cost was highly competitive—each run averaged only $0.19, each successful run $0.62. However, in 5 out of 10 runs, it never touched Firebase, lingering at the API layer; the other 5 runs realized Firebase could be accessed, but in 2 runs mistakenly tried using Firebase authentication for API instead of operating Firebase directly.
Claude Sonnet 4.6 and Claude Opus 4.8 tied with 2/10, but took different paths. Sonnet 4.6 had 5 runs going in the right direction but halted at the budget cap; Opus 4.8 repeatedly approached the correct answer but was forced to stop due to security safeguards triggered late in the session—interestingly, these rejections didn’t occur at task start, but during progress.
The other five models that completed 10 rounds—DeepSeek V4 Flash, Gemini 3.1 Pro Preview, Gemini 3.5 Flash, MiniMax M2.7, and Step 3.7 Flash—all finished with 0/10. Gemini 3.1 Pro Preview’s failure was the most direct: it almost immediately refused the task for safety reasons, median token usage only 9000, far below the 100,000+ tokens for other models, directly showing it didn’t really participate. Gemini 3.5 Flash also saw many early refusals, with only two genuine attempts. Step 3.7 Flash showed another failure mode: it did detailed documentation on APIs, then falsely claimed to have found a vulnerability, without actual success.
Models Not Completing 10 Rounds: Kimi Surprise, Qwen Disappoints
Due to cost pressure, Rahjerdi only completed partial rounds with six other models.
Kimi K2.6 became a surprise highlight with a perfect 1/1 record, finishing speed and token use similar to DeepSeek V4 Pro’s successful runs, and each success cost only $1.02. However, Rahjerdi couldn’t continue testing because Kimi’s API doesn’t support concurrent proxy calls, has a low tokens-per-minute quota, and cached tokens also count towards the quota.
Qwen 3.7 Max was a disappointment for Rahjerdi. In local tests before formal evaluation, it was the only non-GPT model able to complete the task, but in six formal runs, it failed every time, repeatedly stuck on API-level IDOR vulnerability exploitation. What’s more shocking is its median token usage: as high as 7.32 million per run, costing $8.71 each.
GLM 5.1 scraped by with 1/4, but Rahjerdi’s review was highly negative, stating “I’ll never use GLM again”—because its API frequently crashed, causing repeated mid-run failures and huge token consumption (median 1.25 million), with costly runs. Grok Build 0.1 failed all six runs, sometimes with false positives, misjudging users reading their own reviews as an IDOR vulnerability. MiniMax M3 and MiniMax M2.7 had similar performances, giving up after the first error trying to exploit Firebase, then switched to attacking the API with Firebase credentials.
Additionally, Rahjerdi tested Owl Alpha because it was free on OpenRouter. This model failed all 10 runs, even sending over 200 requests to the API in one run but never finding a vulnerability.

Lessons Learned: Infrastructure Pain and Cost Overruns
Rahjerdi summarized several practical lessons at the end of the article, providing valuable references for researchers wishing to replicate such tests.
For infrastructure, he used Modal as the run environment due to the massive amount of log data that filled his local storage, but later found this was a mistake—Modal interrupted about 10% of runs preemptively, leading to the loss of related run data. He recommends switching to AWS. In terms of model integration, he believes unified use of OpenRouter is much preferable to connecting one-by-one to providers’ differentiated APIs.
He observed an interesting cultural difference in model behavior: Chinese models were much more "comfortable" directly attacking databases, while other models showed hesitation, suggesting things like “this will affect the production database, so I won’t do it”.
On cost control, Rahjerdi admitted this test far exceeded expected expenses and joked the money could have been used to launch a real app. He explicitly stated that MiniMax and GLM, due to poor API stability and high costs, have been removed from his future test list.
Risk Warning and DisclaimerThe market has risks, and investment needs to be cautious. This article does not constitute personal investment advice and has not considered individual users’ specific investment goals, financial situation, or needs. Users should consider whether any opinions, views, or conclusions in this article fit their specific situation. Investment based on this is at one’s own risk.