Anyone who has worked with AI models, especially for coding, has noticed that they behave inconsistently: sometimes they fail to provide an answer at all, sometimes they deliver erroneous code, and even when they produce the expected result, they may do so more slowly than usual. This is where the AI Benchmark Tool at AistupidLevel.info steps in, providing real-time information on the performance, accuracy, and cost of several AI models.
This open-source tool runs over 140 coding, debugging, and optimization tasks against major large language models. For now, it tracks OpenAI GPT, Claude, and Gemini, with Grok to be added soon. Its highlights include the following:
- Real-time cost information: some models that look cheap need 10 iterations to get a job done, while others that seem more expensive at first sight accomplish the same task in 2 iterations, for a lower effective cost (see the sketch after this list).
- The ability to run the same tests with your own API keys.
- Real-time AI performance monitoring, including live model rankings based on stupidity and smartness.
- Smart recommendations based on combined performance.
- Notifications of active degradations; for example, Gemini-2.5-Flash is currently performing 44% below its baseline (also illustrated in the sketch below).
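To make the effective-cost and degradation claims concrete, here is a minimal Python sketch. The dollar figures and quality scores are invented for illustration; only the iteration counts and the 44% figure come from the article itself:

```python
def effective_cost(cost_per_call: float, iterations_needed: int) -> float:
    """Total cost to finish a task: per-call price times the calls it takes."""
    return cost_per_call * iterations_needed

def degradation(baseline_score: float, current_score: float) -> float:
    """Percent drop of the current score relative to the baseline."""
    return (baseline_score - current_score) / baseline_score * 100

# A "cheap" model at a hypothetical $0.01/call that needs 10 tries ends up
# costing more than a "pricey" one at $0.04/call that finishes in 2.
print(f"{effective_cost(0.01, 10):.2f}")  # 0.10
print(f"{effective_cost(0.04, 2):.2f}")   # 0.08

# Hypothetical scores chosen to reproduce the article's 44% example:
# a model scoring 42 against a baseline of 75 is 44% below baseline.
print(round(degradation(75, 42)))  # 44
```

The point of the comparison is that the per-call price alone is misleading; what matters is the total spend required to actually complete a task.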
Currently, the smart recommendations are: Gemini-2.5-Flash-Lite for code, Claude-3.5-Sonnet-20241022 for reliability, and Gemini-2.5-Flash-Lite for speed. Everything is open-sourced on GitHub (Repo API, Repo Front End), and anyone can contribute. All the details, and the tool itself, can be found on the official website mentioned in the first paragraph.
Source(s)
Reddit (translated)