Humans can easily outsmart AI according to Apple-funded study
Earlier this month, a team of six AI scientists backed by Apple published a study introducing GSM-Symbolic, a new AI benchmark that "enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models." Sadly, the initial tests conducted with GSM-Symbolic on AI engines from industry icons such as Meta and OpenAI revealed that LLMs are still severely limited and lack even the most basic reasoning capabilities.
The problem with the existing models, as uncovered by these tests, lies in how unreliable LLMs become when faced with near-identical queries. The study found that slight wording changes that would not alter the meaning of a query for a human often lead to different answers from AI bots, and no model stood out as an exception.
"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark,"
the researchers concluded, also finding that
"the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."
The 22-page study can be found here (PDF file). The last two pages contain problems with irrelevant information appended at the end, details that should not change the final result for a human solver. However, the AI models tested took these parts into account as well, and delivered wrong answers as a result.
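As a rough illustration of that failure mode (the wording below is our own, not a problem taken from the study), the padded version adds a detail that has no bearing on the arithmetic, so the expected answer stays the same for a human solver:

```python
# Hypothetical example of an "irrelevant clause" perturbation: the extra
# sentence is numerically irrelevant, yet the study reports that models often
# work the added number into their calculation and get the answer wrong.
base = ("Mia picks 12 apples on Monday and 30 apples on Tuesday. "
        "How many apples does Mia pick in total?")
padded = ("Mia picks 12 apples on Monday and 30 apples on Tuesday, "
          "though 5 of Tuesday's apples are slightly smaller than average. "
          "How many apples does Mia pick in total?")

expected = 12 + 30  # 42 for both versions; the size detail is a no-op
print(expected)
```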
In conclusion, AI models are still unable to move beyond pattern recognition and still lack generalizable problem-solving capabilities. This year, quite a few LLMs were unveiled, including Meta AI's Llama 3.1, Nvidia's Nemotron-4, Anthropic's Claude 3, Japan's Fugaku-LLM (the largest model ever trained exclusively on CPU power), and Nova by Rubik's AI, a family of LLMs announced earlier this month.
Tomorrow, O'Reilly will release the first edition of Hands-On Large Language Models: Language Understanding and Generation, by Jay Alammar and Maarten Grootendorst. It is priced at $48.99 for the Kindle edition or $59.13 for the paperback.