
Humans can easily outsmart AI according to Apple-funded study

Humans vs AI (Image source: Generated using DALL·E 3)
Although they often deliver impressive results, AI engines built on large language models, such as those from Meta and OpenAI, still lack basic reasoning capabilities. A group of researchers backed by Apple has proposed a new benchmark, which already revealed that even the slightest wording changes in a query can lead to completely different answers.

Earlier this month, a team of six AI scientists backed by Apple published a study in which they introduced GSM-Symbolic, a new AI benchmark that "enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models." Sadly, the initial tests conducted with GSM-Symbolic on the AI engines of industry icons such as Meta and OpenAI revealed that LLMs are still severely limited and lack the most basic reasoning capabilities.

The problem with the existing models, as uncovered by the aforementioned tests, lies in the lack of reliability of LLMs when subjected to similar queries. The study concluded that slight wording changes that would not alter the meaning of a query to a human often lead to different answers from AI bots. The research did not highlight any model that performed notably better than the rest.

"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark,"

concluded the research, also discovering that

"the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."

The 22-page study can be found here (PDF file). The last two pages contain problems with some irrelevant information added at the end, which should not alter the final result for a human solving them. However, the AI models also took these parts into account, thus delivering wrong answers.
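The paper refers to this set of distractor problems as GSM-NoOp. As a rough, hypothetical illustration of the idea (again, not the authors' code), an irrelevant clause can be appended to a templated question without changing the correct answer:

```python
# Hypothetical sketch of appending an irrelevant ("no-op") clause, in the spirit of
# the study's GSM-NoOp set; the added sentence changes nothing mathematically.
DISTRACTORS = [
    "Five of the apples are slightly smaller than the others.",
    "The apples were bought at a market that opens at 9 a.m.",
]

def add_noop(question: str, distractor: str) -> str:
    """Insert an irrelevant statement just before the final question sentence."""
    body, final_question = question.rsplit(". ", 1)
    return f"{body}. {distractor} {final_question}"

q = ("Sophia picks 12 apples on Monday and 9 apples on Tuesday. "
     "Sophia then gives away 4 apples. How many apples are left?")
print(add_noop(q, DISTRACTORS[0]))
# The correct answer is still 17, but the study found that models often let
# such irrelevant details change their final answer.
```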

In conclusion, AI models are still unable to move beyond pattern recognition and still lack generalizable problem-solving capabilities. This year, quite a few LLMs were unveiled, including Meta AI's Llama 3.1, Nvidia's Nemotron-4, Anthropic's Claude 3, the Japanese Fugaku-LLM (the largest model ever trained exclusively on CPU power), and Nova, a family of LLMs by Rubik's AI that was unveiled earlier this month.

Tomorrow, O'Reilly will release the first edition of Hands-On Large Language Models: Language Understanding and Generation, by Jay Alammar and Maarten Grootendorst. It is priced at $48.99 (Kindle) or $59.13 (paperback).


Codrut Nistor, 2024-10-14 (Update: 2024-10-14)