Has Apple proven that large language models are incapable of logical reasoning?
I don't think so.
Apple researchers recently published GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models, which they claim provides evidence that large language models (LLMs) can't perform logical reasoning. Their central hypothesis:

We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer.
The findings are interesting, but the authors failed to consider alternative hypotheses. Instead of showing that LLMs can't reason, they may have shown that LLMs' logical reasoning is sometimes flawed.
So, what were their findings, and why do I disagree with their conclusions?
GSM-Symbolic
GSM8K is a popular benchmark for evaluating LLMs' mathematical reasoning capabilities. The dataset comprises roughly 8,000 grade-school math questions and answers. However, many researchers fear that GSM8K has leaked into LLMs' training data, which would be like showing students the exam questions and answers before the exam.
Mirzadeh et al. (the Apple researchers) evaluate what happens when you make small changes to the GSM8K questions. They create question-and-answer templates that let them effortlessly swap out the names and numbers in each question.
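To give a sense of how such templating works, here is a minimal sketch in Python. The template, names, and numbers are my own hypothetical example, not one of the paper's actual templates:

```python
import random

# A GSM-Symbolic-style template: the structure of the question is fixed,
# while the name and the numbers are placeholders that can be re-sampled.
TEMPLATE = (
    "{name} has {x} apples. {name} buys {y} more apples. "
    "How many apples does {name} have now?"
)

def instantiate(seed: int) -> tuple[str, int]:
    """Fill the template with random values and return (question, answer)."""
    rng = random.Random(seed)
    name = rng.choice(["Sophie", "Jack", "Maria", "Omar"])
    x, y = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, x=x, y=y), x + y

if __name__ == "__main__":
    for seed in range(3):
        question, answer = instantiate(seed)
        print(question, "->", answer)
```

Every instantiation tests the same underlying reasoning problem, so a model that genuinely reasons should score roughly the same on all of them.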
LLMs are sensitive to meaningless changes.
Below are the results when the names (e.g., Sophie, Jack, etc.) and numbers are varied. Each of the six plots shows the result for a specific LLM. The dotted black vertical line indicates an LLM’s performance on the original GSM8K benchmark.
Changing the names impacts performance, and changing the numbers impacts it even more. Changing the names of the characters in a math question should naturally not affect the answer. But is being affected by such irrelevant changes evidence that the models are incapable of reasoning?
It could, but it could also indicate that their logical reasoning is flawed. For example, LLMs are known to prefer frequent tokens over infrequent ones. If some of the names or numbers in a template are more common than others, that may impact performance. How each model tokenizes the numbers may also matter. Is it stupid? Yes. Does it mean no reasoning? No.
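To illustrate the tokenization point, here is a small sketch using tiktoken (one of OpenAI's tokenizer libraries, chosen purely as an example; other models split numbers differently):

```python
import tiktoken  # pip install tiktoken

# How a number is written changes how it is split into tokens,
# which changes the sequence the model actually has to reason over.
enc = tiktoken.get_encoding("cl100k_base")

for number in ["7", "42", "1234", "123456789"]:
    tokens = enc.encode(number)
    print(f"{number!r} -> {len(tokens)} token(s): {tokens}")
```

Two numbers that are interchangeable on paper can look quite different to the model once tokenized.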
LLMs are worse at harder questions.
Next, they evaluated the effect of changing the questions' difficulty. They created three additional sets of templates that vary the difficulty: M1 (one clause removed), P1 (one clause added), and P2 (two clauses added).
They found that as the difficulty increases, performance decreases and variance increases. Again, is this proof that LLMs can't reason? Nope. You can be capable of logical reasoning and still struggle with hard questions. Also, the more tokens a model has to generate to answer a question, the more likely errors become. It's quite simple: every predicted token is another chance to make a mistake, and those chances compound over a long answer.
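Here is a back-of-the-envelope sketch of that compounding effect. The per-token accuracy is a made-up number, and assuming independent errors is a simplification, but it shows how a tiny per-token error rate adds up over long answers:

```python
# If each generated token is correct independently with probability p,
# the chance of an entirely error-free answer of n tokens is p**n.
p = 0.999  # hypothetical per-token accuracy

for n in [50, 200, 1000]:
    print(f"{n:>5} tokens -> P(no mistakes) = {p**n:.3f}")
```

Longer, harder questions require longer reasoning chains, so lower accuracy and higher variance are exactly what you would expect even from a reasoner that is merely imperfect.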
Trick questions trick LLMs
They create a template to test how inserting “irrelevant” sentences affects LLMs. In the example below, they add “but five of them were a bit smaller than average,” from which the LLMs incorrectly infer that those kiwis should be subtracted.
The figure above depicts the performance drop when inserting these irrelevant sentences. Mirzadeh et al. claim this is evidence of a lack of logical reasoning.
I completely disagree. LLMs are trained on a vast amount of text, and most of it was not written by mathematicians. Most people are not as precise as mathematicians; they often imply things without stating them explicitly. Understanding imperfectly formulated questions is the desired LLM behavior; otherwise, LLMs would misunderstand too many of the questions posed by non-mathematicians.
So, no, this is not evidence that LLMs can’t reason. They are just trained to infer information from imprecise questions.
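To make the two readings concrete, here is a hypothetical kiwi-style question with made-up numbers (not the paper's). The benchmark scores the literal reading as correct, while the models tend to produce the inferred reading:

```python
# "Alex picks 40 kiwis on Friday and 30 on Saturday,
#  but five of them were a bit smaller than average.
#  How many kiwis does Alex have?"
friday, saturday, smaller = 40, 30, 5

literal_answer = friday + saturday             # 70: size doesn't change the count
inferred_answer = friday + saturday - smaller  # 65: reading "smaller" as "don't count them"

print("Literal reading:", literal_answer)
print("Inferred reading:", inferred_answer)
```

Whether the second answer is a reasoning failure or a charitable interpretation of a sloppily worded question is exactly the point of disagreement.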
Summary
While the experiments in this paper are excellent, the authors failed to consider alternative hypotheses when analyzing their findings. It seems they had decided on their conclusion beforehand, i.e., confirmation bias. Don't get me wrong: their hypothesis could be correct. I just wish they had done a better job of considering other explanations.