Do Large Language Models "reason"?
There is a lot of debate about the "cognitive" capabilities of LLMs and LLM-based chatbots, like ChatGPT. It's common to see statements like "these models just apply statistical pattern matching" and "they have no concept of the world." On the other hand, they are clearly able to follow simple instructions and manipulate things like code quite effectively.
Is there currently a scientific consensus on whether large language models are capable of reasoning? I'm looking for hard science, backed up by theory or experiment, not simple assertions. If there is no consensus, what are the main results pointing in the different directions?
This most likely depends on how "reasoning" is defined, so I'm interested in answers under any specific definition of reasoning.
It also depends on the model, of course. I'm interested in whether LLMs are capable of reasoning in principle, rather than on average. That is, if most LLMs don't reason, but one particular model does (because of, say, the amount of training data), then the answer is "yes".
2 answers
I am afraid that the answer to your question is actually quite simple:
there is no consensus among scientists.
This is illustrated by the fact that the "fathers of Deep Learning" do not entirely agree on the current state of AI:
- Geoffrey Hinton strongly believes that human-level/human-like AI is possible and near. This is illustrated by the warnings he issued (e.g. in this interview on YouTube) as he left Google.
- Yoshua Bengio has recently switched his research focus towards AI safety, arguing that we should prepare ourselves for when general AI arrives (e.g. this recent Science article). In a talk at an ICLR 2024 Workshop on AGI, he explained that he did not expect LLMs to solve the tasks that they solve today. He also made clear that he does not necessarily believe that AGI is around the corner, but it might come unexpectedly fast (cf. LLMs) and we should be prepared.
- Yann LeCun is convinced that there are still critical components missing to build human-level intelligent machines (e.g. as expressed in this Time interview).
Although these discussions mostly focus on "Artificial General Intelligence (AGI)", which I would consider to be a generalisation of the question on reasoning, I think the ideas would transfer reasonably well to the question at hand. For the sake of clarity, this is how I would assume each of them answers the question of reasoning:
- Geoffrey Hinton would argue LLMs are able to reason.
- Yoshua Bengio would argue that we don't know, but we should consider it a possibility.
- Yann LeCun would argue that LLMs cannot reason.
Although these opinions are backed by decades of scientific experience, they are not actually backed by any hard science (as far as I am aware). The main reason for this is that there are no clear, workable definitions of, or tests for, reasoning and/or AGI (this was also one of the main outcomes of the AGI workshop at ICLR 2024).
However, there have been some attempts to move in this direction.
- The most well-known benchmark for reasoning and human-level intelligence (to me) is probably the Abstraction and Reasoning Corpus (ARC) from Francois Chollet. However, I never read the paper behind it and I have not followed up on any discussions or critiques concerning this benchmark. I assume that it has some foundations, but I am a little sceptical about whether a single person would be able to solve a problem that has been studied for thousands of years.
- There are plenty of papers that have studied the reasoning capabilities of language models at different levels. I stumbled upon this survey paper and believe it provides a nice overview. It lists a few arguments in favour of reasoning capabilities (e.g. chain-of-thought prompting; a minimal illustration follows after this list) and why they cannot be taken as "proof" of reasoning capabilities (e.g. hallucinations in one or more steps). This survey also concludes that the question of whether language models can reason remains open.
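To make the chain-of-thought argument concrete, here is a minimal, hypothetical illustration of the prompting technique such surveys discuss; the prompt text is invented for illustration and is not taken from the survey:

```python
# A hypothetical chain-of-thought prompt: the model first sees a worked example
# with explicit intermediate steps, then a new question in the same style.
prompt = (
    "Q: Roger has 5 balls and buys 2 cans of 3 balls each. How many balls does he have?\n"
    "A: He starts with 5. The 2 cans hold 2 * 3 = 6 balls. 5 + 6 = 11. The answer is 11.\n"
    "Q: A library holds 120 books and lends out 45. How many books remain?\n"
    "A:"
)
# Sent to an LLM, prompts in this style tend to elicit step-by-step completions
# ("120 - 45 = 75. The answer is 75."), which is the behaviour debated as
# evidence for, or against, genuine reasoning.
```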
Finally, for full disclosure, let me share the reasoning behind my personal bias towards a negative answer to your question. From a technical perspective, it could be argued that language models (and other models marketed as generative AI) are just some form of Random "Number" Generator (RNG). The most successful language models are believed to be trained on high-quality data like books. A significant amount of this data will be educational or technical, and therefore consist of language related to reasoning. This kind of language must therefore have a high probability (density) in the distribution of the data. As a result, a good RNG, i.e. one that produces outputs with the same distribution as its training data, should also produce a significant amount of reasoning-like language. At least for me, this explains most of the phenomena related to the reasoning capabilities of language models. Of course, we do not know what data the large language models are trained on, so it is impossible to scientifically verify this kind of claim. It could also be argued that this is exactly how human reasoning works. However, I like to believe that there is more to reasoning than just mimicking behaviour.
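To make the "RNG" analogy above concrete, here is a minimal sketch in Python. The bigram table is invented for illustration and stands in for a trained model; a real LLM conditions on long contexts with billions of parameters, but the sampling step is conceptually the same:

```python
import random

# Toy "language model": a hand-made table of conditional next-token probabilities.
# The tokens and numbers are invented for illustration only.
BIGRAMS = {
    "<start>":   {"therefore": 0.4, "the": 0.6},
    "therefore": {"the": 0.7, "we": 0.3},
    "the":       {"answer": 0.5, "premise": 0.5},
    "we":        {"conclude": 1.0},
    "answer":    {"<end>": 1.0},
    "premise":   {"<end>": 1.0},
    "conclude":  {"<end>": 1.0},
}

def sample_next(token: str) -> str:
    """Draw the next token from the model's conditional distribution."""
    dist = BIGRAMS[token]
    return random.choices(list(dist), weights=list(dist.values()))[0]

def generate() -> str:
    token, output = "<start>", []
    while True:
        token = sample_next(token)
        if token == "<end>":
            return " ".join(output)
        output.append(token)

print(generate())  # e.g. "therefore we conclude"
```

Nothing in this loop "understands" its output; it only reproduces the statistics of the table. If reasoning-like text dominates the training distribution, reasoning-like text dominates the samples, which is the whole point of the analogy.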
PS: Sorry for the long reply
I'll note that this won't fully answer the question, because I don't know the modern academic scene well enough to provide a "consensus." Maybe that makes for a bad first answer here, and for that, I apologize. However, I'll work from fairly traditional computer science theory. I also can't provide a complete answer because, as you note, it all depends on a key definition, though I can narrow down how that definition needs to look.
The Basics
First, we classify modern computer hardware as (essentially) Turing complete. In other words, subject to material limitations such as limited storage, any general-purpose computer can execute any algorithm, which theoreticians represent as computable functions. Turing completeness appears to represent an upper limit for computational power, in that parallel processors, networked computers, multiple I/O streams, and any other hardware that you might care to add to a system will never increase the classes of algorithms that it can execute, only the efficiency with which it can run them.
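As a small, hedged illustration of what Turing completeness buys in practice, here is a sketch of a tiny Turing-machine simulator; the machine and encoding are my own, chosen for illustration. Any general-purpose computer can host something like this, which is the sense in which extra hardware improves efficiency but not the class of algorithms it can run:

```python
# Minimal Turing-machine simulator (illustrative only).
# A machine is a transition table: (state, symbol) -> (new state, symbol to write, move).

def run(transitions, tape, state="start", blank="_", max_steps=10_000):
    cells = dict(enumerate(tape))              # sparse tape: position -> symbol
    head = 0
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = cells.get(head, blank)
        state, write, move = transitions[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells)).strip(blank)

# Example machine: increment a binary number (scan right, then add 1 with carry).
INCREMENT = {
    ("start", "0"): ("start", "0", "R"),
    ("start", "1"): ("start", "1", "R"),
    ("start", "_"): ("carry", "_", "L"),
    ("carry", "1"): ("carry", "0", "L"),
    ("carry", "0"): ("halt",  "1", "R"),
    ("carry", "_"): ("halt",  "1", "R"),
}

print(run(INCREMENT, "1011"))  # prints "1100": 11 + 1 = 12 in binary
```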
In addition, software can't make hardware do anything that it can't do already, because it only kicks off existing instructions. If you don't have circuits that can generate truly random numbers, to take a typical example, no amount of pseudo-random algorithmic work will give you actual randomness.
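A concrete way to see that point, using only the standard library: a seeded pseudo-random generator is a deterministic algorithm and replays the same sequence, while `os.urandom` asks the operating system for entropy gathered from hardware events:

```python
import os
import random

# A pseudo-random generator is just an algorithm: the same seed always
# produces the same sequence, on any machine.
random.seed(42)
first = [random.random() for _ in range(3)]
random.seed(42)
second = [random.random() for _ in range(3)]
print(first == second)      # True: the software created no new randomness

# os.urandom draws on entropy the OS collects from hardware; its output
# cannot be reproduced by re-running the program with some seed.
print(os.urandom(8).hex())
print(os.urandom(8).hex())  # (almost certainly) different on every call
```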
Likewise, as much as industry pundits insist that adding enough hardware will provide the opportunity for the emergence of intelligence in their system, you (by definition, really) can't predict emergence, and certainly can't predict what will emerge. Maybe they'll get the Frosty the Snowman that they all seem to envision, or maybe they'll get a pattern that'll make wild-looking wallpaper.
Complexity
Now, we get to the problem with definitions. A language model (as it exists) can't do anything that the hardware running it can't do, and the computer running it only has the capabilities of any Turing complete system.
To answer the question, then, we need to know if "reason" happens algorithmically. Or to generalize that question, we need to know which complexity class reason falls into, if any.
If reasoning falls into EXPSPACE, or any other class of decidable problems, then computers can reason, meaning that algorithms can reason, meaning that certain language models could reason. If it falls outside the realm of computable functions altogether, then they cannot. Given the known EXPSPACE-complete problems, and knowing that completeness means every problem in the class reduces to a complete one, I have a feeling about how that question gets answered, but I don't know of anyone who has answered it with any degree of credibility, in the thousands of years that people have tried to model intelligence and decision-making.
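For reference, the textbook containment chain for the classes involved is (every class shown contains only decidable problems, and decidable problems also exist beyond EXPSPACE):

$$\mathsf{P} \subseteq \mathsf{NP} \subseteq \mathsf{PSPACE} \subseteq \mathsf{EXPTIME} \subseteq \mathsf{EXPSPACE} \subseteq \text{decidable}$$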
Consequences
As mentioned, I won't say that language models definitely can or can't reason, because of that gap in definitions. But I will say that it'll require a massive leap forward for mathematics to either make it happen or confirm that it happens. As I say, we have thousands of years of philosophers, logicians, mathematicians, psychologists, and other thinkers and researchers trying to decipher how thinking works, and none of them have come up with a plausible model, in all that time.
Either we can compute reasoning, in which case all computers have always had the capability to reason at the hardware level, or we can't and they don't. The people arguing for emergent intelligence assume the latter, and suggest that it doesn't matter, because some non-mathematical force will make it happen anyway. (🎶 There must have been some magic in that old silk hat they found... 🎶)
However, consider the practical effects beyond an exotic jump forward in math. If a computer can simulate or directly engage in reasoning, then we can write algorithms to do the same, without the overhead of simulating millions or billions of tiny computers, the abstract neurons. That not only means the ability to "outsource" reasoning to software, but the ability to do so on paper, because a person can follow an algorithm with pencil and paper.
It may also mean, depending on what reasoning encompasses and whether we humans can do more than that, that we could simulate entire personalities, again in code or on paper, and would need to deal with the politics of that. And I'd call those the immediate effects.