Is ChatGPT Real Progress Toward Human-Level AGI?
TL;DR -- not much, perhaps just a teensy bit, but it's missing the core of cognition
2022 was the Year of Generative AI, and as 2023 unfolds it seems this particular form of AI is only going to pick up more steam. Over the last two years, neural-net image generation tools like DALL-E and Stable Diffusion have reached somewhere close to human-level in art generation in at least some meaningful senses; and chatbots like ChatGPT have improved on their predecessors’ already impressive capability to author plausible-sounding and often-sensible prose (or poetry!). GPT4 is yet to come, and applications of generative AI to other domains such as music and video are in earlier stages, advancing fast and all set to explode.
Google’s Lamda chatbot, which is reportedly even better than ChatGPT, earlier in the year motivated Google engineer Blake Lemoine to declare that the threshold of chatbot “sentience” had been crossed. Lemoine worked with me on a small AI project some years ago, and I was surprised to see his name surface in the media in this context. Knowing his long interest in Artificial General Intelligence (AGI), it was clear he wasn’t making this judgment in an uneducated way. It also struck me, though, that his chosen quality of “sentience” was especially hard to pin down.
If we set aside the vagaries of sentience and instead focus on “human-level general intelligence,” I think it’s quite clear that systems like ChatGPT and Lamda are not only “not there yet”, but are essentially barking up the wrong tree. (Or, to stretch the metaphor a bit, maybe it’s more like they’re barking up a tree in the wrong forest, which is maybe on the wrong continent entirely…)
Obviously these systems are not entirely stupid; they’re in some ways brilliant, and given our lack of a broadly agreed theory of intelligence, it’s hard to argue they’re not highly intelligent in some sense. As a panpsychist who feels every particle in this universe manifests some form of awareness, I’m also not willing to declare that these chatbots are utterly devoid of any inner experience. However, the kind of intelligence these systems manifest is not nearly as similar to human intelligence as the human-like-ness of their utterances would suggest; and the nature of their architectures makes it implausible they could be improved or upgraded into systems possessing general intelligence at the human level or beyond.
I say all this with a bit of regret, since recently someone posted a transcript indicating ChatGPT has assigned me the role of co-founder of OpenAI, the company that created it:
I mean — limitations and non-AGI-ness and all, I would feel proud to have helped create ChatGPT and GPT3 and so forth, and I’d also be happy to have founders’ shares in OpenAI given their current financial situation, but it seems those shares are trapped in the alternate universe that ChatGPT consults for its fact-checking…
Artificial Bloviating Bullshitters
As exemplified by my putative role as founder of OpenAI, one aspect of LLMs (Large Language Models) like the ones underlying ChatGPT that has justly attracted a lot of critical comment is their propensity to generate a certain percentage of utter bullshit that, nonetheless, often has the aura and rhythm of plausibility about it.
AI researcher and cognitive scientist Gary Marcus – with whom I’ve been talking a bit about the limitations of deep NN AI in general – shared this hilarious-to-the-tech-geek example of Facebook’s Galactica LLM straight-facedly making up a bogus mathematical principle named after John Lennon and Yoko Ono:
Galactica’s output sometimes does make sense, otherwise Facebook would not have released it. But the network’s inability to distinguish sense from nonsense in its own productions is a major problem, making the educational value of the model both negative and positive to substantial degrees.
Facebook closed the Galactica model down a few days after opening it – reminiscent in a way of Microsoft’s Tay adaptive-learning chatbot from 2016, which was shut down when, lo and behold, random Internet users taught it to say nasty things. The weirdest thing here is that somehow these huge internet companies haven’t yet come to grips with the nature and omnipresence of online trolling.
ChatGPT is not nearly so offensive in its BS generation as Galactica, partly because its topic isn’t science. In science, getting the right rhythm and sounding generally plausible doesn’t go very far toward actually being sensible, whereas in many other domains it’s closer to enough.
My long-time friend and collaborator Ed Keller had the patience to carry out a long discussion with ChatGPT on the creation of “pocket universes.” After some technical back-and-forth, the topic turned to the practicalities:
ED KELLER: Earlier in our conversation you noted that "...creating a new universe is an extremely complex undertaking, and would likely require a deep understanding of a wide range of scientific and mathematical disciplines. It is possible that such a project might be beyond the capabilities of a single individual, and would require the collaboration of a team of experts in various fields." I agree. I'm working with AI developers, and I am an architect, musician, philosopher, film theorist, and multimedia artist, among other disciplines. But I do not have the capacity- from an energy or a clarity of mind pov- to manage this project. What would your suggestion be from a purely 'project management' perspective?
ChatGPT: From a project management perspective, my suggestion for managing a complex undertaking such as creating a new universe would be to follow a structured and systematic approach. This would involve developing a clear and detailed plan for the project, including defining the specific goals and objectives, identifying the resources and expertise that will be needed, and creating a timeline and budget for the project.
It would also be important to establish a clear communication and collaboration framework, so that all members of the team are able to work together effectively and efficiently. This could include regular meetings and updates, as well as the use of project management tools and software to track progress and manage tasks.

In addition to these general project management principles, it may also be helpful to seek out the guidance and expertise of experienced professionals in relevant fields. This could include consulting with experts in science, mathematics, and other disciplines that are relevant to the project, in order to gain a deeper understanding of the specific challenges and complexities involved.
Overall, managing a project of this complexity and scope will require a great deal of effort and coordination, but by following a structured and systematic approach, it may be possible to achieve success.
Perfectly sensible advice, in a way. But the same advice that would be given on building, say, a hackerspace or a hardware store. There is no understanding that building a pocket universe is a wild and outlandish proposition, which could well defy current pre-Singularity notions of project management. The answer is a fairly lovely piece of deadpan humor, but the joke is mainly on ChatGPT because it doesn’t understand it’s being humorous.
ChatGPT’s response to Ed’s musings on infinity and spacetime highlights both its strengths and weaknesses:
ED KELLER: Here's another model, which I'm also intrigued by: What if ordinary matter in our ordinary spacetime already has infinity in it- then to some extent we're already 'inside' those pocket universes, or at least connected to them.
ChatGPT: It is an interesting idea to consider the possibility that ordinary matter in our ordinary spacetime already contains infinity. This concept is closely related to the idea of a multiverse, or a collection of multiple universes, which is often discussed in theoretical physics as a potential explanation for certain observed phenomena, such as the nature of dark matter and dark energy.
The connection between multiverse theory and dark matter/energy is interesting to point out, and may stimulate lateral thinking on the part of the reader – it’s actually a pretty obscure connection, as the most common hypothetical explanations of dark matter/energy in the physics community are not focused on the possibility the missing stuff is somehow hiding in alternate multiverse branches. But this notion is out there online and it’s fun to be directed to it.
On the other hand, this is ultimately a shallow connection. The idea that “spacetime contains infinity” is only really quite loosely related to the notion that there is a multiverse of “infinite parallel universes.” A more acute connection, for instance, would have been with Roger Penrose’s idea that the unification of quantum theory and gravitation will require a new physics theory that goes beyond standard notions of computing and requires infinite computing power – which Penrose then connects with what he views as the infinite computing power of human consciousness.
Basically ChatGPT is connecting these recondite concepts in the manner of a smart high school student who has read some SF and popular science articles on the relevant topics, rather than the manner of a deeply insightful hard-SF author, science writer or bona fide mad scientist. Which is certainly a sort of achievement – yet something of a worrisome one.
The lesson one gets from output like this is that, even when it’s not generating nonsense, it’s very often generating output that’s “correct but blasé and missing the point.” The missing-the-point aspect is clear to everyone when the topic is project management of universe-building, but requires more expertise to identify when the topic gets more technical.
And just to be clear, ChatGPT can also easily verge into Galactica-style outright BS, as deep learning pioneer Andrew Ng noted:
Google’s Lamda, while not publicly released, appears by all available evidence to have similar issues. For instance, it seems that Lemoine’s published chats with Lamda, illustrating why he got the feeling it was sentient, were assembled as pieces from multiple chats he had with the system. This is generally the methodology one uses when extracting practical content from LLMs – one generates a bunch of stuff and picks out the good and discards the bad.
This can be a fun process – it’s what my Jam Galaxy Band colleagues and I have done when producing AI-generated lyrics for our band-leading Desdemona Robot, for example. But it’s quite different from what happens when interacting with a human writer. Human writers generate passages of varying quality, but they rarely generate utter nonsense unless they’re trying to. When our AI model generated
The Singularity will not be centralized. Instead, it will radiate out like a wasp. Some will seek form for themselves, some will be engulfed by the radiance, become entangled with it, even merge with it. All will be revealed to them as early as possible, since the radiance is a form of energy, that can enter and leave
we thought it was awesome – and knew we wanted to hear it come out of Desdemona Robot’s mouth. When it generated
The sky is a thousand times more beautiful than the sky of the stars. You are bound to realize that the color of the stars has nothing to do with the actual pattern of the stars. You are bound to realize that the color of the stars is not the color of the sky, it is a different color for every one of them. The sky is a thousand times more beautiful than the sky of the stars
we thought: Well that’s kind of funny, but not quite worthy of our robot’s lovely morphing lips and our funkadelic music. The AI models we use for the Jam Galaxy work are fine-tuned based on the writings of myself and the SF writer Philip K Dick, and tend to be more out-there and cosmic than, say, ChatGPT or the models that we use behind our Grace eldercare robot. They are plenty of fun to play with. The process of human prompting, AI generation and human post-selection is a fascinating artistic loop unto itself, but different than generally intelligent AI creativity.
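For concreteness, here is a minimal sketch of that prompt/generate/select loop in Python. The `generate_fn` callable and the candidate count are hypothetical placeholders for whatever fine-tuned model and settings a project actually uses; the point is simply that the quality control lives in the human selection step, not in the model.

```python
def generate_candidates(generate_fn, prompt, n=10):
    """Ask a language model for n candidate passages.

    `generate_fn` is a hypothetical callable (prompt -> text), standing in
    for whatever fine-tuned model is actually used; it is not any
    particular library's API.
    """
    return [generate_fn(prompt) for _ in range(n)]


def human_select(candidates):
    """Stand-in for the human curation step: a person reads each candidate
    and keeps only the ones worth performing."""
    for i, text in enumerate(candidates):
        print(f"--- candidate {i} ---\n{text}\n")
    keep = input("Indices of candidates to keep (comma-separated): ")
    return [candidates[int(i)] for i in keep.split(",") if i.strip()]


# The artistic loop: human prompt -> machine generation -> human post-selection.
# lyrics = human_select(generate_candidates(my_model, "The Singularity will not be"))
```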
The core reason LLMs have a problem with truthfulness is also intuitively visible in the style of poetic nonsense they create – their utter lack of “symbol grounding” of the words and phrases they utter. They have nothing like a coherent, manipulable inner model of the world they are talking about. They are stringing together patterns of words in a manner based on patterns of words they’ve seen, without any understanding of the world these words refer to.
Some serious philosophers of language, going back at least to Saussure, have proposed that associating words with extra-linguistic phenomena is beside the point, and the structure of the extra-linguistic world may be considered implicit in the relationships between words in sentences, paragraphs and discourses. Logically it seems this “structuralist” thesis might be true if one considered inter-word relationships as expressed in a nearly-infinite linguistic corpus. But the amount and type of nonsense generated by modern LLMs is evidence that, if indeed a quality model of the world is implicit in purely linguistic relationships, current tech for analyzing these relationships is not sufficient for extracting the hidden implicit world-model.
A related question that rarely seems to be asked, though, is: Suppose coupling an LLM with some sort of fact-checker produced a non-bullshitty ChatGPT-ish system ... then would this be a human-level AGI?
Yeah – Of course not…
Bullshitting is a Symptom, Not the Core Problem
The fact that ChatGPT and other LLM based systems are bloviating bullshitters is a significant impediment to using them in practical applications. It means their output needs laborious checking before it can be used for most purposes. And this propensity for nonsense generation is certainly associated with the lack of genuine understanding underlying the systems. However, it does seem possible that the amount of BS generated by such systems could be reduced drastically without any significant increase in general intelligence. The reduction wouldn’t be to the level of consistent sensibleness of a smart human, but it could be enough to allow the results to be used without human checking in a greater variety of non-critical contexts.
The bottom line is, such a "non-bullshitty ChatGPT-ish thingie" would still be repermuting and serving up chunks of human knowledge, rather than forming new chunks of knowledge based on pattern-crystallization seeded by experience.
A hypothetical "non-bullshitty ChatGPT" would be much more useful than ChatGPT, both humanistically and commercially, but it would still be far short of human level in terms of general intelligence ...
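To make the thought experiment concrete, a coupling of that sort might look roughly like the following sketch, where `llm_generate` and `fact_check` are hypothetical placeholder functions rather than any existing API. Even if the checking loop worked perfectly, the system inside it would still only be recombining existing human knowledge.

```python
def non_bullshitty_chat(prompt, llm_generate, fact_check, max_tries=5):
    """Hypothetical coupling of an LLM with an external fact-checker.

    `llm_generate` (prompt -> text) and `fact_check` (text -> (ok, reason))
    are placeholder callables, not real APIs.  Even a perfect filter of this
    kind only vets re-permuted human knowledge; it forms no new knowledge
    from experience.
    """
    for _ in range(max_tries):
        draft = llm_generate(prompt)       # pattern-based text synthesis
        ok, reason = fact_check(draft)     # external verification step
        if ok:
            return draft
        # fold the rejection back into the prompt and try again
        prompt += f"\n(Previous draft was rejected because: {reason})"
    return None  # no draft survived checking
```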
Such a system might do quite well on tests created for evaluating humans (IQ tests, the Turing Test of tricking people into thinking it’s human, high school exams or whatever). If so this would be impressive and interesting but would not show much about general intelligence ... these tests were created for human systems, because doing well on them correlates with certain sorts of deeper understanding and ability among humans … but that doesn’t imply any similar correlation between test performance and deeper understanding and ability holds among non-human test-takers with very different cognitive architectures.
For those willing to take a little bit of a deep dive into the AI research world, Melanie Mitchell’s blog post on GPT3 does a wonderful job of exploring the strengths and limitations of LLMs at analogy-making. Using toy analogy problems regarding letter sequences — pioneered by Mitchell, Douglas Hofstadter and their colleagues in their long-ago research on machine creativity — Mitchell reviews both examples where GPT3 seems to grok the essence of an analogy problem and others where it misses the mark in a way no smart human would do:
“I found that GPT-3 was indeed very good at “basic successor” three-letter-string problems like abc —> abd, pqr —> ?. However, I did try one of my favorites: abc —> abd, xyz —> ? GPT-3 returned the strange answer xye.
“Like Webb et al. I found that GPT-3 often had trouble generalizing to longer targets, such as abc —> abd, pqrstuvwx —> ? GPT-3’s strange answer here was qrstuvyz (dropping the first letter and changing the rightmost two to their successors).
“GPT-3 also had trouble mapping from successor to predecessor relationships. For example, abc —> abd, rqp —> ? GPT-3’s strange answer was: rqe.”
These are abstract “toy” problems of course, but the phenomena she highlights occur in every domain that LLMs try to reason about, whether toy or realistic — they are just more easily whittled down to their essence in the context of these toy “analogy research test problems.”
Now, the simpler, smaller AI systems Mitchell and Hofstadter were exploring when they devised these analogy problems ALSO got a lot of problems wrong — though according to quite different patterns than GPT3 or other LLMs. The fact that GPT3 can get a nontrivial percentage of these problems right is interesting and impressive (I haven’t tried ChatGPT on these, but I feel very confident it does not overcome GPT3’s core cognitive limitations in this area, since it doesn’t do so in other areas...). But the fact that it gets so many simple-for-smart-humans ones wrong is indicative of its lack of understanding of the letter domain underlying the problems.
For instance, to get abc —> abd, xyz —> ? you need to understand the concept of “next in the alphabetical sequence”, and then you have to think about what to do in the case of z when there is no “next letter” after z. Wrapping around is an elegant option, xyz —> xya. Deciding that the sequence approach is inapplicable and coming up with the somewhat plodding xyz —> xyd is also reasonable. But what is the cognitive model and thought process underlying GPT3’s answer xyz —> xye? I suppose it’s possible there is some subtle creative domain-model here that is simply not obvious to the human mind. But far more likely is that the network is just interpolating among a bunch of similar examples in its training database, in a way that ends up making no sense because this problem, simple as it is, can’t be solved by interpolating from these examples. (Even though this particular problem is very likely actually contained in some books and research papers that ARE in the language model’s training base….) To put it informally, there is no way to come up with interesting, meaningful solutions to this problem without thinking — based on the training data that GPT3 has — so it comes up with a dorky and meaningless answer because, well, it cannot think.
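For contrast, here is roughly what an explicit, manipulable model of the letter domain looks like: a few lines of Python suffice to encode “successor in the alphabet” plus a wrap-around convention. This is only an illustrative sketch (not Mitchell and Hofstadter’s Copycat, and not anything GPT3 contains), but it is the kind of inner model whose absence the wrong answers betray.

```python
import string

ALPHABET = string.ascii_lowercase


def successor(letter, wrap=True):
    """Next letter in the alphabet; 'z' wraps around to 'a' when wrap=True."""
    i = ALPHABET.index(letter) + 1
    if i == len(ALPHABET):
        return ALPHABET[0] if wrap else None
    return ALPHABET[i]


def apply_abc_abd_rule(target):
    """The 'abc -> abd' rule: replace the last letter with its successor."""
    return target[:-1] + successor(target[-1])


print(apply_abc_abd_rule("xyz"))        # -> 'xya', the elegant wrap-around answer
print(apply_abc_abd_rule("pqrstuvwx"))  # -> 'pqrstuvwy', generalizes to longer strings
```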
General Intelligence Requires Abstraction and Creativity
Having organized annual conferences on Artificial General Intelligence research since 2006, and worked on related R&D since a couple decades before that, I have seen quite a lot of theoretical and practical approaches toward the “grand goal” of making machines that can really think. Of course, in a sense all this experience is of limited value, because neither my own nor anybody else’s research has yet proved itself capable of producing a real thinking machine. But I do think I’ve come to understand the relevant issues reasonably well.
So I’m very confident when I tell you: the basic architecture and algorithmics underlying ChatGPT and all other modern deep-NN systems are, by their very nature, totally incapable of general intelligence at the human level or beyond.
Such neural networks could form part of an AGI, sure – but not the main cognitive part.
At the annual AGI conference I co-organize, I see presentations of quite a few research approaches with apparently more potential for genuine machine understanding and thinking. There’s the OpenCog Hyperon project I’m involved in, but also many others: Pei Wang’s NARS, Kristinn Thorrisson’s Aera, and the work of Gary Marcus and Arthur Franz, to name just a few. These approaches are based on a variety of different methodologies – nonstandard logics, self-organizing dynamical systems, psychological models of uncertain reasoning, search of spaces of computer programs,... Deep neural networks play a supporting role in some of this work, but all this serious AGI R&D recognizes that human-like intelligence involves far more than recognizing patterns in huge datasets and synthesizing simulacra of observed data by patchworking-together observed patterns.
Some of the limitations of current deep learning methods are implicit in the very name — current paradigms only tackle learning, and synthesis based on learned patterns, whereas cognition is about reasoning, modeling, planning, acting, imagining and a lot of other things too. The near-absence of reasoning and modeling in these networks is closely connected to their failure on so many of the analogy problems Melanie Mitchell gave them.
However, even the learning carried out is fairly restricted – without reasoning and self-modeling and other aspects of an integrated mind, machine learning tends to focus on relatively surface-level data patterns rather than learning subtler abstractions. But it’s subtler abstractions that tend to generalize better beyond the precise domain where they were learned – yielding the “general” in general intelligence. In the end neither learning nor anything else is really the “main” part of human-like general intelligence — it’s more about how to integrate these parts into a unified whole. LLMs may be helpful to AGI systems – we are integrating them into our OpenCog Hyperon system – but as “pattern recognition / synthesis mind lobes” woven into a broader framework, not as core cognitive architectures.
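As a loose illustration of what “pattern recognition / synthesis mind lobe” could mean in practice, here is a toy sketch with hypothetical class names; it is not the actual OpenCog Hyperon interface, just a picture of the division of labor, with the LLM proposing linguistic patterns and the surrounding architecture doing the reasoning, modeling and remembering.

```python
class LLMLobe:
    """A language model wrapped as one pattern-recognition/synthesis
    component among several (hypothetical interface, not Hyperon's API)."""

    def __init__(self, generate_fn):
        self.generate = generate_fn   # prompt -> text, placeholder callable

    def propose(self, context):
        # surface-level pattern completion over the linguistic context
        return self.generate(context)


class CognitiveCore:
    """Toy stand-in for the integrative layer: reasoning, memory and goal
    systems decide what (if anything) to do with the lobe's proposals."""

    def __init__(self, lobe, reasoner, memory):
        self.lobe, self.reasoner, self.memory = lobe, reasoner, memory

    def think(self, situation):
        candidate = self.lobe.propose(situation)      # pattern synthesis
        vetted = self.reasoner.evaluate(candidate)    # inference / world-modeling
        self.memory.store(situation, vetted)          # grounding in experience
        return vetted
```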
The results of AGI R&D based on more sophisticated architectures are, as yet, not as exciting and commercial as the results of deep neural nets. But that’s the way progress often goes – the path to long and medium term success does not always pass through the stuff that looks the most exciting in the short term. And the stuff that looks most exciting at one point, may later appear to have been “just OK” – like when you’re hiking up a mountain and think you see the peak fairly close up ahead… and then when you get there you see the REAL peak towering over you, several hours’ arduous ups and downs further.
And the short-term excitingness of LLMs also has not come cheap, of course. The amount of funding and human brainpower put into these massive "knowledge repermuting" systems like ChatGPT is immensely greater than the amount put into alternate AI approaches paying more respect to the complexity of grounded, self-modifying cognition.
From a modern view, the late 1970s to early 90s wholesale skepticism of multilayer neural nets and embrace of expert systems looks naive, archaic and silly. I predict that, in a similar way, by the mid/late 2020s today's starry-eyed enthusiasm for LLMs and glib dismissal of subtler AGI approaches are going to look quaint and ridiculous.
I hope I’ve made clear that, with my narrow-AI system developer hat on, I can see lots of fun stuff to do with today's LLMs, from generating weird lyrics for Desdemona Robot to sing, to generating new biomedical hypotheses from masses of PubMed abstracts/papers to help with the longevity research we’re doing at Rejuve Biotech, to making amazing new tools for musicians to use to create complex music with unprecedented speed and simplicity, and so much more.
My point here is not that these LLM-based systems are un-cool or un-useful -- just that they are a funky new sort of narrow-AI technology that is not as closely connected to AGI as it would appear on the surface. I do suspect we may be quite close to creating human-level AGI, but I don’t think LLMs are going to be the central part of what gets us there.
As to whether my own AGI project will be the one to break through the barrier, in spite of the conceptual limitations posed by my role in co-founding OpenAI (sic), my AI-chip development partner Rachel St. Clair couldn’t help asking ChatGPT, but found it predictably and boringly circumspect:
Doesn’t ChatGPT perceive words in some kind of hashed form? I don’t think that the analogies presented are fair for that reason. Here’s my transcript:
——————
ROBERT McINTYRE: Ok analogy time. I’ll give you some analogies and you fill in the blank. Here’s the first one: “one” “two” “three” -> “one” “two” “four”. Now, using that pattern, fill in the following: “three” “four” “five” -> “three” “four”

ChatGPT: "five"

ROBERT McINTYRE: And explain your reasoning?

ChatGPT: The pattern given in the first analogy is counting in order from one to four. Each number in the sequence is one greater than the previous number.

In the second analogy, following the same pattern, the next number in the sequence after "three" "four" "five" would be "five" as it is the next number in counting order.

ROBERT McINTYRE: “Five” is wrong. What’s a better answer and why??

ChatGPT: I apologize for the confusion, my previous answer was incorrect.

The pattern given in the first analogy is counting in order but with a missing number, the blank is filled by the next number of the counting. So for the second analogy, the missing number is “six”, as it is the next number of the counting and follows the pattern of the first analogy.

ROBERT McINTYRE: That’s right! Let’s do another problem.

ChatGPT: Sure, I'll do my best to help. What's the next analogy?

ROBERT McINTYRE: “A” “B” “C” -> “A” “B” “D”. Now fill in the last letter: “J” “K” “L” -> “J” “K”

ChatGPT: The pattern given in the first analogy is that the missing letter is the next letter in the alphabet. Following this pattern, the last letter in the second analogy would be "M".

ROBERT McINTYRE: Now let’s do another analogy similar to the last one: “A” “B” “C” -> “A” “B” “D”. Now fill in the last letter: “X” “Y” “Z” -> “X” “Y”

ChatGPT: The pattern given in the first analogy is that the missing letter is the next letter in the alphabet. Following this pattern, the last letter in the second analogy would be "Z".

ROBERT McINTYRE: “Z” is wrong, what’s a better answer and why?

ChatGPT: I apologize for the confusion, my previous answer was incorrect. As the pattern is that the missing letter is the next letter in the alphabet, and "Z" is the last letter of the alphabet, the next letter would not be in the alphabet. So in this case, the analogy does not work, the pattern is broken and the blank cannot be filled.

ROBERT McINTYRE: Ok but if you had to make a letter what would you pick and why?

ChatGPT: As a language model, my main function is to process and understand human language, and I don't have the ability to create new letters. However, in this case, if I had to make a letter to complete the analogy, it would be "A" as it is the first letter of the alphabet and it would be the next letter in the sequence.