Friday, April 7, 2023

Microsoft's "Sparks of Artificial General Intelligence" Study: Some Reflections

Since the release of GPT-4 a scarce month ago, we have seen an abundance of comment on the capabilities of the system--overwhelmingly superficial comment, with little critical thought in evidence, such that a discerning observer may read a great deal of it without having much basis for judging its actual significance. (We are told over and over and over again that GPT-4 did well on a Bar Exam. That sounds impressive. But what does that really mean?)

However, a new study from a team of scientists at Microsoft, "Sparks of Artificial General Intelligence: Early Experiments With GPT-4," offers something more substantial. Working from a definition of intelligence laid out in a noted 1994 editorial produced by a large (52 member!) group of psychologists ("a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience"--a capability which, the study's authors note, implicitly "encompasses a broad range of cognitive skills and abilities"); a standard for what constitutes "Artificial General Intelligence" (as an artificial intelligence able to perform in the aforementioned ways at a level at least comparable to that of a human); and an understanding of how GPT-4 functions (as a neural network-based "Large Language Model" (LLM) trained on a vast body of web-based text "using at its core a self-supervised objective of predicting the next word in a partial sentence"); they devised a series of challenges which would be "novel and difficult" for such a program. In doing so they endeavored to challenge GPT-4 not only with a variety of demanding problems (coping with imagery, producing code, solving mathematical problems, etc.), but also with highly idiosyncratic challenges that would require the chatbot to synthesize knowledge and skills from different areas to cope with problems for which its training was very unlikely to prepare it--demonstrating that more than memorization was involved, and that it instead possessed a "deep and flexible understanding of concepts, skills and domains." An excellent example is one test requiring it to present the proof of the infinitude of primes (aka "Euclid's theorem") in the form of Shakespearean poetry, which forced the system to combine "mathematical reasoning, poetic expression and natural language generation." Still another required it to solve a riddle whose answer hinged on the geographical location of the event described, purely from clues that, it might be said, could only be handled on the basis of the "common sense" that has been such a challenge for AI developers. (Specifically it was expected to guess the color of a bear a hunter shot in a location where walking one mile south, one mile east and one mile north brought him back to where he started--which can only be the North Pole, since only there does a mile south, a mile east along the resulting circle of latitude and a mile north return one to the original point; and a bear at the North Pole must be a polar bear, and therefore white.)
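For readers who want the mathematics itself spelled out, the standard proof GPT-4 was asked to versify runs roughly as follows (my own compact gloss, of course, not the paper's Shakespearean rendition):

    % Euclid's theorem: there are infinitely many primes.
    % Sketch of the classical proof by contradiction.
    \begin{proof}
    Suppose the primes were finitely many: $p_1, p_2, \ldots, p_n$.
    Let $N = p_1 p_2 \cdots p_n + 1$. Dividing $N$ by any $p_i$ leaves
    remainder $1$, so no $p_i$ divides $N$. Yet $N > 1$ has at least one
    prime factor, which must therefore lie outside the supposedly complete
    list--a contradiction. Hence the primes are infinite in number.
    \end{proof}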

All of this established, the Microsoft study then proceeds to detail the experiments and their results, and then to present more general conclusions drawn from them. Ultimately the authors judged that "in all of these tasks, GPT-4's performance is strikingly close to human-level performance"--it even met the common sense challenge--such that they "believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system" (emphasis added).

Because italicizing the relevant text doesn't seem to do it justice, I am going to repeat it: "reasonably . . . viewed . . . as an early . . . version of an artificial general intelligence."

In other words, it is reasonable to say that not only is artificial intelligence here, but AGI is here. Now. As an actuality that millions of people are tinkering with.

Still, in looking at that statement one should not forget the qualifications. It may be reasonable to say GPT-4 is an AGI . . . but "an early" and "still incomplete" version of it, and the team devotes as much time to explaining the limits of the AGI at hand as to wowing us with its genuinely impressive capabilities. Key to their analysis is the distinction that the psychologist Daniel Kahneman drew between "fast thinking" and "slow thinking." The former is automatic and intuitive, the latter controlled and rational--the kind where we have to consciously reason our way to the solution of a problem, a process which is, as the terminology implies, slower and more "effortful," but also likely to be more accurate and reliable.

As one might guess from the description of its functioning, the word-predicting chatbot is a fast thinker which "essentially" may be said to "come up with an answer in . . . a single pass of the feedforward architecture." This works well with some problems, but not others, particularly those tasks which are "discontinuous" and so "require planning ahead . . . or a 'Eureka idea'"--and when faced with such tasks GPT-4's performance suffers, with the authors offering as one example of this another problem related to prime numbers. While GPT-4 did surprisingly well at explaining Euclid's theorem by way of Shakespearean poetry (inventively writing that explanation out as a dialogue between Romeo and Juliet), it did not do well at what would seem the much humbler task of simply giving us a list of the prime numbers between 150 and 250, with the authors arguing that this is the kind of "slow thinking" task where most of us would "get out a scratchpad" and work it out--whereas GPT-4 has no such function, or even the basis for one. As this implies, GPT-4 is simply not equipped to assess the quality of its own information or thought process, and is not very good with context--deficiencies reflective of other lacks, like the absence of long-term memory, and of faculties for making use of such a memory through "continual learning." Together these leave it very sensitive to the form as well as the content of inputs, "the framing or wording of prompts and their sequencing in a session" easily throwing it even where tasks it can perform well are concerned. The authors of the study also stress that GPT-4 lacks the "confidence calibration" that would let it distinguish between when it is "guessing" and when it actually "knows," and that, as many have remarked, it is prone to "hallucinating" false information. The result is that for its designers to improve this "early" and "still incomplete" AGI into something more complete they must add a capacity for "slow thinking" as well as "fast thinking," with the former overseeing the latter; long-term memory; a continual learning capacity to take full advantage of that faculty; and, the authors add, a shift beyond single-word prediction toward "higher-level parts of [a] text such as sentences, paragraphs or ideas" (which they recognize may or may not emerge from a "next-word-prediction paradigm," even with these improvements).
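To make the fast/slow contrast concrete, consider what the "scratchpad" approach to the prime-listing task actually looks like when written out--here as a minimal Python sketch (my own illustration, not anything from the paper): every candidate in the range is checked explicitly, one divisor at a time, rather than an answer being produced in a single intuitive pass.

    # Deliberate, step-by-step "slow thinking": check each candidate
    # in the range by trial division rather than guessing the answer.
    def is_prime(n: int) -> bool:
        """Return True if n is prime, testing divisors up to sqrt(n)."""
        if n < 2:
            return False
        d = 2
        while d * d <= n:
            if n % d == 0:
                return False
            d += 1
        return True

    # Work through the whole range, as one would on a scratchpad.
    primes = [n for n in range(150, 251) if is_prime(n)]
    print(primes)

Worked through this way, the procedure cannot help but get the right answer--the eighteen primes from 151 through 241--which is precisely the reliability a single feedforward pass does not guarantee.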

Even as someone long inclined toward skepticism regarding the significance of the performance of systems like GPT-4 (I have tended to focus on how AI copes with the physical world, an area where the record has been rather disappointing), I have to admit that the case made in the study impressed me, rather more than anything else I have heard or read about OpenAI's products since these started grabbing headlines last year. Indeed, I find myself thinking that here, after a great many false starts, AGI may finally be shifting from science fiction to science fact, with some of the capabilities that were supposedly "just around the corner" for decades finally arriving. (Indeed, in this month marking the forty-first anniversary of the Fifth Generation Computer Systems Initiative, it seems that we may finally be getting what was promised then, with important economic and cultural implications perhaps not too far off.)

However, I am also impressed by the constructive criticism offered herein, which seems to me well worth thinking about, particularly in regard to the necessity of a slow thinking function overseeing the more "autocomplete"-like function being demonstrated (the more so in that, I suspect, the shift from single-word prediction to dealing with higher-level parts of text would depend heavily on it). If we accept GPT-4 as increasingly human-like when it comes to fast thinking, just where does AI research today stand with regard to the slow kind? And for that matter, where does it stand in regard to the integration of the two in a useful manner? Granting the case made here, this would seem the next issue to think about--with the state and rate of progress determining whether the advance of AI research that has seemed so blistering in recent months accelerates, slows, or even comes to another screeching halt of the kind that has renewed cynicism toward the field time and time again.
