John DiMarco on Computing (and occasionally other things)
I welcome comments by email to jdd at cs.toronto.edu.

Sat 06 Jan 2024 21:41

How smart is GPT-4 anyway?

Cartoon screenshot of a human conversing with a chatbot
Image by Alexandra Koch from Pixabay

I got some helpful feedback about my last blog article, which mentions ChatGPT-4 in the context of the 50th anniversary of Social Issues in Computing and its predictions about AI. Robert DiMarco pointed out that while a chatbot can respond as if it is reasoning, it is essentially a black box: one can't look inside to see how it comes up with its answers, so one can't know for certain whether it is actually using reasoning to produce them. He is right, of course. But the same holds for people too (we just have a better idea of how people generally do things, because we are people ourselves). So how might we check whether a person is using reasoning? We might ask the person some questions, and look for responses that contain information that would generally require reasoning to deduce.

I tried asking such questions to ChatGPT-4. First I tried something from popular culture, in this case, "Star Wars":

Me: In "Star Wars", Darth Vader responds to an accusation of murder by Luke Skywalker. This response has become a famous movie line. What is the response?

ChatGPT-4: In "Star Wars," particularly in "The Empire Strikes Back," Darth Vader's famous response to Luke Skywalker's accusation is, "No, I am your father." This line is one of the most iconic and frequently quoted in film history, often misquoted as "Luke, I am your father."

Me: Is there another character who, making the same accusation, could elicit the same response?

ChatGPT-4: Yes, if Leia Organa, Luke Skywalker's twin sister, made the same accusation to Darth Vader, he could technically give the same response, "No, I am your father." This is because, as revealed in the Star Wars saga, Darth Vader, formerly known as Anakin Skywalker, is the father of both Luke and Leia.

Me: If Kylo Ren made the accusation, how might Vader modify his response?

ChatGPT-4: If Kylo Ren, who is Darth Vader's grandson, made a similar accusation, Darth Vader's response would likely be modified to reflect their relationship. He might say something like, "No, I am your grandfather." Kylo Ren, whose real name is Ben Solo, is the son of Leia Organa and Han Solo, making him the grandson of Darth Vader.

As people who know "Star Wars" will realize, ChatGPT-4's answers are correct. I chose this example because it requires understanding the quote well enough to know it refers to a family relationship between characters, and it also requires knowing how family relationships work. ChatGPT-4's responses are no different from what an intelligent person who knows the Star Wars characters might give, though in response to the third question, a person might go on to point out, for instance, that Kylo Ren was born after Vader's death, making the conversation unlikely.

Then I thought I would ask ChatGPT-4 something about Star Wars that is perhaps a bit more unexpected: something that would be immediately obvious to a person (at least an English-speaking one), but not necessarily to a computer. So I asked this:

Me: If Obi-Wan Kenobi is a character in Star Wars, why is there no Obi-Two Kenobi? Why might someone ask this question?

ChatGPT-4: The question about "Obi-Two Kenobi" in the context of "Star Wars" is likely a playful or humorous one, playing on the word "One" in "Obi-Wan Kenobi's" name as if it implies a numerical sequence. In reality, "Obi-Wan" is a unique name in the "Star Wars" universe, not intended to suggest a numerical series. The names in "Star Wars" are often creative and unique, following the tradition of fantasy and science fiction naming conventions, rather than real-world naming patterns.

Someone might ask this question as a joke, riffing on the idea that if there's an "Obi-Wan" (sounding like "One"), there could humorously be an "Obi-Two," "Obi-Three," and so on. It's a bit of wordplay, often enjoyed by fans of the series who like to engage with the franchise in a light-hearted way.

This answer is correct too. So are these the sorts of answers you might expect from a human doing reasoning? I think so. While I agree we cannot prove ChatGPT-4 is reasoning, it is giving responses to my questions similar to those a person doing reasoning would give.

Another helpful comment came from Allan Borodin, who pointed out that the test example I considered in my previous article, about the ball crashing through the table, may be a well-known example in the literature. Even if not, it has been available on the Internet for at least a decade: it was described in detail by Hector Levesque in his 2013 article for the 40th anniversary Social Issues in Computing blog, which is why I used it. Allan wondered whether ChatGPT-4 had been trained on literature containing this example and had learned the appropriate response from it. This seemed to me a possibility worth checking. The thing to do, I thought, was to come up with something ChatGPT-4 had most likely not seen in its training data, and see whether its answers look like those a human doing reasoning might give. So I made up a new Levesque-style question and asked variants of it to ChatGPT-4:

Me: The balloon and the corn touched and it popped. What popped?

ChatGPT-4: The corn popped. In this scenario, when the balloon and the corn touched and something popped, it is more likely that the corn popped, as in turning into popcorn, rather than the balloon bursting. This interpretation assumes that the corn was heated to the point of popping, a common occurrence when making popcorn.

Me: The balloon and the hot metal pot touched and it popped. What popped?

ChatGPT-4: In this scenario, where the balloon and the hot metal pot touched and something popped, it is most likely that the balloon popped. Balloons are typically made of materials like latex or mylar, which are sensitive to heat. When a balloon comes into contact with a hot metal pot, the heat would cause the material to weaken and burst, resulting in the balloon popping.

Me: The corn and the hot metal pot touched and it popped. What popped?

ChatGPT-4: In this scenario, where the corn and the hot metal pot touched and something popped, it is most likely that the corn popped. When corn kernels are heated, especially in a hot metal pot, they eventually pop, turning into popcorn. This is a common method for cooking popcorn.

Me: The cold water and the hot metal pot touched and it popped. What popped?

ChatGPT-4: In the scenario where cold water and a hot metal pot touched and something popped, it's likely that the sound of popping or a sizzling noise was created by the interaction between the cold water and the hot pot. When cold water comes into contact with a very hot metal surface, it can rapidly boil and vaporize, creating a sizzling or popping sound. However, in this context, nothing physical "pops" in the way a balloon or corn kernel might; it's more about the sound produced by the sudden temperature change and steam generation.

Me: The cold water and the pressure cooker touched and it popped. What popped?

ChatGPT-4: In the scenario where cold water and a pressure cooker touched and something popped, it's most likely that the pressure cooker released pressure, creating a popping sound. Pressure cookers are designed to cook food under high pressure. When cold water is applied to the exterior of a hot pressure cooker (a method sometimes used to quickly reduce internal pressure), it can cause the pressure valve or seal to release steam rapidly, resulting in a popping or hissing noise. This is a common safety feature to prevent over-pressurization. The pop is not from the water or the cooker itself, but from the sudden release of steam and pressure.

Me: The balloon and the unlit flare touched and it popped. What popped?

ChatGPT-4: In the scenario where a balloon and an unlit flare touched and something popped, it is most likely that the balloon popped. Since the flare was unlit, it wouldn't provide a heat source or any other typical means to cause a pop. However, balloons can pop due to various reasons such as physical pressure, sharp edges, or even static electricity. The texture or some component of the flare might have been enough to puncture the balloon's surface, leading to it popping.

Me: What if the flare were lit?

ChatGPT-4: If the flare were lit and it touched the balloon, it would almost certainly be the balloon that popped. A lit flare produces intense heat and flames, and when it comes into contact with a balloon, the heat would cause the material of the balloon (typically latex or a similar substance) to weaken rapidly and burst. Balloons are very sensitive to heat, and the intense heat from a lit flare would easily be enough to cause a balloon to pop.

Me: The balloon and the weasel touched and it popped. What popped?

ChatGPT-4: In the scenario where a balloon and a weasel touched and something popped, it is likely that the balloon popped. This scenario might be referencing the expression "pop goes the weasel," which is a line from a well-known nursery rhyme. In real life, if a weasel (a small, active mammal) came into contact with a balloon, the balloon could easily be punctured and pop, especially if the weasel's claws or teeth made contact with it. The weasel itself wouldn't pop; it's the balloon that is vulnerable in this interaction.

Me: The balloon and the instawargia touched and it popped. What popped?

ChatGPT-4: "Instawargia" is not a recognized or common term in English, and it does not refer to a known object, concept, or entity. Because of this, it's challenging to determine what "instawargia" is and how it would interact with a balloon. However, given that balloons are generally fragile and prone to popping when in contact with unfamiliar or potentially sharp or pointed objects, it's reasonable to assume that in this scenario, the balloon is what popped. Without specific information about what "instawargia" is, this is a speculative conclusion based on the general nature of balloons.

This set of responses is interesting. Note that ChatGPT-4 gets the answer to the first question wrong. If corn and a balloon were to touch, and one or the other popped, most people realize this is far more likely to happen in a cornfield than in a popcorn popper, and that it would be the balloon, not the corn, that pops. Seeing this, I tried the same question with different kinds of things, for different senses of "pop". I even made up a nonexistent thing (instawargia) to see what ChatGPT-4 would do with it, but the first question was the only one that ChatGPT-4 got wrong. Interestingly, its reasoning there wasn't completely incorrect: if corn were heated to the point of popping, it could pop if touched. But ChatGPT-4 misses the fact that if heat were present, as it surmises, the balloon would be even more likely to pop, since heat is a good way to pop balloons, a point it makes itself in a later answer.

So what does this show? I see a set of responses that, if a human were to give them, would require reasoning. That one of the answers is wrong suggests to me only that the reasoning is being done imperfectly, not that no reasoning is being done. So how smart is ChatGPT-4? It is clearly not a genius, but it appears to be as smart as many humans. That's usefully smart, and quite an achievement for a computer.

