Please use a No. 2 pencil. You have one hour. There will be no bathroom breaks.
Incanter and I have been engaging in that popular pastime of many gamers, talking about Artificial Intelligence in games. And when I say talking, I of course mean complaining. True, compared with the dark ages of the early 90s when I began taking games seriously, there have been some notable improvements in gaming AI. For the most part, FPS enemies no longer simply leap out from behind corners and stand there blazing away at you with a pistol while you saw them in half with a mini-gun (there are, of course, always unfortunate exceptions, (cough) Doom 3 (cough)). They don’t usually stand there contemplating the transcendent beauty of the texture maps while you walk up unnoticed beside them and put a cap in their arse. The quality of sidekick characters has improved markedly and there have also been dramatic improvements in the unit AI you find in the best RTS games. However, I think it’s fair to say that gaming AI has really evolved only from the level of bloody annoying to that of not too aggravating. I feel about smart, satisfying game AI the way people who grew up in the 1950s must feel about personal jetpacks: what the hell happened?
It might be useful at this juncture, then, to ask whether or not some of the assumptions informing the quest to develop sophisticated game AI are in need of an overhaul. I want to start, however, by looking at a slightly different issue: the tests used to evaluate the intelligence of artificial entities.
There is, of course, the Turing Test, the best-known instantiation of which is probably the Loebner prize established in 1990: a grand prize of $100,000 to be awarded to an AI that is indistinguishable from a human. Needless to say, the grand prize has never been awarded and every year people fight it out for the $3,000 consolation prize for the entity most like a human. (The 2009 contest was held on September 6 and the results have not yet been announced.) The details of the contest have varied slightly over the years, but it always seems to return to the “classic” format: a judge faces off against both a human and an AI and tries to guess which is which. People obviously expend blood and treasure on this endeavour, but the abilities of even the winning chatterbots are less than inspiring and those of the losing ones are downright embarrassing. Ironically, the last-place bot in the 2008 contest was programmed with a religious personality (and, according to the transcripts, Brother Jerome spent most of his time not responding at all–perhaps the bot should instead have been called God?) while the eventual winner, Elbot, apparently fooled a couple of judges…despite having the programmed persona of a robot. (You can judge Elbot’s conversational chops for yourself.)
Now there is probably a small fortune awaiting the first person to develop a convincingly human chatterbot. That way someone can install a machine with a limited ability to speak English and an even more limited ability to understand it into customer phone support positions and dispense with the expensive intermediary step of having to turn real human beings into unhelpful machines. But, for the most part, the success or failure of this kind of Turing test is irrelevant to the concerns of designing game AI.
I am, however, interested in the test conditions used by the Loebner prize and the degree to which they stack the deck against the AI. These parameters are in fact representative of other attempts to evaluate AIs, including those more specific to gaming: they are less concerned with meaningfully evaluating the ability of an AI to imitate a human than with maintaining a commonplace (and, I would add, overly optimistic) belief in the sophistication of human social interaction.
The question we should be asking is not whether an AI can imitate a human being, but under what conditions? For example, as mediated human communication approaches more closely the condition of machine-generated gobbledygook, the likelihood of an AI fooling a human increases. If the test were based around tweets or text messages I’d expect an AI to do pretty well. (Interestingly, the first winner of the Loebner prize won, somewhat controversially, by being able to mimic the typing and grammatical errors of a human being.)
The way the Loebner test (and others, something I will explore in a subsequent post) is set up, however, it is humans that are being tested, not the AI: what is being evaluated is not a bot’s ability to fool a human but the ability of a human to distinguish between a bot and a human. The Loebner prize test conditions, while claiming to test the ability of a bot to engage in a naturalistic conversation, therefore employ a highly artificial conversational set-up. There are only ever two possible conversation partners, and the judge converses with each, mano a mano (or mano a micro), in turn. The judge is (almost always) aware that one of them is non-human (and you can see the judge and the human partner making reference to this in many of the contest transcripts). The judge is closely scrutinizing every utterance in order to determine whether or not their conversational partner is non-human.
If this is your everyday conversational reality then you are either locked in a double-wide somewhere in Kansas feverishly updating your blackhelicoptersarecoming.org blog, or have a serious Halo addiction for which you need to seek immediate help before you harm yourself and/or others. Personally, I don’t have a lot of conversations that involve me trying to determine if one of my friends is more human than the other (some, sure, but not many).
If you were really interested in testing the ability of the AI to imitate human communication, wouldn’t you structure a less predictable test? You might, for example, mix it up a bit. Sometimes the human judge would be facing one human and one bot; sometimes they might be facing two humans, sometimes two bots, and they would never know which combination they were facing. Perhaps the judges would occasionally be faced with three entities. Or, you could even make the test a really high stakes one. You, the judge, are interacting with only one entity: tell me if it is human or not. You can see how all these combinations might complicate things.
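To put some rough numbers on how much the standard set-up helps the judge, here is a minimal, purely illustrative Python sketch. The format names and the random-guessing judge are my own assumptions for the sake of the sketch, not anything the Loebner contest actually specifies; it simply compares a judge who guesses blindly under the classic one-human-one-bot format, a mixed format where the pair could be any combination, and a single-entity format.

```python
import random

# Purely illustrative: how often does a judge with no real insight, guessing
# at random, label every entity correctly under different hypothetical formats?
# (The formats and names here are assumptions, not the actual Loebner rules.)

LABELS = ["human", "bot"]

def make_panel(fmt):
    """Return the true labels for one round of the given format."""
    if fmt == "classic":        # always exactly one human and one bot
        panel = ["human", "bot"]
        random.shuffle(panel)
        return panel
    if fmt == "mixed_pair":     # two entities, any combination, unknown to the judge
        return [random.choice(LABELS) for _ in range(2)]
    if fmt == "single":         # one entity: human or not?
        return [random.choice(LABELS)]
    raise ValueError(fmt)

def blind_judge(panel, fmt):
    """A judge who guesses at random. In the classic format the judge at least
    knows exactly one entity is a bot, so the guess respects that constraint."""
    if fmt == "classic":
        guess = ["human", "bot"]
        random.shuffle(guess)
        return guess
    return [random.choice(LABELS) for _ in panel]

def fully_correct_rate(fmt, trials=100_000):
    """Fraction of rounds in which the blind judge labels every entity correctly."""
    hits = 0
    for _ in range(trials):
        panel = make_panel(fmt)
        if blind_judge(panel, fmt) == panel:
            hits += 1
    return hits / trials

for fmt in ("classic", "mixed_pair", "single"):
    print(f"{fmt:10s} blind-guess success rate: {fully_correct_rate(fmt):.2f}")
# Roughly: classic 0.50, mixed_pair 0.25, single 0.50. In the single case the
# judge can no longer lean on knowing that exactly one partner is a bot.
```

Even a judge with zero insight labels both entities correctly about half the time in the classic format, because knowing that exactly one partner is a bot does half the work for them; take that guarantee away and blind guessing drops to roughly a quarter. That gap is one crude way of seeing how much the standard format hands the judge before the conversation even starts.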
What the Loebner contest focuses on is a model of human communication that is content rich but context limited. AIs fail this kind of Turing test with monotonous regularity because they are expected to provide full and satisfying responses on a wide variety of potential conversational topics and to do so in a fashion that indicates attentiveness to the needs of their conversational partner. This is what most people would probably think of as the basis of real human communication. However, this expectation of subject-oriented (in two senses) sophistication is purchased only through creating a restrictive, artificial conversational framework. In everyday human conversation, how many of the following apply?
- The stakes are high; a lot rides on the outcome of the particular conversation;
- Your conversational partner has your fierce, undivided attention and you treat their utterances as if you have (or should have) a similar degree of attention from them;
- The purpose of the conversation is to compare their utterances with those from someone else;
- The comparison is, furthermore, based not on the truth or usefulness of the information imparted by your conversational partners but on the degree to which their utterances qualify as syntactically, logically, and situationally valid.
Obviously this represents a highly idealized view of “standard” human conversation. Indeed, most human conversations would probably fail such a Turing test.
In my next post I want to look at how this kind of Turing test compares with one method for evaluating game AI: the BotPrize.
—Twitchdoctor
This is a very good point. The goal of the Turing test is to measure the anthropomorphic-ness of the AI, but this is only determined by the human judge’s ability to distinguish a human from an AI. The judge’s ability to make this distinction is influenced by a few factors. First, is the human control playing fairly? There have been instances of people behaving less like a human to make the judge’s task more difficult. The second factor is the freedom of expression enabled by the interaction model. For instance, an AI controlling a Pong paddle has a much lower bar to be indistinguishable from a human player than one that must interact in a natural conversation via chat. But as you pointed out, this is not really a natural conversation; it is an interrogation of the AI, and I agree this makes the results less interesting. The third factor is the complexity of the environment, real and virtual, where the test takes place. It is much easier for a judge to distinguish a human from an AI when the two are alone together in a (virtual) room than if they were in a crowded (virtual) room with lots of conversations going on simultaneously.
I think your upcoming post on the 2K BotPrize competition should be interesting. It’s another example of a Turing test, but one where the interaction model offers less expressivity than a natural language chat, and the environment where the test takes place is more complex (i.e., a first-person shooter game).
Hmm. I like that concept of “freedom of expression.” I’ll have to think about that some more. My first response was that such an idea is more in tune with what I’ve called the subject-oriented focus of such tests, and which you’ve called, perhaps more accurately, the focus on the “anthropomorphic-ness” of the AI. But it’s also used in the biological sense (and maybe a programming sense for all I know), as in the expression of particular traits. In that sense, freedom of expression would have an interesting connection with your idea about AI and personality types.