Please use a No. 2 pencil. You have one hour. There will be no bathroom breaks.
Incanter and I have been engaging in that popular pastime of many gamers, talking about Artificial Intelligence in games. And when I say talking, I of course mean complaining. True, compared with the dark ages of the early 90s when I began taking games seriously, there have been some notable improvements in gaming AI. For the most part FPS enemies no longer simply leap out from behind corners and stand there blazing away at you with a pistol while you saw them in half with a mini-gun (there are, of course, always unfortunate exceptions, (cough) Doom 3 (cough)). They don’t usually stand there contemplating the transcendent beauty of the texture maps while you walk up unnoticed beside them and put a cap in their arse. The quality of sidekick characters has improved markedly and there have also been dramatic improvements in the kind of unit AI you find in the best RTS games. However, I think it’s fair to say that gaming AI has really evolved only from the level of bloody annoying to that of not too aggravating. I feel about smart, satisfying game AI the way people who grew up in the 1950s must feel about personal jetpacks: what the hell happened?
It might be useful at this juncture, then, to ask whether or not some of the assumptions informing the quest to develop sophisticated game AI are in need of an overhaul. I want to start, however, by looking at a slightly different issue: the tests used to evaluate the intelligence of artificial entities.
There is, of course, the Turing Test, the best known instantiation of which is probably the Loebner prize established in 1990: a grand prize of $100,000 to be awarded to an AI that is indistinguishable from a human. Needless to say, the grand prize has never been awarded and every year people fight it out for the $3,000 consolation prize for the entity most like a human. (The 2009 contest was held on September 6 and the results have not yet been announced.) The details of the contest have varied slightly over the years, but it always seems to return to the “classic” format: a judge faces off against both a human and an AI and tries to guess which is which. People obviously expend blood and treasure on this endeavour, but the abilities of even the winning chatterbots are less than inspiring and those of the losing ones are downright embarrassing. Ironically, the last-place bot in the 2008 contest was programmed with a religious personality (and, according to the transcripts, Brother Jerome spent most of his time not responding at all–perhaps the bot should instead have been called God?) while the eventual winner, Elbot, apparently fooled a couple of judges... despite having the programmed persona of a robot. (You can judge Elbot’s conversational chops for yourself.)
Now there is probably a small fortune awaiting the first person to develop a convincingly human chatterbot. That way someone can install a machine with a limited ability to speak English and an even more limited ability to understand it into customer phone support positions and dispense with the expensive intermediary step of having to turn real human beings into unhelpful machines. But for the most part, the success or failure of this kind of Turing test is irrelevant to the concerns of designing game AI.
I am, however, interested in the test conditions used by the Loebner prize and the degree to which they stack the deck against the AI. These parameters are in fact representative of other attempts to evaluate AIs, including those more specific to gaming: they are less concerned with meaningfully evaluating the ability of an AI to imitate a human than with maintaining a commonplace (and, I would add, overly optimistic) belief in the sophistication of human social interaction.
The question we should be asking is not whether an AI can imitate a human being, but under what conditions it can. For example, as mediated human communication approaches more closely the condition of machine-generated gobbledygook, the likelihood of an AI fooling a human increases. If the test were based around tweets or text messages I’d expect an AI to do pretty well. (Interestingly, the first winner of the Loebner prize won, somewhat controversially, by being able to mimic the typing and grammatical errors of a human being.)
The way the Loebner test (and others, something I will explore in a subsequent post) is set up, however, it is humans that are being tested, not the AI: what is being evaluated is not a bot’s ability to fool a human but the ability of a human to distinguish between a bot and a human. The Loebner prize test conditions, while claiming to test the ability of a bot to engage in a naturalistic conversation, therefore employ a highly artificial conversational set-up. There are only ever two possible conversation partners, and the judge converses with each, mano a mano (or mano a micro), in turn. The judge is (almost always) aware that one of them is non-human (and you can see the judge and the human partner making reference to this in many of the contest transcripts). The judge is closely scrutinizing every utterance in order to determine whether or not their conversational partner is non-human.
If this is your everyday conversational reality then you are either locked in a double wide somewhere in Kansas feverishly updating your blackhelicoptersarecoming.org blog, or have a serious Halo addiction for which you need to seek immediate help before you harm yourself and/or others. Personally, I don’t have a lot of conversations that involve me trying to determine if one of my friends is more human than the other (some, sure, but not many).
If you were really interested in testing the ability of the AI to imitate human communication, wouldn’t you structure a less predictable test? You might, for example, mix it up a bit. Sometimes the human judge would be facing one human and one bot; sometimes they might be facing two humans, sometimes two bots, and they would never know which combination they were facing. Perhaps the judges would occasionally be faced with three entities. Or, you could even make the test a really high stakes one. You, the judge, are interacting with only one entity: tell me if it is human or not. You can see how all these combinations might complicate things.
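To put a rough number on "complicate," here is a minimal sketch, not anything the Loebner contest actually runs. It compares a judge who guesses blindly under the classic format (always one human and one bot) with a judge guessing blindly under a hypothetical mixed format, where there are one to three partners and each is independently a human or a bot. The function names, the three-partner cap, and the coin-flip panel composition are all assumptions made for illustration.

```python
import random

LABELS = ["human", "bot"]

def classic_trial(rng):
    """Classic format: exactly one human and one bot, in random order."""
    panel = LABELS[:]
    rng.shuffle(panel)
    return panel

def randomized_trial(rng, max_partners=3):
    """Hypothetical mixed format: 1..max_partners partners, each human or bot."""
    size = rng.randint(1, max_partners)
    return [rng.choice(LABELS) for _ in range(size)]

def guess_classic(rng, n):
    """A blind judge who knows there is exactly one bot picks an ordering at random."""
    guess = LABELS[:]
    rng.shuffle(guess)
    return guess

def guess_randomized(rng, n):
    """A blind judge with no format guarantee labels each partner by coin flip."""
    return [rng.choice(LABELS) for _ in range(n)]

def blind_guess_rate(make_trial, guess, trials=20000, seed=0):
    """Fraction of trials in which the blind judge labels every partner correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        panel = make_trial(rng)
        if guess(rng, len(panel)) == panel:
            correct += 1
    return correct / trials

if __name__ == "__main__":
    print("classic:   ", blind_guess_rate(classic_trial, guess_classic))
    print("randomized:", blind_guess_rate(randomized_trial, guess_randomized))
```

Under these assumptions the classic format hands a blind judge about a one-in-two chance of being right, while the mixed format drops that to roughly 0.29 (the average of 1/2, 1/4, and 1/8). In other words, knowing in advance that exactly one partner is a bot does a good share of the judge's work for them.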
What the Loebner contest focuses on is a model of human communication that is content rich but context limited. AIs fail this kind of Turing test with monotonous regularity because they are expected to provide full and satisfying responses on a wide variety of potential conversational topics and to do so in a fashion that indicates attentiveness to the needs of their conversational partner. This is what most people would probably think of as the basis of real human communication. However, this expectation of subject-oriented (in two senses) sophistication is purchased only through creating a restrictive, artificial conversational framework. In everyday human conversation, how many of the following apply?
- The stakes are high; a lot rides on the outcome of the particular conversation;
- Your conversational partner has your fierce, undivided attention and you treat their utterances as if you have (or should have) a similar degree of attention from them;
- The purpose of the conversation is to compare their utterances with those from someone else;
- The comparison is, furthermore, based not on the truth or usefulness of the information imparted by your conversational partners but on the degree to which their utterances qualify as syntactically, logically, and situationally valid.
Obviously this represents a highly idealized view of “standard” human conversation. Indeed, most human conversations would probably fail such a Turing test.
In my next post I want to look at how this kind of Turing test compares with one method for evaluating game AI: the Botprize.