Gain 5 points for each correct answer. Lose 3 points for each wrong answer. Lose 4 points for each question skipped. Gain 6 points for attempting an answer then giving up in disgust.
In my previous post I argued that while the best-known instance of the Turing Test, the Loebner Prize, is ostensibly set up to evaluate the ability of an AI bot to fool a human in simulated conversation, the parameters of the competition focus more on testing a human being’s ability to differentiate between human and machine.
If we turn to the world of electronic games we find something very similar, albeit with some revealing differences. In December 2008 Aussie developer 2K Games sponsored the inaugural Botprize, the “Turing Test for Bots”; a second iteration of the contest has just been played out in Milan. The contest, held in conjunction with the IEEE Symposium on Computational Intelligence and Games, is designed to test the ability of a bot to pass as a human player of a first-person shooter. The format is once again the classic Turing model: a judge faces off against a human and a bot in a deathmatch shootout using a modified version of Unreal Tournament (2004). The test is operating in a different ballpark than the Loebner prize (it’s more of a neighborhood sandlot, really) with its offer of a cash prize of only $7,000 and a trip to 2K’s Canberra studio. To win the major prize a team needs to fool 80% (typically 4 out of 5) of the judges. As we might expect from the long inglorious history of the Loebner prize, no one has come close to grabbing the major award, which leaves everyone fighting it out for the minor money: $2,000 and a trip to the studio for the team whose bot is judged to have the highest average “human-ness” (their word, not mine, I swear).
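For the arithmetically inclined, the major-prize rule can be sketched in a few lines of Python. This is only an illustration of the 80% threshold described above, not the organizers' actual scoring code; the function names and the representation of a judge's verdict as a boolean are my own assumptions.

```python
from fractions import Fraction

# Assumption: each judge's verdict on a bot is reduced to a boolean,
# True meaning "this judge took the bot for a human".
MAJOR_PRIZE_THRESHOLD = Fraction(4, 5)  # the 80% figure quoted above


def fooled_fraction(verdicts):
    """Return the exact fraction of judges who were fooled by the bot."""
    if not verdicts:
        raise ValueError("need at least one judge's verdict")
    return Fraction(sum(1 for fooled in verdicts if fooled), len(verdicts))


def wins_major_prize(verdicts, threshold=MAJOR_PRIZE_THRESHOLD):
    """True if the bot fooled at least the threshold share of the judges."""
    return fooled_fraction(verdicts) >= threshold


# Hypothetical panel of five judges, two of whom were fooled.
print(wins_major_prize([True, True, False, False, False]))
```

Using exact fractions rather than floats keeps the "4 out of 5" boundary case from being lost to rounding.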
To cut a long, but predictable, story short, the bots fail. Miserably. In 2008, two of the five bots failed to convince a single judge. Two bots convinced only two of the judges. While complete results have yet to be posted for the 2009 prize, the bots as a whole did a little better, with each fooling at least one of the five judges. Woohoo.
Now on the face of it this looks like a very simple challenge. Whichever player kills you and then takes the time to teabag you, that’s the human. (There’s an idea; let’s replace the Turing Test with the Teabag Test: the winner of the Loebner prize under these rules would be the bot that convincingly spews random homophobic insults at you at the slightest provocation.) But seriously folks. . .
The frenetic pace of an online deathmatch does make picking the bot in each round a daunting task for the casual gamer. (You can check out a series of short videos from the 2008 contest and try it for yourself.) However, the judges’ notes indicate that they look for a series of behaviors: reliance on a single weapon, losing track of a target, or failing to pursue a target, for example, can all be telltale signs of a bot. Even so, the Botprize as a whole suffers from the same weaknesses as the Loebner prize. In every round the judge always knows that one of the avatars they will be facing is nonhuman, which makes it a contest more focused on their skills at differentiating machines from humans (something that is tacitly acknowledged by a “best judge” award). Although it is entirely possible to run this test with different configurations (two humans, two bots, and the judge always in the dark), there doesn’t appear to be any interest in employing a more methodologically varied test of this kind.
However, while this form of traditional Turing test applied to chatterbots produces a completely artificial and constrained conversational context that bears little relationship to real human conversation, the method does, it is true, have some marginal utility when evaluating bot/human performance in the world of multiplayer FPS games. After all, in the world of online gaming, cheating tools like speed or damage hacks are common enough that most players are likely to have experienced them firsthand or heard of them. Thus, while trying to figure out whether the entity you are facing is human or not has no relevance to everyday human conversation, wondering about the possibly enhanced or downright artificial nature of the player you are facing in a game is a distinct possibility!
It is also important to note that the AI design task in each of these Turing tests is very different. In the Loebner prize, designers are faced with the task of “smartening up” their AI to make it capable of holding the kind of relatively sophisticated conversational exchanges that are, somewhat romantically, envisaged to be the stuff of everyday human interaction. When it comes to FPS games, however, it is relatively easy to design AI characters that are all-powerful super-soldiers. Many of us have played games with this kind of AI design (usually not for very long). This is the NPC that can kill you with a single headshot from 500 metres while standing on their head with the weapon clenched firmly between their butt cheeks. Gamers just love that kind of “smart” AI. The challenge for the Botprize designers, therefore, is to dumb the AI down, to make it play more like a fallible human.
Nevertheless, there remains a reluctance in both of these Turing tests to provide a more methodologically varied test, and it is fair to ask why. Part of the reason is undoubtedly that the Turing Test has acquired the status of Holy Writ amongst AI geeks. Despite the fact that there is some debate as to what the parameters actually were when Alan Turing first postulated the idea of testing a machine’s ability to play the imitation game, rewriting the “rules” seems to be regarded by people as akin to rewriting the Ten Commandments to remove, say, that pesky adultery clause: it would make life a lot easier and more interesting but, you know, it’s just not done!
There is another, more important reason, and it is indicated by a less obvious result of the 2008 Botprize. Of the human players involved in the contest, two managed to convince only two of the judges that they were in fact human. Of the five players, only one convinced all five judges that he was human. These Turing tests are not designed around criteria for meaningfully evaluating AIs; they are instead designed around a set of criteria that is supposed to define what is believed to constitute human behavior, either in a conversational or a gaming context. What I suspect people are reluctant to acknowledge, however, is that these criteria are, at best, highly romanticized, and at worst, complete BS. Most human conversational interaction, for example, is completely unlike that imagined by the Loebner prize. Rather than being focused, intense, and driven by evaluative need, most everyday conversations are trivial, characterized by a high degree of inattention, consist mostly of filler, and have no purpose except to keep open channels of communication. Most people just don’t have much that is worth saying and they spend their time saying it badly but saying it a lot.
Were the Loebner prize and the Botprize to be run in a more methodologically sound fashion, I would hazard a guess that one immediate result would be that the number of “humans” who were determined to be machines would rise dramatically, certainly in the case of the Botprize. The patently limited parameters in both these Turing tests, in other words, are designed to prevent us from finding out how truly awful we are at attempting to affirm and enforce the criteria that supposedly render humans distinctive. More disturbingly (or intriguingly, depending on your point of view) it might show how inclined we are already to see one another as species of machine.