Gain 5 points for each correct answer. Lose 3 points for each wrong answer. Lose 4 points for each question skipped. Gain 6 points for attempting an answer then giving up in disgust.
In my previous post I argued that while the best-known instance of the Turing Test, the Loebner Prize, is ostensibly set up to evaluate the ability of an AI bot to fool a human in simulated conversation, the parameters of the competition focus more on testing a human being’s ability to differentiate between human and machine.
If we turn to the world of electronic games we find something very similar, albeit with some revealing differences. In December 2008, 2K Australia, 2K Games' Canberra studio, sponsored the inaugural Botprize, the "Turing Test for Bots"; a second iteration of the contest has just been played out in Milan. The contest, held in conjunction with the IEEE Symposium on Computational Intelligence and Games, is designed to test the ability of a bot to pass as a human player of a first-person shooter. The format is once again the classic Turing model: a judge faces off against a human and a bot in a deathmatch shootout using a modified version of Unreal Tournament 2004. The test is operating in a different ballpark than the Loebner prize (it's more of a neighborhood sandlot, really) with its offer of a cash prize of only $7,000 and a trip to 2K's Canberra studio. To win the major prize a team needs to fool 80% of the judges (typically 4 out of 5). As we might expect from the long, inglorious history of the Loebner prize, no one has come close to grabbing the major award, which leaves everyone fighting it out for the minor money: $2,000 and a trip to the studio for the team whose bot is judged to have the highest average "human-ness" (their word, not mine, I swear).
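To make that award logic concrete, here is a minimal sketch of how the decision might be computed. The bot names, vote counts, and ratings below are invented placeholders, not actual contest data.

```python
# Hypothetical sketch of the Botprize award logic described above.
# Judge votes and "human-ness" ratings are made-up placeholders,
# not actual contest results.

NUM_JUDGES = 5
MAJOR_PRIZE_THRESHOLD = 0.8  # fool 80% of judges to win the major prize

# For each bot: how many judges it fooled, and each judge's rating.
bots = {
    "bot_a": {"judges_fooled": 1, "ratings": [2, 1, 3, 2, 2]},
    "bot_b": {"judges_fooled": 2, "ratings": [3, 2, 4, 2, 3]},
}

def major_prize_winners(bots):
    """Bots that fooled at least 80% of the judges (4 of 5)."""
    needed = MAJOR_PRIZE_THRESHOLD * NUM_JUDGES
    return [name for name, b in bots.items() if b["judges_fooled"] >= needed]

def minor_prize_winner(bots):
    """Bot with the highest average human-ness rating."""
    return max(bots, key=lambda n: sum(bots[n]["ratings"]) / len(bots[n]["ratings"]))

if not major_prize_winners(bots):
    print("Minor prize goes to:", minor_prize_winner(bots))
```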
To cut a long, but predictable, story short: the bots fail. Miserably. In 2008, two of the five bots failed to convince a single judge; another two convinced only two of the judges apiece. While complete results have yet to be posted for the 2009 prize, the bots as a whole did a little better, with each fooling at least one of the five judges. Woohoo.
Now on the face of it this looks like a very simple challenge. Whichever player kills you and then takes the time to teabag you, that's the human. (There's an idea: let's replace the Turing Test with the Teabag Test; the winner of the Loebner prize under these rules would be the bot that convincingly spews random homophobic insults at you at the slightest provocation.) But seriously, folks...
The frenetic pace of an online deathmatch does make picking the bot in each round a daunting task for the casual gamer. (You can check out a series of short videos from the 2008 contest and try it for yourself.) The judges' notes, however, indicate that they are looking for a set of telltale behaviors: reliance on a single weapon, losing track of a target, or failing to pursue a target, for example, can all be signs of a bot. Yet the Botprize as a whole suffers from the same weaknesses as the Loebner prize. In every round the judge knows that one of the avatars they will be facing is nonhuman, which makes it a contest more focused on their skill at differentiating machines from humans (something tacitly acknowledged by a "best judge" award). Although it is entirely possible to run this test with different configurations (two humans, two bots, the judge always in the dark), there doesn't appear to be any interest in employing a more methodologically varied test.
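For illustration, you could imagine the judges' checklist above distilled into a crude heuristic score. The feature names and weights in this sketch are my own inventions, not anything drawn from the actual judging criteria.

```python
# A crude, invented distillation of the judges' checklist into a heuristic
# score. Feature names and weights are illustrative guesses, not the
# contest's actual criteria.

# Telltale signs mentioned in the judges' notes, with made-up weights.
BOT_TELLS = {
    "relied_on_single_weapon": 0.4,
    "lost_track_of_target": 0.3,
    "failed_to_pursue_target": 0.3,
}

def bot_suspicion(observed: dict) -> float:
    """Sum the weights of every tell the judge observed (0.0 to 1.0)."""
    return sum(w for tell, w in BOT_TELLS.items() if observed.get(tell))

# Example: a player who never switched weapons and gave up on chases.
round_notes = {"relied_on_single_weapon": True, "failed_to_pursue_target": True}
print(f"suspicion score: {bot_suspicion(round_notes):.1f}")  # 0.7
```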
However, while this form of traditional Turing test applied to chatterbots produces a completely artificial and constrained conversational context that bears little relationship to real human conversation, the method does, it is true, have some marginal utility when evaluating bot/human performance in the world of multiplayer FPS games. After all, in the world of online gaming, cheating tools like speed or damage hacks are common enough that most players have either experienced them firsthand or heard of them. Thus, while trying to figure out whether the entity you are facing is human has no relevance to everyday conversation, wondering whether the player you are facing in a game is enhanced, or downright artificial, is a live question!
It is also important to note that the AI design task in each of these Turing tests is very different. In the Loebner prize, designers are faced with the task of "smartening up" their AI to make it capable of holding the kind of relatively sophisticated conversational exchanges that are, somewhat romantically, envisaged to be the stuff of everyday human interaction. When it comes to FPS games, however, it is relatively easy to design AI characters that are all-powerful super-soldiers. Many of us have played games with this kind of AI design (usually not for very long). This is the NPC that can kill you with a single headshot from 500 metres while standing on their head with the weapon clenched firmly between their butt cheeks. Gamers just love that kind of "smart" AI. The challenge for the Botprize designers, therefore, is to dumb the AI down, to make it play more like a fallible human.
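As a minimal sketch of what that "dumbing down" might look like in code, imagine injecting a human-ish reaction delay and some aim error into an otherwise perfect targeting routine. The constants here are guesses for the sake of illustration, not anything an actual Botprize team used.

```python
# Illustrative sketch of "dumbing down" a perfect-aim bot: add a human-ish
# reaction delay and Gaussian aim error. All constants are invented;
# real Botprize entries are far more sophisticated.
import random

REACTION_DELAY_S = 0.25   # humans don't fire the instant a target appears
AIM_ERROR_STDDEV = 3.0    # degrees of jitter around the true bearing

def humanized_aim(true_bearing_deg: float, target_visible_for_s: float):
    """Return an aim direction in degrees, or None if still 'reacting'."""
    if target_visible_for_s < REACTION_DELAY_S:
        return None  # hasn't "noticed" the target yet
    # Perfect aim plus noise; the error could also grow with target speed.
    return true_bearing_deg + random.gauss(0.0, AIM_ERROR_STDDEV)

print(humanized_aim(90.0, 0.1))   # None: still reacting
print(humanized_aim(90.0, 0.4))   # roughly 90, with jitter
```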
Nevertheless, there remains a reluctance in both of these Turing tests to provide a more methodologically varied design, and it is fair to ask why. Part of the reason is undoubtedly that the Turing Test has acquired the status of Holy Writ amongst AI geeks. Despite some debate over what the parameters actually were when Alan Turing first postulated the idea of testing a machine's ability to play the imitation game, rewriting the "rules" seems to be regarded as akin to rewriting the Ten Commandments to remove, say, that pesky adultery clause: it would make life a lot easier and more interesting but, you know, it's just not done!
There is another, more important reason, and it is indicated by a less obvious result of the 2008 Botprize. Of the human players involved in the contest, two managed to convince only two of the judges that they were in fact human. Of the five players, only one convinced all five judges that he was human. These Turing tests are not designed around criteria for meaningfully evaluating AIs; they are instead designed around a set of criteria that supposedly defines human behavior, whether in a conversational or a gaming context. What I suspect people are reluctant to acknowledge, however, is that these criteria are, at best, highly romanticized, and at worst, complete BS. Most human conversational interaction, for example, is completely unlike that imagined by the Loebner prize. Rather than being focused, intense, and driven by evaluative need, most everyday conversations are trivial, characterized by a high degree of inattention, consist mostly of filler, and have no purpose except to keep channels of communication open. Most people just don't have much that is worth saying, and they spend their time saying it badly but saying it a lot.
Were the Loebner prize and the Botprize to be run in a more methodologically sound fashion, I would hazard a guess that one immediate result would be that the number of "humans" who were determined to be machines would rise dramatically, certainly in the case of the Botprize. The patently limited parameters in both these Turing tests, in other words, are designed to prevent us from finding out how truly awful we are at attempting to affirm and enforce the criteria that supposedly render humans distinctive. More disturbingly (or intriguingly, depending on your point of view), it might show how inclined we are already to see one another as species of machine.
—Twitchdoctor
For all I know, this is the point of your next installment, but what would be a more accurate way to test AI? If the problem, as you described it, is that judges evaluate the ability of an AI to fool a human while the test relies on a romanticized version of human behavior, how can that problem be removed?
I’d suggest throwing a giant LAN party and making a few of the participants AI instead of humans. As you pointed out oh so cleverly, players are always on the lookout for possibly enhanced play styles. You can tell if the AI is successful by how many players lash out in a fit of rage, yelling “HAX!” all the way.
Of course, players also lash out in fits of rage and yell "HAX" at humans as well... 😉
I think it would be possible to design a much better version of both the Loebner prize and the Botprize. I bet I'm not the first to point out how methodologically suspect these tests are; I just don't think there's any interest in doing so. The way it stands now, we all get to feel warm and fuzzy about how special our supposedly non-replicable human characteristics are, sigh about how AI research still has a long way to go, and life rolls merrily along.
Before you design a test you need to decide what it is you are testing for. These are not simply problematic tests; they are testing for a kind of AI that I'm not sure is all that useful in general, and certainly not for games. However, your idea of the LAN party (which is certainly a realistic gaming environment) does connect with something that Incanter and I have been talking about: the role of complex environments in attributing intelligence to systems.
Forgive the following nerdy tangent, but I feel that some conversations are incomplete without Star Trek references. A very famous test appears in the second film (Star Trek II: The Wrath of Khan): the Kobayashi Maru test. This test was defined by its impossibility; its purpose was to make cadets feel fear and experience indecision while facing certain death. It was not a winnable simulation because it was not meant to evaluate a cadet's ability to make the right decisions and win; it was meant to evaluate the cadet's character in the face of adversity.
After failing to beat the Kobayashi Maru test, cadets voiced their concern to instructors that the test did not sufficiently showcase or evaluate their ability to command a starship. That frustration is similar to what we are feeling about these Turing tests. We want the tests to motivate the production of better AI to supplement the development of better games, but what the Loebner test really motivates is the production of an AI that, as Twitchdoctor asserts, may not even be useful or have a place in the games we play.
As consumers of games that we want to be intelligently artificial (brilliant title for this blog, by the way), we play the role of the frustrated cadet: frustrated because the tests are evaluating the wrong things. Though they are in a position to do wonders for AI development, these tests examine traits unrelated to our (gamers') concerns. Just as the cadet wants to show off his or her 'booksmart' knowledge of how to respond to specific scenarios, we want to see AI developments suited to the gaming community. And I think we're rightly disappointed that these tests examine criteria that aren't really associated with gaming, especially because they are conducted under the guise that they will somehow advance AI development for games.
Before I become too harsh a critic, I should say that, like everyone else, I'm interested to see the outcome of the test when it's conducted in a more game-like setting (moving to first-person shooters instead of simulated conversations). Still, the methods of judging these AI tests are questionable at best, and if the tests were truly conducted in order to better AI for games, I agree that it's more than reasonable to question those methods until they're changed.
Someone may find a use for an AI entity that is essentially human, but that kind of AI (as Twitchdoctor suggests) isn't necessarily important to gaming. It's not that the Turing tests are useless, or that the Kobayashi Maru test doesn't teach an important lesson; it's that these tests pretend to be something that they aren't: they are preceded by an expectation that they don't fulfill.
waltersthegreat
You are right, no conversation would be complete without such a reference! I really like the Kobayashi Maru example in connection with this discussion for a couple of reasons. It is a test, true, but it is, more importantly, a simulation. Simulations can obviously function as tests (airline pilots regularly undergo different kinds of certification in simulators), so they can have evaluative components. But the main goal is for the subject taking the simulation to learn things about his or her own behavior. In Wrath of Khan, Lieutenant Saavik is annoyed, as you point out, because she expects that this will be a conventional "abilities and skills" test. However, it turns out to be something more nebulous: a test of character. The important difference between this kind of test and the Loebner and Botprizes is that the people running the simulation are pretty clear about what the test is supposed to do and what the results will mean.
I also liked how they used this test again in the most recent Star Trek movie, where we learn the full details of the legend of Kirk beating the test. Now, from one point of view, he cheats. But what he does is simply broaden the frame of the test. He interprets the challenge as being not just the simulation itself, but the need to beat the structure that has created and is administering the simulation. As the movie demonstrates, by "cheating" Kirk in fact fulfills the purpose of the test: he demonstrates something essential about his character when faced with overwhelming odds and the certainty of death. The expected outcome was acquiescence in the face of certain destruction; Kirk's response is to assume that the test is rigged and to look for a way to un-rig it, as it were. (To throw in another nerdy reference, it's kind of a Han Solo moment: "Never tell me the odds!")
If the Loebner prize and the Botprize were to be reworked, I think it would require just this kind of broadening or shifting of the frame. Think how interesting it would be to have an AI that "cheats." I'm not even sure what that would mean.