
Inside Amazon’s $3.5 million competition to make Alexa chat like a human

Illustrations by Alex Castro

Onstage at the launch of Amazon’s Alexa Prize, a multimillion-dollar competition to build AI that can chat like a human, the winners of last year’s challenge delivered a friendly warning to 2018’s hopefuls: your bot will mess up, it will say something offensive, and it will be taken offline. Elizabeth Clark, a member of last year’s champion Sounding Board team from the University of Washington, was onstage with her fellow researchers to share what they’d learned from their experience. What stuck out, she said, were the bloopers.

“One thing that came up a lot around the holidays was that a lot of people wanted to talk to our bot about Santa,” said Clark. “Unfortunately, the content we had about Santa Claus looked like this: ‘You know what I realized the other day? Santa Claus is the most elaborate lie ever told.’”

The bot chose this line because it had been taught using jokes from Reddit, explained Clark, and while it might be diverting for adults, “as you can imagine, a lot of people who want to talk about Santa Claus … are children.” And telling someone’s curious three-year-old that Santa is a lie, right before Christmas? That’s a conversational faux pas, even if you are just a dumb AI.

Teaching a machine how to have a real conversation is one of AI’s hardest challenges

This sort of misstep perfectly encapsulates the challenges of the Alexa Prize, a competition that will help shape the future of voice-based computing for years to come. On the face of it, Amazon isn’t asking much: just create a chatbot using Alexa that can talk to a human for 20 minutes without messing up, and you get a $1.5 million prize (with $2 million in other grants and prizes). But as Clark’s anecdote illustrates, this is still beyond the capabilities of current technology. There’s just so much that computers don’t know about the world, and there’s no easy way for us to teach them. “Don’t ruin Christmas for small children” isn’t a lesson that translates easily into code.

That’s why Rohit Prasad, chief scientist for machine learning at Alexa, compares the prize to the DARPA Grand Challenge, a series of competitions held by the US defense research agency in the mid-2000s to spur work on self-driving cars. Early entrants failed to even finish the course, but the million-dollar rewards on offer galvanized research. Prasad says he hopes the Alexa Prize will have a similar effect on conversational AI.

Each of this year’s eight teams, selected from universities around the world, will be building their chatbots using Amazon’s resources: basic speech recognition tools from Alexa, free computing power from Amazon Web Services, and stacks of training data from tens of millions of Alexa users. Last month, the bots went online in America, and feedback from users will help teams improve before the judging process in November. (If you live in the US and want to talk to a bot right now, just say, “Alexa, let’s chat” to any Echo device, and you’ll be paired with one of the teams’ bots at random.) In last year’s inaugural competition, the University of Washington’s chatbot was able to make successful conversation for just over 10 minutes on average, which still leaves the grand prize up for grabs this year.

Teams get to compete, and Amazon gets to pick talent

Amazon isn’t doing this simply for the benefit of the academic community, of course. By organizing the Alexa Prize, the company gets some of the smartest minds in AI queueing up to build technology on its platform. It also gets the opportunity to hire any particularly promising researchers and essentially crowdsource future technological paths for its AI assistant. As Prasad notes: “Every technology built as part of the Alexa Prize is applicable to Alexa.” When I ask the teams about this, though, none of them feel they’re being taken advantage of. As one researcher told me: “It’s cheap for them, but also great for us.”

Prasad imagines a future where Alexa can hold human-like conversations, chatting about topics like movies, news, and sports, and answering questions about the details that humans care about — not just the questions machines can answer. It’s the difference between asking “Who won the NBA finals?” and “How did LeBron James do last night?” A more talkative Alexa would become more personable, an entity users could relate to “as a friend, as a companion,” says Prasad.

But dreaming of a virtual assistant with the dry wit of Iron Man’s Jarvis is only possible because it’s a familiar fantasy. As soon as you talk to Alexa (or Siri or Google Assistant), you realize that our current AI assistants are, conversationally, brick-stupid. At best, they handle basic commands like “start a five-minute timer” with quiet competence; at worst, they misinterpret any sentence containing more than a single clause. It’s true that there is huge potential for Alexa to become part of the family, as Prasad envisions, but this says more about our instinct to anthropomorphize the world around us than it does about the technology’s core capabilities. The question is: how do we teach Alexa to really talk?

For the teams at this year’s Alexa Prize, there are two basic approaches for solving this huge task. The first is to use machine learning, specifically deep learning, to analyze large amounts of data and slowly sift out the patterns of a normal conversation. This is very much the most exciting and up-to-date option, but, as team after team tells me, it’s also the most impractical. As one competitor put it: “Everyone starts with machine learning, and eventually, everyone realizes it doesn’t really work.”

“Everyone starts with machine learning, and eventually, everyone realizes it doesn’t really work.”

Why? Because human speech is a combination of strict rules and wild variety that’s difficult for AI systems to learn from data alone. Speech contains many unforgiving rules related to grammar, spelling, and tone. But it’s also a space of boundless imagination, where you convey the same basic information using a near-infinite variety of words. Machine learning is fantastic at learning vague rules in restricted tasks (like spotting the difference between cats and dogs or identifying skin cancer), but it can’t easily turn a stack of data into the complex, intersecting, and occasionally irrational guidelines that make up modern English speech. And while AI is pretty good at coming up with new data that matches examples it’s already seen (like drawing portraits of fake celebrities after seeing a lot of red carpet headshots), it struggles to do so in the world of language without making mistakes that give the game away.

The other approach to this task involves writing specific rules and templates for a chatbot to follow, a type of AI design known as “handcrafting” or “hardcoding.” So, for example, if a user says the words “favorite team” and phrases their utterance like a question, the computer might scan for references to specific sports, find a mention of “baseball,” and then spit out a prewritten reply: “My favorite team is the Yankees.”
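
To make that concrete, here’s a minimal sketch in Python of the kind of rule described above. The trigger pattern, sports keywords, and canned replies are all invented for illustration; a production bot would maintain thousands of such rules.

```python
import re

# Hypothetical handcrafted rule: a trigger pattern plus canned replies,
# keyed by the sport mentioned in the user's utterance.
FAVORITE_TEAM = re.compile(r"favorite team", re.IGNORECASE)
REPLIES = {
    "baseball": "My favorite team is the Yankees.",
    "football": "I root for the Packers.",
}

def handcrafted_reply(utterance: str) -> str | None:
    """Return a prewritten reply if the rule fires, else None."""
    # Crude check that the utterance is phrased like a question.
    if not utterance.rstrip().endswith("?"):
        return None
    if FAVORITE_TEAM.search(utterance):
        # Scan for references to specific sports.
        for sport, reply in REPLIES.items():
            if sport in utterance.lower():
                return reply
    return None  # no rule fired; fall back to another strategy

print(handcrafted_reply("I love baseball. What's your favorite team?"))
# -> My favorite team is the Yankees.
```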

This approach yields consistent results, but it’s time-consuming to design, makes a lot of mistakes, and can only handle a limited number of topics. It’s been proved many times before that rudimentary, hard-coded chatbots can get quite far in apparently complex conversations. (Consider ELIZA, a “psychiatrist” chatbot from the 1960s that famously got by using only stock phrases like “Can you elaborate on that?” and “How does that make you feel?”) But a purely handcrafted bot would eventually run afoul of the Alexa Prize’s judges. Not only would they find the conversation stilted, but the bot wouldn’t be able to chat about up-to-date topics like breaking news — something that came up frequently in last year’s judging.

The way forward, say the teams, is to blend these approaches. You combine the creativity of machine learning with the formal structures of handcrafting. Some of your bot’s smarts are generated by pools of data, and some of it comes from prewritten rules. And, sometimes, you cheat.

The team behind the Fantom chatbot from Sweden’s Royal Institute of Technology (KTH) wouldn’t call their approach cheating, of course. But it is, shall we say, a creative way to build a chatbot by blending AI and human labor in a surprising fashion.

Faced with the technical challenges above, the Fantom team said they didn’t want to risk training a machine learning chatbot on datasets from the internet. “If you are scraping from Reddit, you have no control over the content,” Fantom’s Gabriel Skantze told The Verge. Instead, they turned to another Amazon product: the Mechanical Turk.

Amazon Mechanical Turk is a marketplace for jobs requiring human intelligence but no real training. The tasks are often repetitive and laborious and include things like audio transcription, data entry, and identifying objects in photos and videos. (Not coincidentally, these are things that AI is being used to automate, and the Mechanical Turk is an indispensable tool for many AI researchers who need to generate training data or test their systems.) In the case of the Fantom team, they decided to use the marketplace to write the responses for their chatbot. Each query they receive is sent on to a human Turker who composes a reply and sends it back. It’s an automated process, but humans are doing the work for the machines.

When I point out to Fantom’s researchers that this sounds a little too much like bypassing the AI component of this challenge, they happily disagree. Although their bot will use humans to generate replies, there will be a strong machine learning element in how these replies are fed into a conversation. Every time the bot hears a new question to which it doesn’t already have a response, it sends it off to the Turkers to write one, adding their reply to a huge dialogue tree. Machine learning will help recognize when questions are just variations of ones the bot has already encountered. For example, the bot will know that after it’s replied to the query “I like football. What’s your favorite team?,” it can use the same response when asked, “What’s your favorite football team?”
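
As a rough illustration of that lookup-and-escalate loop, here’s a sketch in Python. Fantom hasn’t published its matching model, so a simple string-similarity score stands in for the learned component; the stored question, the reply, and the threshold are all invented.

```python
from difflib import SequenceMatcher

# A toy dialogue "tree," flattened here to question -> Turker-written reply.
RESPONSES = {
    "i like football. what's your favorite team?": "I'm partial to Real Madrid, honestly.",
}

THRESHOLD = 0.6  # assumed cutoff; a real system would tune this

def respond(question: str) -> str:
    """Reuse a stored reply if the question is close to one we've seen;
    otherwise escalate to a human and grow the tree."""
    q = question.lower().strip()
    best, best_score = None, 0.0
    for known in RESPONSES:
        score = SequenceMatcher(None, q, known).ratio()
        if score > best_score:
            best, best_score = known, score
    if best is not None and best_score >= THRESHOLD:
        return RESPONSES[best]
    # Novel question: this is where Fantom's pipeline would send it
    # off to Mechanical Turk and add the human reply to the tree.
    RESPONSES[q] = "(pending Turker reply)"
    return "Good question. Let me get back to you on that."

print(respond("What's your favorite football team?"))
# Close enough to the stored question, so the stored reply is reused.
```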

“Over time, we’ll develop more and more intelligent strategies to traverse the tree,” says Ulme Wennberg. “To be able to understand what we have been talking about, what you want to talk about, what we should talk about.”

Good conversational AI requires personality as well as patter

The team is also putting a lot of work into crafting the persona of their bot, a key component in building a compelling conversational partner. Their resident linguist, Mattias Bystedt, created a “huge, beautiful document” of popular US celebrities and their personality types to find out “what most appeals to Americans.” He correlated the most common traits and wrote personality prompts based on these to guide the Turkers’ responses. “We’re creating a character just like you would in a TV show,” says Jonas Ivarsson. “And to do that, we had to find out what people are drawn to.”

Fantom’s approach may seem hacky, but it’s inventive, and it makes apparent an often-overlooked fact of AI: a lot of automation starts as data created by human labor. Other teams in the Alexa Prize who are creating their chatbots’ responses with machine learning will still have to train their bots on something, and most will turn to a small number of commonly used, human-generated data sources, like Reddit, Twitter, and transcribed movie dialogue. We may be using machines to build conversational bots, but it all starts as human chatter.

For the team from Brigham Young University in Utah, finding a source of training data was easy: they turned to their fellow students. To collect data, the group behind the university’s Eve chatbot set up the Chit-Chat Challenge — a campus-wide competition that asked students to submit transcripts of their conversations. Chats could be about anything, the team said, but couldn’t include personally identifying information or topics that wouldn’t appeal to people outside the university. Transcripts were scored on length and originality, and the top entrants won prizes like an iPad and a MacBook Pro.

BYU’s Nancy Fulda says the challenge was a runaway success, generating reams of useful data. “The internet gives us loads of text, but none of it is a real analog for human conversation,” she says. Like other competitors, the Eve team had originally tried training a bot on Reddit data but thought the resulting AI was too “confrontational” to be pleasant. Plus, says Fulda, BYU is a religious university that is owned and run by the Mormon Church, so the students “share a worldview,” which, in turn, creates a “more cohesive personality for the bot.” (“We did tell them to keep religion out of the chat,” she adds.)

So far, it’s similar to team Fantom. But while the researchers from KTH plan to recycle their human-generated responses verbatim, Fulda and her teammates are going to try to train a machine learning system on their data to write its own dialogue. Won’t that be difficult? I ask, thinking of all the horror stories I’ve heard at the Prize about AI-generated language. Not at all, they say; they just need to turn words into numbers first.

“How familiar are you with word embedding?” asks Fulda. I mumble something half-hearted about “surface-level knowledge” and “a passing familiarity, perhaps.” “Great,” she replies, “because this is my jam.”

As Fulda explains, to train machines to play with words, you start by asking them to read Wikipedia — “the entire body of Wikipedia.” A neural network scans the text in small windows, centering on one word at a time, but also glimpsing in its peripheral vision the three or four words on either side of it. From this, it learns to predict which words tend to appear alongside one another and turns this data into what’s called a “vector representation.” You can imagine these vectors as points in 3D space, although the number of dimensions involved is far greater than three (usually in the hundreds).
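
This windowed training is essentially the word2vec technique, and the gensim library’s implementation makes the idea easy to see in code. A minimal sketch, with a three-sentence toy corpus standing in for Wikipedia (parameter names follow gensim 4.x):

```python
from gensim.models import Word2Vec

# Toy corpus; the real pipeline would feed in all of Wikipedia,
# tokenized into sentences.
sentences = [
    ["she", "ate", "an", "apple", "and", "a", "pear"],
    ["he", "peeled", "an", "orange"],
    ["the", "king", "addressed", "the", "crowd"],
]

# window=4 is the "peripheral vision": up to four words on either side
# of the center word. vector_size sets the number of dimensions,
# typically in the hundreds for a real corpus.
model = Word2Vec(sentences, vector_size=100, window=4, min_count=1)

vector = model.wv["apple"]  # a 100-dimensional vector for "apple"
```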

The position of the vector is arbitrary, and by itself, it doesn’t capture any meaning related to the word. But the relationship of one vector to another does. “By looking at different words in this space, you can infer properties about them,” explains Fulda. “So words like ‘apple,’ ‘pear,’ and ‘orange’ are all going to be closely located to one another. While words like ‘disestablishmentarianism’ will be way off somewhere else.”

This means you can do math with words. For example, if you take the vector representation for the word “king,” subtract the vector for “man,” and add the vector for “woman,” you end up somewhere in the vicinity of “queen.” It’s not precise, says Fulda, but it’s a fantastic way to let machines process semantic information as quickly as they process numbers. Even more exciting is that while this sort of calculation has previously been limited to words, recent advances in machine learning mean the same tools are now beginning to be applied to whole sentences. “For us, it opens up huge possibilities,” she says.
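
With pretrained vectors, that arithmetic takes one line. A sketch using the GloVe vectors packaged with gensim’s downloader (any pretrained word vectors would do, and the exact scores will vary):

```python
import gensim.downloader

# Pretrained 100-dimensional GloVe word vectors.
wv = gensim.downloader.load("glove-wiki-gigaword-100")

# king - man + woman lands in the vicinity of "queen".
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Related words sit close together...
print(wv.similarity("apple", "pear"))      # relatively high
# ...while unrelated words sit far apart.
print(wv.similarity("apple", "geometry"))  # much lower
```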

With the ability to turn sentences into vector representations, the BYU team hopes they can provide a better feedback loop for the replies generated by their AI system. This, they say, will allow the system to teach itself without human intervention, which means quicker training and improved results. Fulda stresses that the team’s work is in its infancy at this point, but she is hopeful they’ll be able to create a better chatbot with these new tools.

“Oh my heck, this is phenomenal! This can do so much!”

“I feel like every researcher has some [method] that is special to them,” says Fulda. “That thing where they’re like, ‘Oh man, I see all the possibilities!’ For me, embedding is like that. I look at embeddings and I think, ‘Oh my heck, this is phenomenal! This can do so much!’”

Handcrafting will get you only so far in building a chatbot, and machine learning will take you a little further. But the more researchers I talk to, the clearer it becomes that there are other elements at play. Machine learning is, of course, an engineering discipline. But building AI systems that interact with people takes something closer to artistry. You have to have an instinct for what works and the patience to slog through what doesn’t.

For Sounding Board, the team from the University of Washington that won last year’s competition, this was the real challenge. They say they started their work with “grandiose” ambitions to build a deep learning bot that wouldn’t just talk to you but debate you. They soon scaled this back and decided, in their own words, to “start hacking.” After all, they thought, if the goal is simply to keep a conversation going, there are lots of ways to do that. Along with other teams, they tried all types of tricks to stop people from hanging up on their chatbot. They told jokes, they made users take a personality quiz, they asked them to play games. One rival team even found a way to mess with Alexa’s voice in an effort to keep people engaged. “Their bot would say things like ‘please stay’ in a really cutesy way. It would whisper and get louder. They tried to turn it into a show!” says Ari Holtzman.

Amazon eventually clocked what was going on and introduced stricter rules for the competition: no games, no quizzes, and, please, take it easy on the jokes. But what the Sounding Board team learned from this flurry of experimentation was that you need to strike a balance with the user: give them what they want but not too much of it.

Last year’s champions won by listening to the user

Sounding Board’s approach ended up being a blend of handcrafted rules and machine learning smarts. To stock their conversational larder, they pulled jokes from Reddit and facts from Wikipedia, and they devised a bot architecture in which a central dialogue manager handed off conversations to subsidiary “mini skills.” Each of these had a different specialty: one could talk about movies, one could read the news, another could crack jokes, and so on. This meant the team could tailor their conversations to their users, following the direction each user wanted the conversation to go. They even added a subsystem that kept track of users’ emotions and verbal feedback, which helped the team follow the conversational wind, like a sailor bringing a ship safely into port.
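
Here’s a minimal sketch of that hub-and-spoke design in Python. The skill names, keyword routing, and feedback words are simplified stand-ins; Sounding Board’s actual dialogue manager is far more sophisticated.

```python
class MovieSkill:
    """Mini skill specializing in movie chat."""
    def can_handle(self, utterance: str) -> bool:
        return "movie" in utterance.lower()

    def respond(self, utterance: str) -> str:
        return "Speaking of movies, have you seen anything good lately?"

class NewsSkill:
    """Mini skill specializing in the news."""
    def can_handle(self, utterance: str) -> bool:
        return "news" in utterance.lower()

    def respond(self, utterance: str) -> str:
        return "Here's a story that's been in the headlines today..."

NEGATIVE_FEEDBACK = ("boring", "terrible", "awful")

class DialogueManager:
    """Central hub: watches user feedback, hands each turn to a mini skill."""
    def __init__(self, skills):
        self.skills = skills

    def turn(self, utterance: str) -> str:
        # Acknowledge negative feedback and change topic, as Clark
        # describes below.
        if any(word in utterance.lower() for word in NEGATIVE_FEEDBACK):
            return "I'm sorry to hear that. Let's try something else."
        for skill in self.skills:
            if skill.can_handle(utterance):
                return skill.respond(utterance)
        return "Interesting. Tell me more."

manager = DialogueManager([MovieSkill(), NewsSkill()])
print(manager.turn("I watched a great movie yesterday"))
```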

“If someone says, ‘that’s boring, that’s terrible, that’s awful,’ we detect it and say we’re sorry to hear that,” says Elizabeth Clark. “We always want to acknowledge what the user has said, then change the topic and offer up some new direction.” And guess what? It worked. The team says they’re happy to sit out the prize’s second year. When someone at a panel session asks why they’re not returning, their supervisor quickly shouts, “Because they need to finish their PhDs!”

Speaking to Amazon executives at this year’s Alexa Prize, there is a sense of determination. Voice-based computing is the future, they say — or, to be more precise, it’s part of the future. No one tells me that my laptop and phone are going away anytime soon, but they’re convinced voice commands will join these objects in a holy trinity of digital interfaces.

“I firmly believe that ambient computing [is] here to stay,” Amazon’s head of devices, Dave Limp, told The Verge. “And I wouldn’t have said that a year ago. We were still not there.” Now, says Limp, there’s enough momentum in the form of customers, hardware, and potential use-cases to “call the ball and say that this is likely to be a paradigm that’s sticking around.”

If ambient computing does become established, then Amazon has a very good chance of leading the field. Unlike Google Assistant and Apple’s Siri, says Limp, Alexa was never developed with a mobile phone in mind. That means it’s never had to compete with an existing user interface, and it’s hardware-agnostic. “We got to start with our assistant from a clean whiteboard,” he says.

Just as importantly, Amazon treats Alexa as a platform as much as a product. The team responsible for the voice assistant isn’t just making new features for consumers; they’re also building tools so other companies can use Alexa for their own products and services. This is why Alexa shows up in everything from alarm clocks to cars. In many ways, the company’s strategy with the voice assistant mirrors that of Amazon Web Services (AWS), its cloud computing service. AWS is all about facilitating other companies’ ambitions by providing them with processing power and data storage on tap. With Alexa, Amazon is trying to give them a voice interface. If ambient computing is the future, then Alexa could become the dominant operating system, the Windows of voice computing.

“In a weird way, in my job, I get to look into the future a little bit,” Limp tells me. “I can see what component manufacturers are making, what our algorithm teams are coming up with, what the state-of-the-art of science is. And I and my teams can connect those dots a bit, and take a risk on what customers might want in the future.” What he thinks they’ll want is a way to talk to the air and command the digital world around them.

All this helps to explain why the Alexa Prize is so important to Amazon: the company wants to know that voice interfaces have a future beyond their current capabilities. But it doesn’t help us understand the real challenge of conversational computing. With Amazon’s commercial clout and expertise at building and propagating computing platforms, it wouldn’t be surprising if Alexa became the world’s default voice interface. But if the best we can hope for is AI that understands simple voice commands as consistently as your phone recognizes taps on its display, does that change the world? No. It just means we look at computer screens a little less.

An AI that can talk like a human might become our most addictive gadget yet

Talking to teams at the Alexa Prize, it’s clear that the AI community is dreaming much, much bigger than this. If we could genuinely talk to our devices — not just command but converse with them — then our relationship with the digital world would be upended. Imagine a computer that talks like a human but has the knowledge of the internet and the patience and flexibility of a machine. Imagine if it sounded like your favorite celebrity or even someone you knew and loved. Imagine how much time you would spend with such a device. It would make our codependent relationships with our phones look healthy by comparison.

But we’re a long way from that.

The researchers were upbeat about the chances of someone claiming the grand $1.5 million prize when the bots are judged later this year. But they also agreed that real conversation was still a distant prospect. When I asked the Sounding Board team what the unsolved challenges of conversational AI are, they answered with a checklist that would intimidate a minor deity: Depth, said one. Understanding, said another. Wisdom, said a third. As Hao Fang, the team’s leader, summarized it, “We can’t have deeper conversations because we can’t understand everything the user says, and we can’t understand what the text we’re learning from says.”

These aren’t trivial challenges. And if you speak to the Prize bots yourself, even accepting the fact that they’re currently in their underdeveloped, larval stage, you can sense the gulf in understanding as clearly as you can hear that the voice you’re speaking to is artificial. The techniques being developed by the teams at this year’s Alexa Prize are ingenious and worthy of praise, but chatbots still have a long way to go before they match humanity in its gift of the gab.

Talk, it turns out, is tough.