History may remember the decade between 2007 and 2017 as a brief and peculiar period when people barely spoke to their phones. Apple launched the iPhone in June 2007, catapulting the smartphone – and its ubiquitous touchscreen – into the ascendancy. For the next ten years, phone users tapped out texts and tweets, and, in the latter days, flicked balls at cute monsters on street corners.
But the end of the touchscreen era has drawn nigh. Spoken interfaces, powered by machine learning and new insights into the psychology of conversation, are beginning to transform the way we communicate with our devices. Speaking is more intuitive than tapping and swiping through endless menus; and spoken responses make for completely hands-free interactions.
Apple’s Siri, found on its iPhones and iPads, and Amazon’s Alexa, introduced on the Echo, its voice-activated table-top speaker, may be the best-known examples, but there are now dozens of other such “virtual assistants” eager to hang upon your every word. They include the much-hyped Viv, created by a new company founded by the engineers who designed Siri, which promises to be “an intelligent interface to everything”. A recent survey by Creative Strategies, a market-research firm, found that fewer than 5% of iPhone and Android owners had never used their phone’s voice-activated assistant. Yet only around a third use them regularly, suggesting that today’s technology is far from perfect.
“Looking at a little three-inch screen is just some weird thing we do at the moment,” says Axel Roesler, a professor of interaction design at the University of Washington in Seattle. “Things are evolving towards a conversational interface, a partner, that is with you at all times and in any environment.”
For many years, though, computers could not understand even the simplest spoken sentences. That problem has now largely been solved, says Kenn Harper of Nuance, a computer-speech company that has developed several virtual assistants. “In the last five years, the technology has advanced so people can express themselves in lots of different ways, and the system does a pretty good job of deciphering their intent.” Ask Siri about a restaurant, for example, and then say “Can I book a table?”, and the assistant will realise that you are referring to the same place.
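The restaurant example relies on a dialogue state that carries context from one turn to the next, so that a follow-up like “Can I book a table?” resolves to the place just mentioned. The sketch below illustrates the idea only; the class, the keyword matching and the restaurant name are all invented for illustration, not any vendor’s real system.

```python
class DialogueState:
    """Toy dialogue manager that remembers the last entity mentioned."""

    def __init__(self):
        self.last_entity = None  # most recently discussed place

    def handle(self, utterance: str) -> str:
        text = utterance.lower()
        if "restaurant" in text:
            # A real system would use a trained NLU model to extract
            # the entity; here we simply pretend one was found.
            self.last_entity = "Luigi's Trattoria"
            return f"I found {self.last_entity} nearby."
        if "book a table" in text:
            # The pronoun-free follow-up is resolved against context.
            if self.last_entity:
                return f"Booking a table at {self.last_entity}."
            return "Which restaurant would you like to book?"
        return "Sorry, I didn't catch that."

assistant = DialogueState()
print(assistant.handle("Find me a good restaurant"))
print(assistant.handle("Can I book a table?"))  # refers back to the same place
```

Without the stored `last_entity`, the second request would be ambiguous; keeping even this single slot of context is what makes the exchange feel like a conversation rather than two unrelated queries.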
At this level, interacting with a virtual assistant feels much like using a smart search engine. Nuance’s Dragon Go! app and Google Now, a voice-controlled assistant that aims to predict what you’ll be interested in next, are primarily devoted to delivering information as quickly as possible, with impersonal, even robotic, efficiency.
More complicated tasks, however, call for a different approach. “If the assistant needs to know extra information to carry out a task the way you want, there’s a big opportunity to do that through personality, where it’s engaging and fun to talk to,” says Harper. “You can create confidence in the user that the assistant is there to help them.”
These more ambitious virtual assistants, like Siri, Alexa and Microsoft’s Cortana, are able to respond like humans, delivering snappy come-backs and jokes. They are also, to a woman, female. “For whatever reason, a lot of our customers have chosen to bring a female voice assistant into their experience. Some feedback we’ve had is that a female voice comes off as more nurturing,” says Harper – though assumptions about gender roles might also be an influence.
But building a convincing character involves much more than just hiring a mellifluous actress. Interface designers agonise over vocabulary and delivery, choosing the words and phrases appropriate to a persona. They must also give the assistant enough intelligence to respond briefly to simple requests but reveal more of its character when things get more complex. “There are going to be scenarios where its personality comes to life because that’s how you entertain the user and ensure that the right thing is happening,” says Harper.
Some researchers, however, think that any attempt to build artificial personalities will be doomed until computers can be made to understand the feelings of humans. “Many studies have proved that people place more trust in something or somebody with emotional and social competence,” says Björn Schuller, a professor in machine learning at Imperial College London. “If computers are not able to get these signals right, then not only will they not feel natural, they will also feel less intelligent.”
We use more than words to communicate with others. Our brows furrow, our eyes twinkle, our arms are flung open wide or crossed in a huff. Schuller analyses these so-called paralinguistic signals computationally, using audio and video footage of people’s voices, faces and bodies as they speak. Smart watches and fitness gadgets are particularly useful here, measuring arm movements, heart rate and skin conductance.
Schuller uses this data to predict someone’s valence – a psychological term for whether the emotions you experience in a given situation are positive (happiness, joy) or negative (anger, fear). “For valence, we can get in the 80% range for accuracy, which is about the same as a human,” says Schuller. “But for other things, such as recognising whether you’re drunk just from your voice, we’re beyond human performance.”
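In outline, a valence predictor maps paralinguistic features to a positive-or-negative label. The sketch below is a deliberately crude stand-in: the feature names, thresholds and scoring are invented for illustration, whereas real systems of the kind Schuller describes learn such mappings from labelled audio, video and wearable data.

```python
def predict_valence(pitch_variance: float, speech_energy: float,
                    heart_rate: float) -> str:
    """Classify valence as 'positive' or 'negative' from crude features.

    Hand-set thresholds stand in for a trained model: lively prosody
    and animated delivery count as positive cues, an elevated heart
    rate as a negative (stress) cue.
    """
    score = 0.0
    score += 1.0 if pitch_variance > 0.5 else -1.0   # lively prosody
    score += 1.0 if speech_energy > 0.6 else -1.0    # animated delivery
    score += -1.0 if heart_rate > 100 else 0.5       # stress indicator
    return "positive" if score > 0 else "negative"

print(predict_valence(pitch_variance=0.8, speech_energy=0.7, heart_rate=72))
```

An animated speaker with a resting heart rate lands on the positive side; a flat, quiet voice paired with a racing pulse lands on the negative. The interesting engineering lies in extracting reliable features and training the mapping, not in the decision rule itself.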
Once a virtual assistant knows whether it is talking to a sarcastic teenager or a genial drunk, it can tailor its responses accordingly. Schuller has built an assistant called Aria that has none of the depth of knowledge manifested by commercial AIs like Siri or Cortana, but that appears intelligent because it responds to social and emotional cues. “Aria can keep up a conversation of half an hour or even more with random people off the street just by saying ‘Mmmm’, ‘Aha!’ or ‘Tell me more…’ at the right time,” says Schuller.
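The backchannel trick Schuller describes can be caricatured in a few lines: the system never generates content, only well-timed acknowledgements. The timing rule below is a made-up placeholder; a system like Aria would choose its moment from social and emotional cues rather than a simple clock.

```python
def backchannel(seconds_of_speech: float) -> str:
    """Pick an acknowledgement; longer turns invite elaboration."""
    if seconds_of_speech > 10:
        return "Tell me more..."   # encourage a long story to continue
    if seconds_of_speech > 4:
        return "Aha!"              # signal active listening
    return "Mmmm."                 # minimal acknowledgement

# A short remark, a medium turn and a long anecdote each get a
# different nudge to keep the speaker talking.
for turn_length in (2, 6, 15):
    print(backchannel(turn_length))
```

The point of the example is how little machinery is needed: the appearance of attentive intelligence comes almost entirely from responding at the right moment.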
Integrating such emotional sensitivity into today’s devices is not an onerous task, according to Schuller: “Our software runs entirely on a smartphone, in real time, without access to the internet. It’s really just about putting the pieces together.”
Another company thinks it has already succeeded. Jibo is a table-top device, like Amazon’s Echo, that is billed as the world’s first “social robot”. Unlike Amazon’s assistant, Jibo has built-in cameras to recognise faces and track emotions. When it launches in October, says Steve Chambers, CEO of Jibo, the device will enable “social reciprocity” – the natural to-and-fro of human interactions in which each person recognises and responds to the other’s behaviour.
Jibo expresses its own emotions – including “over 20 resolutions of ‘happy’” – with a screen that shows a cartoonish eye, a body that rotates to face whomever it’s speaking to, and multiple tones of voice. Uniquely, Jibo’s AI-powered persona is gendered male. “His character identity is a ten-year-old boy, inquisitive and still learning, sincerely interested in his world,” says Chambers.
As assistants get smarter, more human and easier to talk to, we will inevitably share more of our own desires and fears with them. One day, we might even forget – or stop caring – that we are not talking to a real person at all but to an agent of corporations that do not necessarily have our best interests at heart. Long before then, happily, there are likely to be empathetic virtual assistants aplenty in whom to confide such troubling thoughts.