Spoken sentences and words are the heart and soul of a voice experience. It’s in these moments, when we “have the mic” (so to speak), that designers can establish personality; express generosity; and create a sonic world for another person to inhabit.
But what goes into crafting, and critiquing, these spoken sentences? Where visual designs have some foundational pillars–typography, layout, and color–what does conversational design have that’s similar? What is the “color” of conversation? The “layout” of VUI design? The “typography” of speech? What, in other words, are the different disciplines that a conversational designer can draw from to craft and critique a conversational experience?
(Asked still another way, what disciplines does a conversational designer need to be fluent in? If I were hiring, what would I be looking for? If I were training a designer, what would I be drawing from?)
Here are some of the areas that I think are important to understand:
- Linguistics. Written words and spoken words are different. We tend to write in lengthy sentences with a careful structure and a wider vocabulary. We tend to talk in chunks of seven words or so, interrupting ourselves as we go along, and using a simpler, shorter vocabulary. We use more vocatives, we take shortcuts–contractions, ellipsis, other “reduced forms”–and we tend to repeat ourselves, using “bundles” of relatively formulaic phrases. And of course, lest we forget, speech is interactive, which sets it apart entirely from any kind of academic or news-like writing. A conversational designer should know the basics of how the spoken word differs from the written word, and why that’s important–which is fundamentally a linguistic question. And while linguistics is a large and intimidating field, most of the “speech versus writing” questions are tackled in sociolinguistics–a field that also talks about…
- The Properties of Speech. Not only are spoken syntax and grammar different, but there’s an added element: speech is, well, speech. It’s spoken! And so it contains “paralinguistic” properties: breath, tone and intonation, prosody, volume, pitch, the speed at which we speak. Speech also has to be vocalized with a voice that has a certain timbre, or particular qualities (e.g., a baritone, smooth, female voice or a low, gravelly, male voice). A conversational designer needs to know the basics of speech, and how it’s controlled with whatever technology they’re working with, whether that’s a text-to-speech engine or a voice actor in a studio.
- Stance and Persona. Technically, this is directly linked to the first two points, but it bears repeating: speech expresses an attitude toward the other person. We might refer to them as “sir” or “dude,” we might say “Please pass the butter” or (with a blunt imperative) “Pass the butter–now.” All of these suggest emotion and feeling toward the other person in the conversation. Together, these choices also express a personality: bubbly and outgoing, or short and direct, or clear and professional, or casual and friendly. A conversational designer should know how to establish this “art direction” for voice experiences, and what personality they want to project. This is all vital because people will make judgments about your conversational interface’s personality, even when they should know it’s a computer. That’s what people do.
- Memory. Unlike graphical interfaces, which linger in space to be viewed and reviewed, voice interfaces do not linger. Once something is spoken, it’s gone, and resides only in the memory. But memory is limited. So we have to be really aware of cognitive load: we can’t give too many options, nor say too much, in any one turn, lest we overwhelm a person’s memory (or patience). Much of conversational design’s “best practices” comes down to keeping prompts short, sweet, and simple–working with human memory, instead of against it.
- Sound and Music. Traffic, a honk, hammers, and birds–suddenly, you’re in the heart of a bustling city. A soft vibration, gongs, and steady throbs of “om,” and you’re now in a monastery, ready to meditate. The familiar three notes, and suddenly, you’re prepared to hear broadcasters or comedians from NBC. Sound and music can transport you. Or with a short “Ching,” it can inform you (You just got paid!). It can establish mood, or signal the completion of a task. It can change your emotions, or evoke memories. A conversational designer should know the basics of sound: pitch, rhythm, timbre, and melody, and the varieties of information and emotion it can convey.
- Platform Limitations and Opportunities. As much as I’d like to design for Jarvis, most voice interfaces are far dumber than that–the burden of weak AI. People can’t speak with computers as naturally as they’d like and expect to be understood. For example, if someone wants a large pizza with sausage, pepperoni, and pineapple but with gluten-free crust–well, with current limitations, we have to ask for only some of that information at a time. We have to be aware of the limitations, and help the user work with those limitations instead of against them, lest we provoke confusion, frustration, or anger. And we need to be aware of the opportunities each platform and technology affords. These technologies and abilities are always changing; a conversational designer needs to stay abreast of the trends and technologies.
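To make the pizza example a little more concrete, here is a minimal sketch of asking for one piece of information per turn (often called slot filling). It assumes a hypothetical set of slots and prompts that I made up for illustration, and it uses scripted answers in place of real speech recognition; it is not the API of any particular voice platform.

```python
from typing import Dict, List, Optional

# Hypothetical slots and prompts for the pizza order (illustrative only).
SLOT_PROMPTS: Dict[str, str] = {
    "size": "What size pizza would you like?",
    "toppings": "Which toppings would you like?",
    "crust": "What kind of crust would you like?",
}


def next_unfilled_slot(order: Dict[str, str]) -> Optional[str]:
    """Return the first slot that has no value yet, or None if the order is complete."""
    for slot in SLOT_PROMPTS:
        if slot not in order:
            return slot
    return None


def run_dialog(scripted_answers: List[str]) -> Dict[str, str]:
    """Ask for one slot per turn, reading scripted answers instead of real speech."""
    order: Dict[str, str] = {}
    for answer in scripted_answers:
        slot = next_unfilled_slot(order)
        if slot is None:
            break  # every slot is filled; ignore any extra answers
        print("System:", SLOT_PROMPTS[slot])
        print("User:  ", answer)
        order[slot] = answer
    return order


if __name__ == "__main__":
    run_dialog(["large", "sausage, pepperoni, and pineapple", "gluten-free"])
```

Each prompt asks for a single thing, which keeps the turn short and works within what the platform can reliably understand, at the cost of a longer exchange.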
There are other things we need to consider, of course. The general “UX” process of testing with real people; a consideration of context; how voice interacts with graphical interfaces, such as on a smart speaker with a screen; recommended best practices; the nuances of creative writing and crafting a brilliant persona; the drawbacks of VUI design, and discerning which use cases are appropriate for voice and which are not; something of the history of the field. The list could go on. But the above points cover what I think someone should know, first and foremost, to design the foundational artifact of VUI design: prompts and speech. Armed with these concepts, I’ve found that it’s easier to both describe and prescribe the right prompts, and to accomplish whatever goal the user has in mind in the right way.