Voice Conference 2018
Successfully design and develop Voice Assistants and Chatbots
December 5 - 7 | Berlin

29 Oct 2018

Ralf Eggert

Alexa, Google Assistant, Siri, Cortana, Facebook M, Bixby – automated voice interactions are on the rise. We talked to Ralf Eggert, CEO of Travello and speaker at Voice Conference, about current trends and the “multimodal” approach of voice development.

Amazon’s Alexa family continues to grow. It’s no surprise, given the longstanding strong trend towards voice control. How do you see the current development in voice in general?  

Ralf Eggert: We are still at the very beginning of this development. Some people would like to lump the topic in with former “overhyped flops” such as Second Life, but I am convinced that it is here to stay. A spoken interface is simply the most natural way to tell a computer what to do.

In early 2018, the press reported that Amazon alone had advertised more vacancies for its Alexa division than the entire Alphabet Group, including Google, YouTube, etc. [1]. That alone shows how seriously Amazon takes the topic. And Google, after some initial difficulties, does not want to fall behind with the Google Assistant and its Google Home devices. With Microsoft’s Cortana, Apple’s Siri and Facebook’s M, the other large technology companies are also involved. And we haven’t even looked at the Asian market with Bixby and AliGenie yet.

The technical innovations of Alexa and the Google Assistant in recent months alone, or the Echo family devices newly introduced at the end of September, show that we stand at the beginning of a revolution, just as we did in the 80s with the emerging GUI applications, or with the mobile web, which only really took off ten years ago.

So I see the current development very positively.

In your session, you talk about a voice-first approach. As a developer, where do I start and which factors should I consider?

Ralf Eggert: Beginners should first concentrate on the development of Alexa skills and/or Google Actions. There is an abundance of tutorials and documentation to discover, as well as first-step templates for new developers to get started with. There are also many free regional Alexa workshops, e.g. from Amazon, where beginners are introduced to all the important topics. Beyond that, the central developer portals are the first points of entry [2, 3].

The most important thing is, of course, the initial idea: what should be implemented? A template can be adapted quickly, but that is not really fulfilling. As a voice developer, I have to rethink quite a lot, since speech is the only input medium. And while everyone would rightly get upset about a website or smartphone app whose interface or individual buttons constantly looked or behaved differently, it is the opposite with a voice application. In contrast to the eye, the human ear simply wants more variance. If every greeting or prompt (a request to respond) always sounds the same, it quickly becomes boring.
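The variance the ear expects can be handled quite simply in code. A minimal sketch in plain Python (independent of any particular SDK; the phrases are invented examples): keep a pool of greetings and reprompts and pick one at random, so repeated sessions do not sound identical.

```python
import random

# Hypothetical pools of phrases; in a real skill these might live in a
# localization or content file.
GREETINGS = [
    "Welcome back!",
    "Hello, nice to hear from you again.",
    "Hi there, what can I do for you today?",
]
REPROMPTS = [
    "What would you like to do?",
    "How can I help?",
]

def pick_prompt(pool):
    """Return a random phrase so the skill does not sound repetitive."""
    return random.choice(pool)

greeting = pick_prompt(GREETINGS)
reprompt = pick_prompt(REPROMPTS)
```

The same idea extends to error messages and confirmations: any text the user hears more than once benefits from a small pool of variants.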

While it’s not a problem on a website or in an app to display long lists and make them navigable, this simply doesn’t work with voice output. Nobody wants to navigate through dozens of list elements by voice. Currently, only a few of the smart assistant devices on the market have a display. With the voice-first approach, I as a developer must therefore always assume that my users have no display for showing long result lists.

Another aspect that is often forgotten by both developers and companies: only very few voice applications remain unchanged after the release. The first activation is really only the first step of a long lifecycle, which requires a lot of learning at the beginning. Developers should be prepared for this as well.


Your session is called “Multi-Modal Voice Development with Amazon Alexa.” What does “multimodal” mean to you?

Ralf Eggert: In the context of voice assistants, multimodal means that additional output and input media are used alongside pure speech. This is also why the approach is called “voice first” and not “voice only”. To stay in the Alexa world, “multimodal” means additional output in the Alexa app or on a display, such as the Echo Show or the Echo Spot. For input, the touchscreen of the Echo Show or the Echo Buttons can be used.

When developing an Alexa skill, I don’t just have to concentrate on the spoken output, the so-called “output speech”. This is the text spoken directly by the Alexa voice, which can be extended and individually adapted, e.g. with SSML (Speech Synthesis Markup Language). In addition, there is the optional output of information via a so-called card, which can also contain images but no markup tags. And then there is the output on a display, which allows lists or videos in addition to text and images. All of this is sent back to the Alexa service in a fixed JSON format, which is essentially the interface I have to use as a developer.
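The fixed JSON format mentioned above can be sketched as a plain Python dictionary. The field names follow the Alexa Skills Kit response format; the SSML text and card content here are made-up examples.

```python
import json

# A minimal skill response: SSML output speech plus a simple card.
# Structure per the ASK JSON response format; the texts are invented.
response = {
    "version": "1.0",
    "response": {
        "outputSpeech": {
            "type": "SSML",
            "ssml": "<speak>Hello! <break time='300ms'/> Nice to meet you.</speak>",
        },
        "card": {
            "type": "Standard",
            "title": "Greeting",
            "text": "Hello! Nice to meet you.",  # plain text only, no markup
        },
        "shouldEndSession": False,
    },
}

# This is what would be serialized and sent back to the Alexa service.
payload = json.dumps(response)
```

Note how the card repeats the content as plain text: the spoken output may use SSML, but the card may not.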

Amazon recently released the Alexa Presentation Language (APL), a new design language that allows us developers to customize the display of an Echo Show or Echo Spot. With the Google Home Hub, Google has also launched a device that has a display and can therefore offer a multimodal experience for the Google Assistant.
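To give a feel for APL, here is a minimal APL document as it might be sent to a device with a display. The overall structure follows the Alexa Presentation Language; the text content is an invented example.

```python
# A minimal APL document: one Text component in the main template.
# Structure per the Alexa Presentation Language; the text is invented.
apl_document = {
    "type": "APL",
    "version": "1.0",
    "mainTemplate": {
        "items": [
            {"type": "Text", "text": "Hello from an Echo Show!"}
        ]
    },
}
```

In practice such a document is sent to the device inside a render directive of the skill response, alongside the usual output speech.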

Keyword Tools: Which criteria can be used to select tools for your own project?

Ralf Eggert: The choice of tools depends strongly on the technological platform with which, for example, an Alexa skill is to be implemented. What all solutions have in common is that the developer gets in touch with the Alexa Skills Kit (ASK). This is where an Alexa skill is created and configured and the voice interface is defined: intents, slot types and utterances. An intent specifies the intention of the user (what do they want?), a slot type defines a parameter (such as the city in “weather in Hamburg”), and an utterance defines which words the user uses to express that intention. Certification is also initiated in the ASK.
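The relationship between intents, slot types and utterances can be illustrated with a sketch of an interaction model, here written as a Python dictionary. The structure follows the ASK interaction model schema; names like “WeatherIntent” and the sample phrases are invented examples.

```python
# Sketch of an interaction model: one intent with a city slot and a few
# sample utterances. Names and phrases are invented examples.
interaction_model = {
    "interactionModel": {
        "languageModel": {
            "invocationName": "weather helper",
            "intents": [
                {
                    "name": "WeatherIntent",
                    "slots": [
                        {"name": "city", "type": "AMAZON.US_CITY"}
                    ],
                    "samples": [
                        "what is the weather in {city}",
                        "weather in {city}",
                        "tell me the weather for {city}",
                    ],
                }
            ],
        }
    }
}
```

The `{city}` placeholders in the utterances mark where the slot value appears in the spoken sentence; the slot type tells Alexa how to interpret that value.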

After that, it gets interesting. If I want to stay in the Amazon world, I can develop and host my complete skill code on Amazon Web Services (AWS). The developer then has SDKs for Node.js, Python or Java at their disposal. AWS also provides databases and other important tools.

If I decide against AWS and in favor of my own endpoint server, the possibilities are almost unlimited. Since the Alexa service communicates via JSON requests and responses, as a developer I can choose whatever language I want. But then I have to write my own framework or library, or use an open-source variant.
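A framework-free endpoint of the kind described here boils down to: parse the JSON request, dispatch on the request type, return a JSON response. A minimal sketch in Python (the field names follow the ASK request/response format; the reply texts are invented examples, and a real endpoint would also sit behind an HTTPS server and verify the request signature):

```python
import json

def handle_alexa_request(raw_body: str) -> str:
    """Parse an Alexa JSON request, dispatch on the request type and
    return a JSON response string."""
    request = json.loads(raw_body)
    request_type = request["request"]["type"]

    if request_type == "LaunchRequest":
        text = "Welcome to the skill!"
    elif request_type == "IntentRequest":
        intent = request["request"]["intent"]["name"]
        text = f"You triggered the {intent} intent."
    else:  # e.g. SessionEndedRequest
        text = "Goodbye!"

    return json.dumps({
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": request_type != "LaunchRequest",
        },
    })

# Example: handling a minimal LaunchRequest payload.
sample = json.dumps({"request": {"type": "LaunchRequest"}})
reply = json.loads(handle_alexa_request(sample))
```

Because the contract is just JSON over HTTPS, the same dispatch logic could equally be written in PHP or any other language.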

If you prefer not to dive too deeply into the development and can accept certain limitations in what is feasible, then you can use graphical tools such as Storyline, which provide a web interface for designing and developing an idea.

When choosing the right tools for the development, I must first decide whether my idea can be implemented with a graphical tool or not. If not, the question of the programming language and hosting arises. For example, can I live with Node.js and AWS or do I prefer to rely on the know-how of my PHP developer teams and host myself? After I’ve made these decisions, everything else follows.

Thank you very much!






Meet Ralf Eggert at Voice Conference in his sessions: Multi-Modal Voice Development with Amazon Alexa & Develop an Alexa Skill with PHP
