From conversations at various AstriCons and local Asterisk meetups, I've learned that many people have never tried to set up speech engines to work with Asterisk. This is a quick tutorial on the way we integrate Text-to-Speech and Speech Recognition engines with Asterisk.
Start your Engines
Before you dive into Asterisk, you need to select a speech engine. There are two main types of speech engines: Text-to-Speech (TTS) and Automatic Speech Recognition (ASR). Generally speaking, your choices for TTS engines are more plentiful. There are more vendors in the TTS market and they cover more languages. ASR vendors and languages tend to be more sparse, though coverage for North American languages is good. Mojo Lingo has worked with several over the years: LumenVox, NeoSpeech, Nuance, Cepstral, and AT&T Watson, to name a few. All of these companies provide TTS voices. Only LumenVox, Nuance, and AT&T Watson provide any ASR. All of these except AT&T Watson provide an MRCP interface, which will be the focus of this article.
In case you’re wondering, yes you can mix-and-match TTS and ASR engines. If you find that you prefer the synthesized voice from one vendor, say NeoSpeech, you can still use LumenVox’s ASR at the same time. This is especially useful in situations where you need international language support.
MRCP vs. HTTP
Traditional telephony systems have used MRCP, Media Resource Control Protocol, as the interface between a telephony server (Asterisk) and a TTS or ASR speech engine. MRCP offers several advantages over HTTP: the audio is streamed in real-time to or from the engine, meaning that there is lower delay in processing the audio. MRCP version 2, the most common version, is actually an extension of SIP. This means existing SIP knowledge can be used to troubleshoot it, and existing SIP infrastructure can be used to load balance it.
On the other hand, HTTP is more familiar to developers, especially modern web and mobile developers. However, HTTP is not only a slower interface; Asterisk also has no native support for HTTP speech engines. As such, we recommend using MRCP whenever connecting a speech engine to Asterisk.
MRCP in Asterisk
The best way to connect Asterisk to an MRCP server is to use the UniMRCP package. UniMRCP consists of a library that provides MRCP support, as well as a suite of native Asterisk applications to interface with MRCP servers from the Dialplan.
Installation instructions for UniMRCP on Asterisk can be found on the UniMRCP site.
Once you have UniMRCP installed and loaded in Asterisk, you will have three new Asterisk Dialplan applications:
- MRCPSynth: for text-to-speech
- MRCPRecog: for speech recognition
- SynthAndRecog: for combined TTS + ASR
Each of these applications is documented in detail in the official UniMRCP documentation.
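As a quick taste, the first two applications can be used on their own like this (a minimal sketch: the extension numbers, prompt text, file path, and grammar URL are hypothetical, and the option names follow the UniMRCP documentation):

```
; Speak a sentence with the engine's default voice, in US English
exten => 100,1,MRCPSynth("Welcome to our demo.",spl=en-US)

; Play a prompt and recognize the caller's speech against a grammar served over HTTP
exten => 101,1,MRCPRecog(http://127.0.0.1/documents/yes_no.xml,f=/srv/app/prompt.wav&b=1)
```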
Suppose we want to build a simple IVR in Asterisk Dialplan with the following requirements:
- We want to play the audio file '/srv/app/corp_ivr.wav'
- We want to allow callers to speak various responses, like "Sales", "Support", or "Operator"
- We also want to allow callers to press buttons: 1 for Sales, 2 for Support, or 0 for Operator
- We want to allow the caller to "barge in," interrupting the prompt rather than forcing them to wait until it finishes playing
- We want to reject any result with a speech recognition confidence lower than 40%
- We'll also assume this is in US English only
To do this, we need to pass three arguments to SynthAndRecog:
- The first argument is the audio prompt to play: file:///srv/app/corp_ivr.wav
- The second argument is the list of grammar URLs, one each for speech and DTMF, separated by commas: http://127.0.0.1/documents/corporate_ivr.main_menu_voice,http://127.0.0.1/documents/corporate_ivr.main_menu_dtmf
- The third argument is the list of options, separated by ampersands: b=1 to enable barge-in, spl=en-US for US English, and ct=0.4 for the 40% confidence threshold (see above for the link to documentation on the full set of available options)
Here’s our completed example:
```
exten => s,1,SynthAndRecog("file:///srv/app/corp_ivr.wav","http://127.0.0.1/documents/corporate_ivr.main_menu_voice,http://127.0.0.1/documents/corporate_ivr.main_menu_dtmf",b=1&spl=en-US&ct=0.4)
```
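For reference, the voice grammar referenced by the first URL might look something like the following SRGS document. This exact file isn't part of the setup shown here; it's a hypothetical sketch matching the Sales/Support/Operator menu:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" root="main" mode="voice">
  <rule id="main" scope="public">
    <one-of>
      <item>sales</item>
      <item>support</item>
      <item>operator</item>
    </one-of>
  </rule>
</grammar>
```

The companion DTMF grammar would take the same shape with mode="dtmf" and items for 1, 2, and 0.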
If you look at the docs, you might be overwhelmed by the number of options available. The good news is that most of those options come with sane defaults, and in most cases, you won’t ever need to change them. A rule of thumb when working with speech engines: When in doubt, trust the defaults provided by the vendor. They’ve spent a lot of time tuning their software!
Alternatively, we would recommend checking out the Adhearsion framework. Adhearsion makes developing Asterisk applications a lot easier by providing standardized and well-documented tools in a real programming language. Adhearsion has native support for using Asterisk’s MRCP connection.
In Adhearsion, the same thing would look something like this:
```ruby
prompt = 'file:///srv/app/corp_ivr.wav'
grammar_urls = [
  'http://127.0.0.1/documents/corporate_ivr.main_menu_voice',
  'http://127.0.0.1/documents/corporate_ivr.main_menu_dtmf'
]

ask prompt, grammar_url: grammar_urls, interruptible: true
```
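To act on the caller's choice, you can branch on the result that `ask` returns. This is a sketch assuming Adhearsion 2's result API (`status` and `interpretation`); the dial targets are hypothetical:

```ruby
result = ask prompt, grammar_url: grammar_urls, interruptible: true

case result.status
when :match
  # The recognizer matched one of the grammar items
  case result.interpretation
  when 'sales'    then dial 'SIP/sales_queue'
  when 'support'  then dial 'SIP/support_queue'
  when 'operator' then dial 'SIP/operator'
  end
when :nomatch
  say 'Sorry, I did not understand.'
end
```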
One last warning
Asterisk Dialplan and Asterisk AGI have a hard-coded limit that prevents passing more than 1024 characters to any Dialplan application. This limit can really come back to bite you if you end up using long speech recognition grammars or text-to-speech documents. Fortunately, MRCP allows you to reference grammars and documents by URL. We strongly recommend that developers deliver speech recognition grammars (SRGS) and text-to-speech documents (SSML) from an external HTTP server whenever possible, as we did in the examples above.
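To see how quickly that limit is reached, compare an inline document to a URL reference. This is a standalone Ruby illustration; the SSML text is made up, with repetition standing in for a realistic prompt set:

```ruby
# A modest SSML document, inlined as a string
inline_ssml = '<speak>' + ('<s>Welcome to our corporate directory.</s>' * 30) + '</speak>'

# The same content referenced by URL, as MRCP allows
url_reference = 'http://127.0.0.1/documents/corporate_ivr.main_menu_voice'

puts inline_ssml.length > 1024   # the inline form already exceeds Asterisk's limit
puts url_reference.length        # the URL stays a few dozen characters
```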
Want to know more?
Mojo Lingo provides consulting services for speech-driven telephony applications. Contact us today to learn more about how we can help.