My First Alexa Skill — Lessons Learned

“Papa! May I watch iPad” — “No”
“Mama! May I watch iPad” — “No”
“And if I don’t wake you up at Sunday 6:30am?” — “Ok! 30 minutes!”

Parenting is not easy and sometimes involves really hard bargaining. In my case it’s typically about how long I allow our kids to watch iPad. Recently I introduced a process where they collect points if they do something extraordinary (e.g. letting us sleep in on the weekends) and can later trade the points for iPad time. You may cringe at my method (my wife certainly does), but it works surprisingly well and the kids like it.

I used to track the points on a piece of paper, which made them very visible and accessible to everyone in the family.

However, as an engineer I found this far too low-tech, so I happily traded the simplicity of pen and paper for the complexity and quirkiness of an Alexa skill.

My goals were 1) to understand the pitfalls of skill development, 2) to build a serverless application, and 3) to understand the DevOps complexity that comes with it.

In this article I’ll share some highlights and lessons learned from developing my first open source Alexa skill.

The basic requirements for the skill are simple:

  • As a parent, I want to give points to my kids, in order to express my gratitude for extraordinary achievements.
  • As a kid, I want to check my point balance, in order to figure out how long I could watch iPad.
  • As a kid, I want to trade points for iPad time, in order to watch Paw Patrol during the week.

During testing I figured out that I needed to simulate a new user and discard all data quickly. At first I added a separate Lambda function for cleanup, but it was impractical to use. I therefore added a Reset intent which simply drops all data and can be used like any other speech command.

For Alexa skill development you need to understand three basic concepts:

  • Intents: something a user wants from your skill, e.g. GrantPoints.
  • Slots: what parameters you expect, e.g. NumberOfPoints, Person.
  • Utterances: examples of how a user would express the intent, e.g. “Grant {NumberOfPoints} points to {Person}”.

When a user talks with Alexa, she will figure out the intent and fill the slots with concrete values. I also use the option to have users confirm some intents (e.g. the Reset intent), and to have Alexa ask for slot values if they were not provided in the first place. That’s called the Dialog Model.
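Put together, the GrantPoints intent looks roughly like this in the interaction model (sketched as a Python dict for readability; the real model is JSON in the developer console, and the sample utterances here are illustrative):

```python
# Sketch of one intent in the skill's interaction model. AMAZON.NUMBER and
# AMAZON.FirstName are built-in slot types; the samples are what users say.
grant_points_intent = {
    "name": "GrantPoints",
    "slots": [
        {"name": "NumberOfPoints", "type": "AMAZON.NUMBER"},
        {"name": "Person", "type": "AMAZON.FirstName"},
    ],
    "samples": [
        "grant {NumberOfPoints} points to {Person}",
        "give {Person} {NumberOfPoints} points",
    ],
}
```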

Even though it’s possible to develop a skill purely “as code”, I found the Alexa development console to be very efficient to get started. It provides a good UI to quickly develop the skill model and to test the intents and utterances.

When I was designing and testing the skill I used examples like this:

“Alexa, grant five points to Elisabeth”.

However when I asked my kids for testing, they made Alexa and me sweat:

“Alexa, I’m Elisabeth and daddy said I get ten points, but not my brother, he woke mummy early today, and I want to watch Paw Patrol with the points, but without him, he has only five left.”

Alexa still has a very long way to go to understand my kids. But by adding significantly more examples (utterances) I was able to train my kids for a far better success rate.

In the skill model you specify the endpoint that Alexa invokes for each speech command. It is possible to invoke an arbitrary HTTPS endpoint, but I chose to handle commands in a Lambda function.

When a user talks with Alexa, she will decide which skill should handle the request (that’s why you need to set an invocation name), detect the intent, fill the slots, and call your endpoint. Although it is not complex to work with the raw JSON input and output, I use the Python ASK SDK to speed up development. As a datastore I use DynamoDB. Given that I don’t expect a massive amount of data per installation, I store all data of an installation in a single record.
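Since I ended up using the SDK, here is only a rough sketch of what the raw JSON exchange would look like for the GrantPoints intent without it (the persistence step is stubbed out, and the response text is illustrative):

```python
# Minimal Lambda handler working directly on Alexa's request/response JSON.
def lambda_handler(event, context):
    request = event["request"]
    if request["type"] == "IntentRequest" and request["intent"]["name"] == "GrantPoints":
        slots = request["intent"]["slots"]
        person = slots["Person"]["value"]
        points = int(slots["NumberOfPoints"]["value"])
        # ... append a "grant" event to the installation's DynamoDB record here ...
        text = f"{person} got {points} points"
    else:
        text = "Sorry, I did not understand that"
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }
```

The SDK hides exactly this plumbing: request routing via handler classes and a response builder for the output speech.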

Illustration of the GrantPoints intent

Note: The Alexa device does not directly talk with the lambda function. There is a huge AWS cloud in between.

I explored several options for the data persistence layer: a normalized schema in an RDS database, fine-grained documents in DynamoDB, or Redis with a key-value persistence design.

So far the first and simplest solution works really well: a single DynamoDB table where each document represents one installation of the skill. As in Event Sourcing, all confirmed user actions are appended to an event log, and all required aggregations are done in code. That’s also supported very well in DynamoDB with code like this:

Appending data to a DynamoDB record
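The two halves of the approach can be sketched like this: appending an event with DynamoDB’s list_append update expression, and replaying the log in code. The key name, event shape, and field names are my assumptions here, not the skill’s actual schema; `table` stands in for a boto3 DynamoDB Table resource.

```python
def append_event(table, installation_id, event):
    """Append one confirmed user action to the installation's event log."""
    table.update_item(
        Key={"id": installation_id},
        UpdateExpression="SET events = list_append(if_not_exists(events, :empty), :new)",
        ExpressionAttributeValues={":empty": [], ":new": [event]},
    )

def balances(events):
    """Replay the event log to compute each person's current point balance."""
    totals = {}
    for e in events:
        sign = 1 if e["type"] == "grant" else -1  # assumed: "redeem" subtracts
        totals[e["person"]] = totals.get(e["person"], 0) + sign * e["points"]
    return totals
```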

At the time of writing, my DevOps setup is very basic: I use the Serverless Framework for managing the AWS CloudFormation stack. Deployment is done manually by executing the serverless deploy command. Source code management is done in Bitbucket. There is no fully automated CI/CD pipeline yet.

Luckily I knew up front that managing a CloudFormation stack by hand can be a cumbersome experience. Leveraging the Serverless Framework allowed me to focus on the business logic and not care too much about deployment and packaging.

As mentioned earlier, I’m using the Python ask-sdk. It’s a regular Python dependency that you add to your requirements.txt file and install with pip.

I found it very tricky to get Python dependencies correctly packaged into my Lambda function. It becomes easier with the serverless-python-requirements plugin. The plugin uses Docker, which is fine, but on my very slow machine it significantly increased the packaging time.

The second major problem was that adding the ask-sdk increased the size of the deployment package by about 12 MB, which in turn led to a much longer deployment time.

My solution works great: I put the dependencies in a Lambda layer and use the layer in my function. It was tricky to set up, but I got it to work thanks to an article by Qui Tang.

During development I did most of the testing in the Alexa Development Console. It’s very convenient as you can type what you would speak and it shows the json input and output of your function.

For real-world testing I invited my kids to play around with the skill on my Alexa. What a painful experience…

For some functions I have simple unit tests that I can easily execute from VS Code without spinning up a local virtual AWS stack.

For automated integration testing I explored using BDD with the behave framework. It allows me to write tests like this:

Scenario: Points are added to existing points
Given I am a new user
When I say "open pointy and grant elisabeth five points"
And I get asked for confirmation
And I say "yes"
And I say "open pointy and give five points to elisabeth"
Then I hear "Elisabeth has 10 points"

A quick search for existing tooling didn’t yield any good results, so I wrote a little wrapper around the skill testing REST API so that the actual step implementations for the features stay very simple.
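The wrapper idea, sketched (all names here are illustrative, not my actual code): the transport that calls the skill simulation API is injected, which keeps the behave steps trivial to write and the wrapper itself trivial to test.

```python
class SkillTestClient:
    """Tiny wrapper used by the behave steps. `simulate` is an injected
    function that sends an utterance to the skill testing REST API and
    returns Alexa's spoken reply as a string."""

    def __init__(self, simulate):
        self._simulate = simulate
        self.last_reply = ""

    def say(self, utterance):
        # Send one utterance and remember the reply for the Then-steps.
        self.last_reply = self._simulate(utterance)
        return self.last_reply

    def heard(self, expected):
        # Case-insensitive substring match, good enough for "Then I hear ..." steps.
        return expected.lower() in self.last_reply.lower()
```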

Leveraging BDD for automated integration testing of skills works really well and can lead to easily maintainable and very expressive tests. It would be even more powerful during requirements elicitation, to share expectations between business owners and the development team.

During real-world testing, one problem showed up quickly that I hadn’t had on my radar at all:

Most of my intents expect a person’s first name as input. Amazon provides the predefined slot type AMAZON.FirstName that can detect a long list of first names. It works great when testing via text input in the development console. In the real world, though, Alexa thinks she can hear the subtle differences between “Elisabeth”, “Elisabet”, “Elizabeth”, etc. and sets the slot value accordingly. To mitigate this, I use the cologne_phonetics algorithm, which detects whether two words sound the same in German. So if you have kids whose names sound almost the same, you probably won’t be happy with my skill…
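To illustrate the idea, here is a heavily simplified sketch of Cologne phonetics. It omits the context-sensitive rules for letters like C, P, D/T and X that the real cologne_phonetics package implements, but it is enough to show why the name variants collapse to the same code:

```python
# Simplified Cologne phonetics: map letters to digit codes, drop adjacent
# duplicates, then drop all zeros except a leading one. "h" carries no code.
CODES = {
    **dict.fromkeys("aeijouyäöü", "0"),
    "b": "1", "p": "1",
    "d": "2", "t": "2",
    **dict.fromkeys("fvw", "3"),
    **dict.fromkeys("gkq", "4"),
    "l": "5",
    "m": "6", "n": "6",
    "r": "7",
    "s": "8", "z": "8",
}

def phonetic_code(word: str) -> str:
    raw = [CODES.get(ch, "") for ch in word.lower()]
    code = []
    for c in raw:
        if c and (not code or code[-1] != c):  # skip "h", drop adjacent duplicates
            code.append(c)
    if not code:
        return ""
    return code[0] + "".join(c for c in code[1:] if c != "0")

def sounds_same(a: str, b: str) -> bool:
    return phonetic_code(a) == phonetic_code(b)
```

“Elisabeth”, “Elizabeth” and “Elisabet” all reduce to the same code, so the skill can treat whichever variant Alexa heard as the same person.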

For a skill to become publicly available, it has to go through a certification process that takes about two days. It is relatively painless if you stick to the skill certification guidelines. Different reviewers will give different feedback though, so try to pass on the first attempt.

What I really struggled with is the built-in LaunchRequest. According to the guidelines it should return a quick welcome message and some hints about what you can do. The implementation looks something like this:

LaunchRequest with reprompt
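Stripped of the SDK sugar, the response such a handler produces looks roughly like this (the JSON shape is the standard Alexa response format; the texts are illustrative):

```python
# The reprompt block holds what Alexa speaks (ask_text) if the user stays
# silent for a few seconds after the welcome message (speech_text).
def launch_response(speech_text, ask_text):
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": speech_text},
            "reprompt": {
                "outputSpeech": {"type": "PlainText", "text": ask_text},
            },
            "shouldEndSession": False,  # keep the session open for an answer
        },
    }
```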

After speech_text, Alexa waits a few seconds and will then re-prompt with ask_text. That works great, but you’re supposed to be able to say “stop”, “cancel”, etc. to cancel the interaction. There is a built-in AMAZON.StopIntent and an AMAZON.CancelIntent which are supposed to do exactly that. However, whatever I tried, I couldn’t get them to work.

To mitigate the problem, I had to implement a custom Cancel intent. It’s super simple and just contains utterances like “stop”, “cancel”, “please stop”, “stop it”, etc. It took me five minutes to implement and five hours to figure it out…

Overall developing an Alexa skill was a great learning experience. My top three takeaways:

  • There are a few pitfalls with skill development, but overall it’s quite painless and straightforward.
  • Serverless is a great fit for skill development and allows you to focus on the business logic. The Serverless framework takes a lot of the deployment concerns away and does not get in your way. Using Lambda Layers can really speed up the deployment process.
  • There’s lots of room for tooling in automated skill testing.