Serverless Smart Radio — Part II — Step functions

By Bahadir Cambel

You’re currently reading Part II of the Serverless Smart Radio article. Head over to Part I if you missed it.
In Part I, we introduced our Smart Radio system and the technical components involved.

In this article we will talk about Step Functions: first a short introduction and a look at who is using them, then a deep dive into the workflow of our Smart Radio system.

Design and run workflows that stitch together services such as AWS Lambda

Step functions are an orchestration layer for your functions. Simply put, a step function is a pipeline.

Although AWS already had a service called Simple Workflow (SWF) (which is not that simple), Amazon decided to fill the gap by developing Step Functions to help users orchestrate Lambda functions or activities.

Here is a link to get you started with Step Functions

AWS Step Functions allows you to coordinate individual tasks by expressing your workflow as a finite state machine

You design your state machine with a JSON-based language called ASL (Amazon States Language), which you can edit directly in the console. No drag and drop, nothing fancy. You start typing in the editor, and the state machine starts to appear in the right pane. A nice addition is the IntelliSense-style autocompletion: when you select a Task type, the UI suggests the available Lambda functions.

The following list contains the possible states that you can have in a step function definition.

  • Do some work in your state machine (a Task state)
  • Provide a delay for a certain amount of time or until a specified time/date (a Wait state)
  • Make a choice between branches of execution (a Choice state)
  • Stop an execution with a failure or success (a Fail or Succeed state)
  • Simply pass its input to its output or inject some fixed data (a Pass state)
  • Begin parallel branches of execution (a Parallel state)
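To make these state types concrete, here is a minimal sketch of a definition that uses a few of them, written as a Python dict and serialised to ASL JSON. The state names and the Lambda ARN are hypothetical placeholders, not the ones from our system.

```python
import json

# Hypothetical ARN; replace it with one of your own Lambda functions.
EXAMPLE_LAMBDA_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:example-task"

definition = {
    "Comment": "Minimal sketch using Pass, Task, Wait and Succeed states",
    "StartAt": "InjectDefaults",
    "States": {
        # Pass: inject fixed data into the state input.
        "InjectDefaults": {
            "Type": "Pass",
            "Result": {"language": "en"},
            "ResultPath": "$.defaults",
            "Next": "DoWork",
        },
        # Task: do some work in a Lambda function.
        "DoWork": {"Type": "Task", "Resource": EXAMPLE_LAMBDA_ARN, "Next": "CoolDown"},
        # Wait: delay for a fixed number of seconds.
        "CoolDown": {"Type": "Wait", "Seconds": 60, "Next": "Done"},
        # Succeed: stop the execution successfully.
        "Done": {"Type": "Succeed"},
    },
}


def undefined_targets(machine):
    """Return state names referenced by StartAt/Next that are not defined."""
    states = machine["States"]
    targets = [machine["StartAt"]]
    targets += [state["Next"] for state in states.values() if "Next" in state]
    return sorted({target for target in targets if target not in states})


# This JSON string is what you would paste into the editor.
asl_json = json.dumps(definition, indent=2)
```

The small `undefined_targets` helper is only a local sanity check for typos in state names; the console performs its own validation.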

A task can be either an Activity or a Lambda function. In our system, we only use Lambda functions. If you want to learn more about Activities, click here.

You can use the AWS APIs or the user interface to inspect executions.
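As a rough sketch of the API route, the function below lists recent failed executions of a state machine and pulls the error and cause from each execution history. It expects a boto3 Step Functions client to be passed in; the function name and the shape of the returned dicts are our own choices for illustration.

```python
def failed_execution_errors(sfn_client, state_machine_arn, max_executions=10):
    """Collect error details for recent FAILED executions of one state machine.

    `sfn_client` is expected to behave like boto3.client("stepfunctions").
    """
    response = sfn_client.list_executions(
        stateMachineArn=state_machine_arn,
        statusFilter="FAILED",
        maxResults=max_executions,
    )
    failures = []
    for execution in response["executions"]:
        # Newest events first, so the failure event is found quickly.
        history = sfn_client.get_execution_history(
            executionArn=execution["executionArn"],
            reverseOrder=True,
        )
        for event in history["events"]:
            if event["type"] == "ExecutionFailed":
                details = event.get("executionFailedEventDetails", {})
                failures.append(
                    {
                        "name": execution["name"],
                        "error": details.get("error"),
                        "cause": details.get("cause"),
                    }
                )
                break
    return failures


# Usage (requires AWS credentials and the boto3 package):
#   import boto3
#   failed_execution_errors(boto3.client("stepfunctions"), "<your state machine ARN>")
```

This only reads the first page of results; for long histories you would follow the `nextToken` value returned by both calls.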

The following limits could be a showstopper for your system. At the time of writing, these are the important ones:

  • Maximum number of registered state machines: 10K
  • Maximum open executions: 1M
  • Maximum execution history size: 25K events
  • Maximum execution history retention time: 90 days
  • Maximum input or result data size for a task, state, or execution: 32,768 chars (we will get into detail of this limitation and how to solve it)

Check out the full list here.

Yelp recently transformed their monolithic subscription billing system with Step Functions.

Coca-Cola uses Step Functions to implement their vending machine payment processing loop.

The Guardian automates Subscription Fulfilment using Step Functions

  1. Pay only a very small amount per state transition; no extra charges, no reserved resources.
  2. Visual representation of the workflow
  3. Can glue Lambda functions together using Step Functions.
  4. 90 days of historical data for executions
  5. Each step is transparent. See input, output and exception info.
  6. Easy to implement retry, backoff and exception handling strategies
  7. Run multiple versions of a workflow simultaneously

From a 1 hour audio file to smaller chunks, aka segments

We use Step Functions to manage the audio processing workflow. Our Live Radio Management System generates a WAV file every hour, and once the file is uploaded to S3, we trigger a Lambda function that kick-starts the processing of the audio. The goal is to divide the audio file into smaller chunks, predict their topics, and serve them to end users. To accomplish that, two main step functions take care of the whole process:

  • Audio Processing Flow
  • Transcription

Step Functions retains execution results for up to 90 days, which means you can investigate the input, output and exceptions of every task for that long (subject to the maximum of 25K history events per execution). This transparency helps a lot during development and when investigating issues in your production environment.

You can query the AWS Step Functions API to get the details, or simply click through the AWS user interface. It’s such handy tooling!

Currently AWS Step Functions does not expose versioning of the state machine definition, but I bet there is a revision system behind the scenes that controls which execution uses which version.

As seen in the diagram, step functions allow us to chain multiple Lambda functions together, and in some cases we use simple “Choice” (if) statements to decide whether to call a Lambda function or not.

Each task in the step function can also be configured to retry automatically on specific errors, with a maximum number of attempts and a backoff strategy.
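A retry policy of this kind might look like the sketch below. The `Retry` and `Catch` fields are part of ASL; the state names, the ARN and the concrete numbers are hypothetical. The helper shows how the waiting time grows with the backoff rate.

```python
# Hypothetical ARN; replace it with one of your own Lambda functions.
TRANSCODE_LAMBDA_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:transcode"

retry_policy = {
    "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
    "IntervalSeconds": 5,  # wait before the first retry
    "MaxAttempts": 3,      # number of retries before giving up
    "BackoffRate": 2.0,    # multiplier applied to the wait on each retry
}

transcode_state = {
    "Type": "Task",
    "Resource": TRANSCODE_LAMBDA_ARN,
    "Retry": [retry_policy],
    # Once retries are exhausted, route any error to a failure-handling state.
    "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
    "Next": "Transcribe",
}


def retry_delays(policy):
    """Seconds waited before each retry attempt under the given policy."""
    return [
        policy["IntervalSeconds"] * policy["BackoffRate"] ** attempt
        for attempt in range(policy["MaxAttempts"])
    ]
```

With these numbers, the task is retried after 5, 10 and 20 seconds before the `Catch` clause takes over.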

For example, if the input JSON contains the “media” file, we don’t need to transcode the audio, so we skip that step and jump to the next one in the flow. The same applies to transcription.
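A Choice state for this skip could be sketched as follows. It uses the `IsPresent` comparison, which was added to ASL after this article was first written, and the state names "Transcode" and "Transcribe" are assumptions. The tiny evaluator only exists to try the rule locally; it is not how Step Functions itself evaluates choices.

```python
skip_transcode_choice = {
    "Type": "Choice",
    "Choices": [
        # If the input already has a "media" entry, skip transcoding.
        {"Variable": "$.media", "IsPresent": True, "Next": "Transcribe"}
    ],
    "Default": "Transcode",
}


def next_state(choice_state, state_input):
    """Local evaluator that only understands top-level IsPresent rules."""
    for rule in choice_state["Choices"]:
        field = rule["Variable"][2:]  # strip the leading "$."
        if (field in state_input) == rule["IsPresent"]:
            return rule["Next"]
    return choice_state["Default"]
```

For an input such as `{"media": "s3://bucket/audio.mp3"}` the evaluator returns "Transcribe"; without the "media" key it falls through to "Transcode".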

Waiting! This is the one case where step functions really shine: use cases where we expect some other party (a service, API or tool) to complete a task. In theory, the financial cost of this wait time to our system is close to zero, because no server is running while we wait for the other party to finish the work.

In practice, we pay $0.000025 for each transition between our steps. Thus, if the service we expect the data from takes 20 minutes to complete and we query its state once every minute, with two state transitions per query, the cost is:

$0.000025 × 20 × 2 = $0.001

We could also lower this cost by waiting longer before entering the polling loop, e.g. wait at least 15 minutes and then start looping, if that is worthwhile for your use case. For example, Coca-Cola uses step functions to implement their vending machine payment processing loops.
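The arithmetic can be sketched in a few lines. The price per transition is the one quoted above. Two transitions per poll is our reading of the "× 2" factor, and counting the initial wait as a single extra transition is an assumption.

```python
from decimal import Decimal

PRICE_PER_TRANSITION = Decimal("0.000025")  # USD, as quoted in the text


def polling_cost(total_minutes, poll_interval_minutes=1,
                 initial_wait_minutes=0, transitions_per_poll=2):
    """Estimated transition cost of polling an external job until it is done."""
    polled_minutes = max(total_minutes - initial_wait_minutes, 0)
    # Ceiling division: a partial interval still costs one full poll.
    polls = -(-polled_minutes // poll_interval_minutes)
    # Assumption: the initial wait is one Wait state, i.e. one transition.
    initial_transitions = 1 if initial_wait_minutes else 0
    return PRICE_PER_TRANSITION * (polls * transitions_per_poll + initial_transitions)
```

Under these assumptions, `polling_cost(20)` reproduces the $0.001 figure, while `polling_cost(20, initial_wait_minutes=15)` comes to $0.000275. Pass a different `transitions_per_poll` if your loop visits more states per round.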

This is a perfect way to integrate with non-event-driven systems, where there is no way to signal the caller and the service expects you to poll to find out the result of your request.
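One common shape for such a polling loop is Wait, then a status-check Task, then a Choice that either loops back or moves on. The sketch below is an illustration under assumptions, not our actual definition: the state names, the ARN and the status values are hypothetical. The `walk` helper follows the loop locally with canned statuses.

```python
# Hypothetical ARN of a Lambda function that returns the job status as a string.
CHECK_JOB_LAMBDA_ARN = "arn:aws:lambda:eu-west-1:123456789012:function:check-job"

polling_states = {
    "WaitForJob": {"Type": "Wait", "Seconds": 60, "Next": "CheckJob"},
    "CheckJob": {
        "Type": "Task",
        "Resource": CHECK_JOB_LAMBDA_ARN,
        "ResultPath": "$.job_status",
        "Next": "IsJobDone",
    },
    "IsJobDone": {
        "Type": "Choice",
        "Choices": [
            {"Variable": "$.job_status", "StringEquals": "COMPLETED", "Next": "UseResult"},
            {"Variable": "$.job_status", "StringEquals": "FAILED", "Next": "JobFailed"},
        ],
        "Default": "WaitForJob",  # anything else: keep waiting
    },
    "UseResult": {"Type": "Succeed"},
    "JobFailed": {"Type": "Fail", "Error": "JobFailed", "Cause": "External job failed"},
}


def walk(states, start, statuses):
    """Follow the loop locally, consuming one canned status per CheckJob visit."""
    statuses = iter(statuses)
    current, data, path = start, {}, []
    while True:
        state = states[current]
        path.append(current)
        if state["Type"] in ("Succeed", "Fail"):
            return path
        if state["Type"] == "Task":
            data["job_status"] = next(statuses)
        if state["Type"] == "Choice":
            current = next(
                (rule["Next"] for rule in state["Choices"]
                 if data.get("job_status") == rule["StringEquals"]),
                state["Default"],
            )
        else:
            current = state["Next"]
```

Note that this particular shape visits three states per round (wait, check, choice), which matters when you count billed transitions.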

Each Lambda function that is part of the workflow returns its input as its output, with its own data added to the result. Thus each function can transparently discover what the other functions have done to the same input. For example, the transcription Lambda adds “transcription_job” as a dictionary entry to the output. Any other function that requires the data can simply look it up in the input JSON.

As a result, each call passes more information than the function actually needs; however, each Lambda function only depends on the fields it needs to operate. The overall structure of the input can change drastically as long as what the Lambda function needs still exists in the input JSON.
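A handler following this convention could look roughly like the sketch below. The "transcription_job" key comes from the description above; the "audio" key and the contents of the job entry are hypothetical, and the call to the real transcription service is left out.

```python
def transcription_handler(event, context=None):
    """Return the input unchanged, plus this function's own result entry."""
    # The only field this function depends on (hypothetical key name).
    audio_location = event["audio"]

    # Shallow copy so every other entry passes through untouched.
    output = dict(event)

    # Placeholder for the result of starting a real transcription job.
    output["transcription_job"] = {
        "source": audio_location,
        "status": "IN_PROGRESS",
    }
    return output
```

Because the handler reads only `event["audio"]`, other functions can add or rename entries elsewhere in the JSON without breaking it.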

The first example is the JSON input that starts our audio workflow. It contains three major items:

  • configuration/orchestration entries
  • settings ( transcribe -> [Yes/No], bucket -> [BUCKETNAME])
  • audio file location

Since this was a replay activity, we already had the transcription for the audio file. Hence we pass that information along, and the responsible Lambda function skips the transcription and uses it instead.

You might be wondering what those s3-data-file entries are. Let me elaborate on that. The AWS Step Functions limits page describes the limit as follows:

Maximum input or result data size for a task, state, or execution
32,768 characters. This limit affects tasks (activity or Lambda function), state or execution result data, and input data when scheduling a task, entering a state, or starting an execution.

One of the key limitations of Step Functions is that the JSON passed between states cannot exceed this character limit, so you should keep it in mind for whatever you input or output. Our solution is to return an S3 file location that contains the result of the Lambda function. You may also apply a double-decker strategy: place some part of the information directly in the output and the rest of the result in an S3 file.
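The idea can be sketched as a pair of helpers: one that writes an oversized result to S3 and returns a pointer, and one that resolves the pointer again. The "s3-data-file" key name follows the entries mentioned above, but the pointer's inner shape and both function names are assumptions. `s3_client` is expected to behave like `boto3.client("s3")`.

```python
import json

MAX_PAYLOAD_CHARS = 32768  # limit quoted from the Step Functions limits page


def offload_if_too_large(result, s3_client, bucket, key, limit=MAX_PAYLOAD_CHARS):
    """Return `result` as-is if it fits, otherwise store it in S3 and return a pointer."""
    body = json.dumps(result)
    if len(body) <= limit:
        return result
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return {"s3-data-file": {"bucket": bucket, "key": key}}


def load_result(payload, s3_client):
    """Resolve a pointer created by offload_if_too_large, if there is one."""
    pointer = payload.get("s3-data-file")
    if pointer is None:
        return payload
    obj = s3_client.get_object(Bucket=pointer["bucket"], Key=pointer["key"])
    return json.loads(obj["Body"].read())
```

A downstream Lambda function would call `load_result` on its input before doing any work, so it does not need to care whether the previous step returned the data inline or through S3.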