This part of the workflow calls the library to summarize the text. This post from Sam Shleifer describes how the BART model works and compares the performance of different text generation techniques (seq2seq vs. GPT-2). Sam is a research engineer at Hugging Face, a company that develops state-of-the-art natural language processing (NLP) technologies. My key takeaway from his post is this piece of Python code that does the summarization:
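The original snippet isn't reproduced here, so below is a minimal sketch of the same idea using the Hugging Face transformers pipeline API. The model name matches the one discussed in this post; the length limits and the helper names are illustrative assumptions:

```python
def load_summarizer():
    # Heavy import kept inside the function so the ~1.5GB model is only
    # downloaded and loaded when summarization is actually needed.
    from transformers import pipeline
    return pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(text, summarizer=None):
    # `summarizer` is injectable so the logic can be exercised without a GPU.
    summarizer = summarizer or load_summarizer()
    result = summarizer(text, max_length=130, min_length=30, do_sample=False)
    return result[0]["summary_text"]
```

The pipeline returns a list of dicts with a `summary_text` key, which is all the calling code needs to unwrap.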
This code shows that a ready-made model is already available; we simply feed it the text to summarize. The model was pretrained on a huge text dataset and fine-tuned to summarize news articles, so it performs best on news articles compared to other types of text.
It wasn’t immediately obvious to me, but a GPU is required for reasonably fast inference. What’s more, the model is quite large (the “bart-large-cnn” model is 1.5GB), so I needed to ensure it was pre-loaded before performing inference. At this point I turned to AWS for a souped-up cloud machine (EC2) configured with a GPU, and ended up using an instance with an Nvidia T4 GPU (more on this below).
While I could set all of this up manually via the AWS Console, I deferred the work to an open source platform called Cortex. Cortex simplifies the process of deploying machine learning models to the cloud, in this case AWS. Below I have rewritten the previous BART code to use Cortex:
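The rewritten code isn't shown above, so here is a sketch of what a Cortex predictor of that era looked like: a `PythonPredictor` class whose constructor loads the model once at startup and whose `predict` method handles each request. The payload shape (`{"text": ...}`) and generation parameters are assumptions:

```python
class PythonPredictor:
    # Cortex instantiates this class once when the API starts, so the
    # large model is pre-loaded before any requests arrive.
    def __init__(self, config):
        from transformers import pipeline
        # device=0 targets the first GPU; -1 would fall back to CPU.
        self.summarizer = pipeline(
            "summarization", model="facebook/bart-large-cnn", device=0
        )

    def predict(self, payload):
        # payload is the parsed JSON body of the request, e.g. {"text": "..."}
        result = self.summarizer(
            payload["text"], max_length=130, min_length=30, do_sample=False
        )
        return result[0]["summary_text"]
```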
I chose a single-GPU VM with instance size g4dn.xlarge, as suggested in the Cortex documentation, but it might be possible to run this with other GPUs (e.g. a K80 in a P2 instance).
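The relevant part of the Cortex API configuration might look roughly like this (field names approximate that era of Cortex's YAML format, and the API name is an assumption):

```yaml
- name: summarizer
  predictor:
    type: python
    path: predictor.py
  compute:
    gpu: 1
```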
Then run `cortex deploy` to deploy the cluster on AWS. At the end of the deployment process you will get an AWS endpoint URL that serves as the interface to the summarizer model.
HTML DOM selector (Serverless)
To get accurate summaries, the correct text needs to be provided to the summarization service. This means excluding certain parts of the HTML and selecting only the section of the page that needs summarizing.
My current approach uses a mix of HTML node counting, elimination, and DOM selection. First I find the parent node with the most children, and then extract the text based on the selector (e.g. div#root section div.article). I’m still tweaking this selection logic, but it works on roughly 80% of the sites I’ve tested. I will cover this in greater detail in another post. For now let’s focus on the serverless platform where the HTML selection logic is hosted.
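To illustrate the node-counting part of that idea, here is a simplified sketch using Python's built-in HTML parser (the production logic runs elsewhere and is more involved; this assumes well-formed markup and ignores void elements like `<br>`):

```python
from html.parser import HTMLParser

class ChildCounter(HTMLParser):
    """Track how many direct child elements each element has."""
    def __init__(self):
        super().__init__()
        self.stack = []        # currently open elements as (tag, child_count)
        self.best = (None, 0)  # (tag, children) of the busiest element seen

    def handle_starttag(self, tag, attrs):
        if self.stack:
            # This element is a direct child of whatever is currently open.
            parent_tag, count = self.stack[-1]
            self.stack[-1] = (parent_tag, count + 1)
        self.stack.append((tag, 0))

    def handle_endtag(self, tag):
        if self.stack:
            closed = self.stack.pop()
            if closed[1] > self.best[1]:
                self.best = closed

def busiest_element(html):
    # Return the (tag, child_count) pair with the most direct children --
    # a crude proxy for "the node that holds the article body".
    parser = ChildCounter()
    parser.feed(html)
    return parser.best
```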
The selector logic runs as an AWS Lambda function, accessible via an API endpoint. AWS Lambda is one of many technologies that utilize a serverless architecture. Others include Google Cloud Functions and Azure Functions, to name a few. In a serverless architecture, code is written as an event-driven cloud-hosted function that runs without the need to maintain or manage a server. Here’s an example of a Lambda function written using the Serverless Framework. We pass a “url” parameter to it:
In comparison, here is a node.js hello world app.
One major difference between the two is that serverless functions exit as soon as they finish executing, whereas the node.js example above runs indefinitely until you force it to exit (Ctrl-C). This is a design choice rather than an inherent advantage or disadvantage; I will touch on it briefly at the end of this post.
In AWS, as on many platforms, configurability can introduce complexity to an otherwise straightforward task. One example is the need to manually set up an API Gateway to expose a Lambda endpoint. The public endpoint is optional, because AWS resources can operate entirely within the confines of the AWS environment. However, setting one up by hand for every project that needs a public endpoint quickly becomes time-consuming.
This is where the Serverless Framework comes in. Serverless (https://serverless.com) is a set of open source tools and services to help you manage and deploy cloud functions. After installation, there are two CLI commands to execute to get a hello world Lambda template up and running in AWS:
sls create --template aws-nodejs
sls deploy
To define an AWS endpoint for the summarize function, I simply add this to my serverless.yaml file under “functions” and run `sls deploy` again:
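A sketch of what that entry might look like (the handler and path names are assumptions):

```yaml
functions:
  summarize:
    handler: handler.summarize
    events:
      - http:
          path: summarize
          method: post
```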
After a successful deployment, the AWS endpoint URL will be returned via the serverless CLI.
Queue Service (AWS SQS)
The summarization part of this project (HTML DOM selector and Cortex) can take between 2 and 10 seconds to reply to a summarization request, depending on the amount of text in the webpage. This becomes problematic for clients with timeout requirements outside of this window.
Slack, for example, times out after 3 seconds. I’ve observed that after 3 seconds, Slack retries the request to the summarizer service until it gets a response within that time frame.
So if a user types “@tldr https://someurl” and the summary service fails to reply within 3 seconds, Slack retries multiple times on the user’s behalf. Once the summary service catches up, it sends a reply for each of those requests, resulting in the echo effect in the image above.
To fix this, I used a queue service — AWS Simple Queue Service (SQS).
A queue allows us to hand over the processing to the summary service. This is done by sending the request to the queue and immediately replying to the Slack user with empty text.
Here’s a Serverless function that handles the Slack code:
In order for another Lambda function to receive the notifications, SQS can be configured as an event source for that Lambda function. In this case, the receiver is the HTML DOM Selector serverless function.
Setting this up via the Serverless Framework is straightforward. In the serverless.yaml config file, add the ARN of your queue under the “sqs” directive for the function you want to designate as the receiver of the message payload:
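Something along these lines (the function name is illustrative and the ARN is a placeholder):

```yaml
functions:
  selector:
    handler: handler.selector
    events:
      - sqs:
          arn: arn:aws:sqs:us-east-1:123456789012:summarize-queue
```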
With this, our Lambda function gets triggered whenever a payload is sent to SQS (under the hood, Lambda actually polls the queue on our behalf, but that’s a detail for another time):
Slack App (Serverless)
The bot allows the user to send a URL to the summary service. While there are a variety of bot platforms and providers we can use, the general setup is the same — the bot client communicates with a backend app that you create. The backend receives the user message and builds a response from it.
There are two aspects of bot development that I’ll briefly cover here.
- The bot platform’s SDK
- The bot platform’s API
A good example of the importance of the SDK is in filtering bot event notifications sent to your backend. Instead of manually filtering them yourself with code (e.g. was @tldr mentioned, or was the message intended for someone else?), you can rely on the SDK to do the heavy lifting for you. Here is an example from the Slack node.js SDK showing direct messages to the bot handled with a few lines of code (used in combination with event configuration in Slack):
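Since the snippet isn't reproduced here, below is a hedged reconstruction of the idea using Slack's Bolt framework for JavaScript (an assumed dependency): Bolt only invokes the callback for `app_mention` events, so no manual filtering is needed. The URL-unwrapping helper is my own addition, because Slack wraps links in angle brackets before delivering them:

```javascript
// Slack delivers mentions like "<@U012AB3CD> <https://example.com|example.com>",
// so the URL has to be unwrapped before it goes to the summarizer.
const extractUrl = (text) => {
  const match = text.match(/<(https?:\/\/[^>|]+)/);
  return match ? match[1] : null;
};

const startBot = async () => {
  const { App } = require("@slack/bolt"); // lazy require: only needed at runtime
  const app = new App({
    token: process.env.SLACK_BOT_TOKEN,
    signingSecret: process.env.SLACK_SIGNING_SECRET,
  });

  // The SDK routes only app_mention events here -- no manual filtering
  app.event("app_mention", async ({ event, say }) => {
    const url = extractUrl(event.text);
    if (url) await say(`Summarizing ${url}...`);
  });

  await app.start(3000);
};

module.exports = { extractUrl, startBot };
```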
A robust public REST API is equally important for many reasons, one of which is flexibility. In my case, I only needed to send a message back to the user after a long wait from the summary service. Instead of importing the SDK, I can directly call the Slack REST API:
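The call itself boils down to one authenticated POST to Slack's `chat.postMessage` Web API method. A minimal sketch (assuming node 18+ for the global `fetch`; the fetch implementation is injectable so the logic can be exercised without network access):

```javascript
// Post a message via Slack's Web API without importing the SDK.
const postMessage = async (token, channel, text, fetchImpl = fetch) => {
  const res = await fetchImpl("https://slack.com/api/chat.postMessage", {
    method: "POST",
    headers: {
      "Content-Type": "application/json; charset=utf-8",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ channel, text }),
  });
  return res.json(); // Slack replies with { ok: true, ... } on success
};

module.exports = { postMessage };
```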
The other benefit of a robust bot API is that third-party libraries can build on top of it. This allowed me to use this Slack Serverless Boilerplate code to handle OAuth flows and token management in AWS DynamoDB, among other things. Without it, I would have had to write my own auth-handling code and manually set up a database for each bot instance.