Auto-Generating Tags for Content using Amazon SageMaker BlazingText with fastText

By Yi Ai

Multi-label text classification is one of the fundamental tasks in Natural Language Processing (NLP). In this post, we will build a fastText model that predicts tags for text content and then deploy the model on SageMaker Hosting Services.

We will use a SageMaker notebook instance to create and manage Jupyter notebooks, so that we can prepare and process the data and then train and deploy our model.

We will use the 10% sample of the Stack Overflow Q&A dataset. The dataset includes:

  • Questions contain the title, body, creation date, closed date (if applicable), score, and owner ID for all non-deleted Stack Overflow questions whose Id is a multiple of 10.
  • Answers contain the body, creation date, score, and owner ID for each of the answers to these questions. The ParentId column links back to the Questions table.
  • Tags contain the tags on each of these questions.

We only need Questions and Tags to train our model in this demo.
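To load the data in the notebook, a minimal sketch along the following lines works, assuming the dataset's Questions.csv and Tags.csv file names and a Latin-1 encoding; adjust the paths to wherever you downloaded the files:

import pandas as pd

# File names as provided in the "10% of Stack Overflow Q&A" dataset (assumption).
questions = pd.read_csv("Questions.csv", encoding="ISO-8859-1", usecols=["Id", "Title"])
tags = pd.read_csv("Tags.csv", encoding="ISO-8859-1", dtype={"Tag": str})

# Group all tags belonging to a question into one list, then join them onto the question.
grouped_tags = tags.groupby("Id")["Tag"].apply(list).reset_index()
data = questions.merge(grouped_tags, on="Id")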

The Amazon SageMaker built-in BlazingText algorithm supports text classification (supervised mode) and learning Word2Vec vectors (skip-gram, CBOW, and batch_skipgram modes). However, BlazingText doesn’t support multi-label classification (correct me if I’m wrong).

In our demo, multiple tags might apply to the same content, so instead of using BlazingText to build a model directly, we will train a text classification model with fastText and host the pre-trained model using BlazingText.

fastText 0.2.0 added the “OneVsAll” loss function for multi-label classification, which corresponds to the sum of binary cross-entropy losses computed independently for each label.

The following diagram is an overview of the entire process:

You can find the example notebook in my GitHub repo.

The input file is formatted so that each line contains a single sentence and its corresponding label(s), each prefixed by “__label__”, e.g.

__label__database __label__oracle How to edit sessions parameters on Oracle 10g

Our dataset consists of CSV files containing 19,999 questions and their related tags. To train our model, the input data has to be as clean as possible, so the following function generates clean, preprocessed training data by removing HTML tags and unwanted punctuation.
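A minimal sketch of such a preprocessing step is shown below. It builds on the merged DataFrame from the earlier snippet; the regular expressions and the 90/10 train/validation split are assumptions rather than the exact code from the repo:

import re

def preprocess(text):
    # Strip HTML tags, keep alphanumerics and a few technical symbols, collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-zA-Z0-9+#.\- ]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def write_fasttext_file(df, path):
    # One line per question: "__label__tag1 __label__tag2 ... cleaned title".
    with open(path, "w") as f:
        for _, row in df.iterrows():
            labels = " ".join("__label__" + tag for tag in row["Tag"])
            f.write(labels + " " + preprocess(row["Title"]) + "\n")

train = data.sample(frac=0.9, random_state=42)
validation = data.drop(train.index)
write_fasttext_file(train, "stackoverflow.train")
write_fasttext_file(validation, "stackoverflow.validation")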

Let’s download the most recent release:

!wget https://github.com/facebookresearch/fastText/archive/v0.2.0.zip
!unzip v0.2.0.zip
!cd fastText-0.2.0 && make

In this demo we will use SageMaker local mode to train the model; the notebook’s ml.t2.medium instance is powerful enough for text classification training, since fastText can train models without requiring a GPU.

The following command is used to train a model for text classification:

!cd fastText-0.2.0 && ./fasttext supervised -input "../stackoverflow.train" -output stackoverflow_model -lr 0.5 -epoch 25 -minCount 5 -wordNgrams 2 -loss ova

At the end of the training, a file stackoverflow_model.bin, containing the trained classifier, is created in the current directory.

Let’s test it on the validation data.

!cd fastText-0.2.0 && ./fasttext test stackoverflow_model.bin "../stackoverflow.validation"
N	531
P@1 0.566
R@1 0.215
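We can also spot-check individual sentences locally with fastText’s predict-prob command before deploying anything; the sample sentence here is arbitrary and the top-3 cutoff is just an example:

!cd fastText-0.2.0 && echo "how to refresh a page with jquery" | ./fasttext predict-prob stackoverflow_model.bin - 3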

Note that fastText has many different input parameters for training. If you’ve ever tried to tune your model’s accuracy, you will have seen that changing these parameters can change the model’s precision and recall dramatically. For more details, check out the official documentation here.

SageMaker can host models trained using BlazingText, as well as pre-trained models provided by fastText, for real-time inference. Let’s deploy our model stackoverflow_model.bin:
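A minimal deployment sketch using the SageMaker Python SDK (v1 API, current when this was written) is shown below. It packages the fastText binary as model.bin inside a model.tar.gz, which is the layout BlazingText hosting expects for pre-trained fastText models; the S3 prefix and instance type are assumptions:

import tarfile
import sagemaker
from sagemaker import get_execution_role
from sagemaker.amazon.amazon_estimator import get_image_uri

sess = sagemaker.Session()
role = get_execution_role()

# Package the pre-trained classifier the way BlazingText hosting expects it.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add("fastText-0.2.0/stackoverflow_model.bin", arcname="model.bin")

model_location = sess.upload_data("model.tar.gz", key_prefix="fasttext/model")
container = get_image_uri(sess.boto_region_name, "blazingtext", "latest")

model = sagemaker.Model(model_data=model_location, image=container,
                        role=role, sagemaker_session=sess)
model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")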

Once the endpoint is deployed, it accepts application/json as the content type for inference. The payload should contain the sentences to classify under the key "instances".
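For example, from the notebook the endpoint can be invoked with boto3 along these lines (a sketch; substitute your own endpoint name, and note that the top-3 "configuration" is optional):

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {"instances": ["How can I refresh a page with jQuery"],
           "configuration": {"k": 3}}  # return the top-3 labels with probabilities

response = runtime.invoke_endpoint(EndpointName="blazingtext-2019-06-13-01-41-24-632",
                                   ContentType="application/json",
                                   Body=json.dumps(payload))
print(json.loads(response["Body"].read()))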

As expected, we get tags with probabilities for each of the sentences. We can now call the SageMaker model endpoint through Amazon API Gateway.

javascript, 0.8203935623168945 
jquery, 0.7201416492462158
firefox, 0.01690844297409057

We can view this endpoint in the Amazon SageMaker console. The default endpoint name looks like this: “blazingtext-2019-06-13-01-41-24-632”.

Now we have a SageMaker model endpoint. Let’s call it from Lambda and API Gateway. First, install the Serverless Framework and create a new service from the aws-python3 template:

$ sls create --template aws-python3 --path text-classification

The directory that is created includes two files: handler.py, which contains the Lambda function, and serverless.yml, which contains the service configuration.

Paste the following code into your serverless.yml:
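The sketch below is a minimal configuration that exposes a POST /tag endpoint and lets the function invoke the SageMaker endpoint; the region, runtime version, and endpoint name are placeholders taken from the examples in this post, so adjust them to your own setup:

service: text-classification

provider:
  name: aws
  runtime: python3.7
  region: ap-southeast-2
  environment:
    ENDPOINT_NAME: blazingtext-2019-06-13-01-41-24-632  # your SageMaker endpoint name
  iamRoleStatements:
    - Effect: Allow
      Action:
        - sagemaker:InvokeEndpoint
      Resource: "*"

functions:
  tag:
    handler: handler.tag
    events:
      - http:
          path: tag
          method: post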

This is what your handler.py file should look like right now:
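Again, a minimal sketch; the function name tag matches the serverless.yml above, and the actual handler in the repo may differ:

import json
import os

import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]

def tag(event, context):
    # Forward the sentence from the request body to the SageMaker endpoint.
    body = json.loads(event["body"])
    payload = {"instances": [body["sentence"]], "configuration": {"k": 3}}

    response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType="application/json",
                                       Body=json.dumps(payload))
    predictions = json.loads(response["Body"].read())

    return {"statusCode": 200, "body": json.dumps(predictions)}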

To deploy your API, run the following command:

$ serverless deploy -v

We can now invoke the serverless API endpoint using curl:

$ curl -d '{"sentence":"How can I refresh a page with jQuery"}' -H "Content-Type: application/json" -X POST https://xxxxxx.execute-api.ap-southeast-2.amazonaws.com/dev/tag

That’s about it! I hope you have found this article useful. You can find the complete project in my GitHub repo.