Multi Class Text Classification With Deep Learning Using BERT

By Susan Li


Most researchers submit their research papers to academic conferences because it is a faster way of making their results available. Finding and selecting a suitable conference has always been challenging, especially for young researchers.

However, based on data from previous conference proceedings, researchers can increase their chances of paper acceptance and publication. We will try to solve this text classification problem with deep learning, using BERT.

Almost all of the code was taken from this tutorial; the only difference is the data.

The Data

The data set contains 2,507 research paper titles that have been manually classified into 5 categories (i.e. conferences). It can be downloaded from here.

Explore and Preprocess

conf_explore.py
Table 1
df['Conference'].value_counts()
Figure 1
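
A minimal sketch of what conf_explore.py does. The CSV file name (title_conference.csv) and the Title column name are assumptions; the Conference column appears in the article.

import pandas as pd

# Load the research paper titles; the file name is an assumption.
df = pd.read_csv('title_conference.csv')
df.head()

# Count how many titles belong to each conference (the distribution behind Figure 1).
df['Conference'].value_counts()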

You may have noticed that our classes are imbalanced, and we will address this later on.

Encoding the Labels

label_encoding.py
df['label'] = df.Conference.replace(label_dict)
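
A sketch of label_encoding.py: build a dictionary that maps each conference name to an integer id and add it to the data frame as a new label column.

# Map each conference name to an integer id.
possible_labels = df.Conference.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

# Add the numeric label column used for training.
df['label'] = df.Conference.replace(label_dict)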

Train and Validation Split

Because the labels are imbalanced, we split the data set in a stratified fashion, using the labels as the stratification classes.

Our label distribution will look like this after the split.

train_test_split.py
Figure 2
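
A sketch of train_test_split.py. The 15% validation size and the random seed are my assumptions, not values stated in the article; the stratify argument keeps the class proportions the same in both splits.

from sklearn.model_selection import train_test_split

# Stratified split: the label column is used as the class labels,
# so each conference keeps roughly the same share in train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,       # assumed validation fraction
    random_state=42,      # assumed seed
    stratify=df.label.values,
)

# Record which split each row belongs to and inspect the distribution (Figure 2).
df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'
df.groupby(['Conference', 'label', 'data_type']).count()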

BertTokenizer and Encoding the Data

Tokenization is the process of taking raw text and splitting it into tokens, which are then mapped to numeric ids that represent words or word pieces.

  • BertTokenizer constructs a BERT tokenizer, based on WordPiece.
  • We instantiate the tokenizer from the pre-trained bert-base-uncased model to encode our data.
  • To convert all the titles from text into encoded form, we use a function called batch_encode_plus, and we process the training and validation data separately.
  • The first parameter of that function is the title text.
  • add_special_tokens=True means the sequences will be encoded with the special tokens ([CLS] and [SEP]) that the model expects.
  • When batching sequences together, we set return_attention_mask=True so the tokenizer returns attention masks, which tell the model which tokens are real and which are padding.
  • We also want to pad all the titles to a certain maximum length.
  • We do not actually need max_length=256 for titles this short, but we set it to play it safe.
  • return_tensors='pt' returns PyTorch tensors.
  • We then split the encoded data into input_ids, attention_masks and labels.
  • Finally, once we have the encoded data set, we can create the training and validation data, as sketched below.
tokenizer_encoding.py
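
A sketch of tokenizer_encoding.py under the assumptions above. It uses the older transformers API (pad_to_max_length) and assumes a Title column plus the data_type column from the split step.

import torch
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

# WordPiece tokenizer matching the bert-base-uncased checkpoint.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Encode the training titles; the validation titles are encoded the same way.
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].Title.values,
    add_special_tokens=True,      # add [CLS] and [SEP]
    return_attention_mask=True,   # mask distinguishes real tokens from padding
    pad_to_max_length=True,       # pad every title to max_length
    max_length=256,
    return_tensors='pt',          # return PyTorch tensors
)

# Split the encoding into input ids, attention masks and labels.
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type == 'train'].label.values)

# The final training data set; dataset_val is built the same way from the validation rows.
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)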

BERT Pre-trained Model

We are treating each title as a unique sequence, so each sequence will be classified into one of the five labels (i.e. conferences).

  • bert-base-uncased is the smaller, lower-cased variant of the pre-trained BERT models.
  • Using num_labels to indicate the number of output labels.
  • We don’t really care about output_attentions.
  • We also don’t need output_hidden_states.
BERT_pretrained_model.py
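
A sketch of BERT_pretrained_model.py: load the bert-base-uncased checkpoint with a classification head sized to our five conferences.

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_dict),   # five conferences
    output_attentions=False,      # we do not need the attention weights
    output_hidden_states=False,   # nor the hidden states
)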

Data Loaders

  • DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset.
  • We use RandomSampler for training and SequentialSampler for validation.
  • Given the limited memory in my environment, I set batch_size=3.
data_loaders.py
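
A sketch of data_loaders.py, assuming the dataset_train and dataset_val TensorDatasets from the tokenization step: random sampling for training, sequential sampling for validation, with the small batch size mentioned above.

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 3  # small because of limited memory

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size,
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),
    batch_size=batch_size,
)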
  • To construct an optimizer, we have to give it an iterable containing the parameters to optimize. Then, we can specify optimizer-specific options such as the learning rate, epsilon, etc.
  • I found epochs=5 works well for this data set.
  • Create a schedule with a learning rate that decreases linearly from the initial learning rate set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial learning rate set in the optimizer.
optimizer_scheduler.py
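
A sketch of optimizer_scheduler.py, using the AdamW optimizer shipped with transformers. The learning rate and epsilon are typical BERT fine-tuning values and are my assumptions, not taken from the article.

from transformers import AdamW, get_linear_schedule_with_warmup

epochs = 5

optimizer = AdamW(
    model.parameters(),
    lr=1e-5,     # assumed learning rate
    eps=1e-8,    # assumed epsilon
)

# Linear warmup (here: no warmup steps) followed by a linear decay to 0.
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train) * epochs,
)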

We will use the F1 score and accuracy per class as performance metrics.

performance_metrics.py
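
A sketch of performance_metrics.py: a weighted F1 score over all classes and a helper that prints accuracy separately for each conference (it assumes the label_dict built during label encoding).

import numpy as np
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    # Weighted F1 across the five classes.
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    # Report accuracy for each conference separately.
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat == label]
        y_true = labels_flat[labels_flat == label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds == label])}/{len(y_true)}\n')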

Training Loop

training_loop.py
Figure 3
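
A sketch of training_loop.py: a standard PyTorch fine-tuning loop with gradient clipping and a checkpoint saved after every epoch. The seed value and checkpoint file names are assumptions.

import random
from tqdm import tqdm

# Fix the random seeds for reproducibility (seed value is an assumption).
seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(1, epochs + 1):
    model.train()
    loss_train_total = 0

    for batch in tqdm(dataloader_train, desc=f'Epoch {epoch}'):
        model.zero_grad()
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2],
        }
        outputs = model(**inputs)
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        # Clip gradients to stabilise training, then step optimizer and scheduler.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    # Save a checkpoint after every epoch (file name is an assumption).
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
    print(f'Epoch {epoch} average training loss: '
          f'{loss_train_total / len(dataloader_train):.4f}')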
loading_evaluating.py
Figure 4
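
A sketch of loading_evaluating.py: reload a saved checkpoint, run it over the validation loader, and print the accuracy per class. The checkpoint file name matches the assumption made in the training sketch.

def evaluate(dataloader):
    model.eval()
    loss_val_total = 0
    predictions, true_vals = [], []

    for batch in dataloader:
        batch = tuple(b.to(device) for b in batch)
        inputs = {
            'input_ids': batch[0],
            'attention_mask': batch[1],
            'labels': batch[2],
        }
        with torch.no_grad():
            outputs = model(**inputs)

        loss, logits = outputs[0], outputs[1]
        loss_val_total += loss.item()
        predictions.append(logits.detach().cpu().numpy())
        true_vals.append(inputs['labels'].cpu().numpy())

    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
    return loss_val_total / len(dataloader), predictions, true_vals

# Load the last checkpoint and report accuracy for each conference (Figure 4).
model.load_state_dict(
    torch.load('finetuned_BERT_epoch_5.model', map_location=device)
)
_, predictions, true_vals = evaluate(dataloader_val)
accuracy_per_class(predictions, true_vals)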

The Jupyter notebook can be found on GitHub. Enjoy the rest of the weekend!