Most researchers submit their papers to academic conferences because it is a faster way of making their results available. Finding and selecting a suitable conference has always been challenging, especially for young researchers.
However, using data from previous conference proceedings, researchers can increase their chances of acceptance and publication. We will tackle this text classification problem with deep learning using BERT.
Almost all of the code was taken from this tutorial; the only difference is the data.
Explore and Preprocess
You may have noticed that our classes are imbalanced, and we will address this later on.
Encoding the Labels
df['label'] = df.Conference.replace(label_dict)
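The line above assumes a label_dict mapping conference names to integers. A minimal sketch of how it might be built (the column names come from the snippet; the conference names and titles here are hypothetical placeholders):

```python
import pandas as pd

# Hypothetical sample of the data set: paper titles and their target conference
df = pd.DataFrame({
    'Title': ['A routing scheme for sensor networks',
              'Query optimization in column stores',
              'Congestion control in wireless networks'],
    'Conference': ['INFOCOM', 'VLDB', 'INFOCOM'],
})

# Map each distinct conference name to an integer index
possible_labels = df.Conference.unique()
label_dict = {label: index for index, label in enumerate(possible_labels)}

df['label'] = df.Conference.replace(label_dict)
print(label_dict)  # {'INFOCOM': 0, 'VLDB': 1}
```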
Train and Validation Split
Because the labels are imbalanced, we split the data set in a stratified fashion, using the labels as the stratification classes.
Our label distribution will look like this after the split.
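A stratified split can be sketched with scikit-learn's train_test_split; the toy labels below stand in for the imbalanced conference labels:

```python
from sklearn.model_selection import train_test_split

# Hypothetical: X = row indices of the DataFrame, y = integer labels
X = list(range(10))
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # imbalanced: six 0s vs four 1s

# stratify=y keeps the label proportions identical in both splits
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y)

# Each half keeps the 6:4 ratio: three 0s and two 1s
print(y_train.count(0), y_train.count(1))  # 3 2
```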
BertTokenizer and Encoding the Data
Tokenization is the process of taking raw text and splitting it into tokens, which are then mapped to numeric IDs that represent words (or pieces of words).
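The idea can be illustrated with a toy vocabulary (this is not the real WordPiece algorithm, just the text-to-tokens-to-IDs flow; the vocabulary and IDs are made up):

```python
# Toy vocabulary mapping tokens to numeric IDs
vocab = {'[CLS]': 101, '[SEP]': 102,
         'deep': 7, 'learning': 8, 'for': 9, 'text': 10}

def toy_encode(text):
    # Split into tokens, wrap with BERT's special tokens, then look up IDs
    tokens = ['[CLS]'] + text.lower().split() + ['[SEP]']
    return [vocab[token] for token in tokens]

print(toy_encode('Deep learning for text'))  # [101, 7, 8, 9, 10, 102]
```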
- Constructs a BERT tokenizer. Based on WordPiece.
- Instantiate a pre-trained BERT model configuration to encode our data.
- To convert all the titles from text into encoded form, we use the tokenizer's batch_encode_plus method, and we process the training and validation data separately.
- The first parameter of the above method is the title text.
- add_special_tokens=True means the sequences will be encoded with the special tokens relative to their model.
- When batching sequences together, we set return_attention_mask=True, so the tokenizer returns an attention mask that distinguishes real tokens from padding.
- We also want to pad all the titles to a certain maximum length.
- We actually do not need to set max_length=256; it is just to play it safe.
- return_tensors='pt' to return PyTorch tensors.
- And then we need to split the encoded data into input IDs, attention masks, and labels.
- Finally, after we get the encoded data set, we can create the training data and validation data.
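The last two steps can be sketched as follows. The tensors below are hypothetical stand-ins for what batch_encode_plus returns with return_tensors='pt' (a dict holding input_ids and attention_mask of shape num_titles x max_length):

```python
import torch
from torch.utils.data import TensorDataset

# Stand-ins for encoded_data_train['input_ids'] and ['attention_mask']
input_ids_train = torch.tensor([[101, 7, 8, 102],
                                [101, 9, 102, 0]])
attention_masks_train = torch.tensor([[1, 1, 1, 1],
                                      [1, 1, 1, 0]])
labels_train = torch.tensor([0, 1])

# Bundle the three tensors so each index yields (ids, mask, label)
dataset_train = TensorDataset(input_ids_train,
                              attention_masks_train,
                              labels_train)
print(len(dataset_train))  # 2, one entry per title
```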
BERT Pre-trained Model
We are treating each title as its own unique sequence, so each sequence will be classified into one of the five labels (i.e., conferences).
- bert-base-uncased is a smaller pre-trained model.
- We use num_labels to indicate the number of output labels.
- We don’t really care about output_attentions.
- We also don’t need output_hidden_states.
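A sketch of the model setup. To stay runnable without downloading weights, this builds a tiny randomly initialised BertConfig instead of calling BertForSequenceClassification.from_pretrained('bert-base-uncased', ...); the num_labels mechanics are the same either way (the config sizes are arbitrary):

```python
from transformers import BertConfig, BertForSequenceClassification

# Tiny illustrative config; the tutorial instead loads 'bert-base-uncased'
config = BertConfig(hidden_size=32,
                    num_hidden_layers=1,
                    num_attention_heads=2,
                    intermediate_size=64,
                    num_labels=5)  # five conferences
model = BertForSequenceClassification(config)

# The classification head emits one logit per conference
print(model.classifier.out_features)  # 5
```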
DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset.
- We use RandomSampler for training and SequentialSampler for validation.
- Given the limited memory in my environment, I set a small batch size.
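The DataLoader setup might look like this; the tiny stand-in tensors and the batch size of 3 are illustrative:

```python
import torch
from torch.utils.data import (DataLoader, RandomSampler,
                              SequentialSampler, TensorDataset)

# Hypothetical tiny datasets standing in for the encoded titles
dataset_train = TensorDataset(torch.arange(12).reshape(6, 2),
                              torch.zeros(6, dtype=torch.long))
dataset_val = TensorDataset(torch.arange(8).reshape(4, 2),
                            torch.ones(4, dtype=torch.long))

batch_size = 3  # small, to fit limited memory

# RandomSampler shuffles training data each epoch;
# SequentialSampler keeps validation order fixed.
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=batch_size)
dataloader_val = DataLoader(dataset_val,
                            sampler=SequentialSampler(dataset_val),
                            batch_size=batch_size)

print(len(dataloader_train))  # 2 batches (6 examples / batch size 3)
```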
- To construct an optimizer, we have to give it an iterable containing the parameters to optimize. Then, we can specify optimizer-specific options such as the learning rate, epsilon, etc.
- I found epochs=5 works well for this data set.
- Create a schedule with a learning rate that decreases linearly from the initial learning rate set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial learning rate set in the optimizer.
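The optimizer and schedule described above might be wired up as below. The learning rate, epsilon, and step counts are illustrative, a tiny linear layer stands in for the BERT model, and torch.optim.AdamW is used as the optimizer:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Tiny stand-in model instead of the BERT model
model = torch.nn.Linear(4, 5)

# Pass the iterable of parameters, then optimizer-specific options
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, eps=1e-8)

epochs = 5
steps_per_epoch = 10  # hypothetical: len(dataloader_train)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=epochs * steps_per_epoch)

# With no warmup steps, the learning rate starts at 1e-5
# and decays linearly to 0 over the training steps
print(scheduler.get_last_lr())  # [1e-05]
```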
We will use the F1 score and accuracy per class as performance metrics.
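A sketch of computing both metrics with scikit-learn, on hypothetical predictions for three classes (the real task has five):

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical predicted and true labels
preds = np.array([0, 0, 1, 1, 2, 2, 2])
labels = np.array([0, 1, 1, 1, 2, 2, 0])

# Weighted F1 averages per-class F1 scores by class support,
# which accounts for the class imbalance
f1 = f1_score(labels, preds, average='weighted')
print(f'{f1:.3f}')

# Accuracy per class: fraction of each true class predicted correctly
for cls in np.unique(labels):
    mask = labels == cls
    acc = (preds[mask] == cls).mean()
    print(f'Class {cls}: {acc:.2f} ({(preds[mask] == cls).sum()}/{mask.sum()})')
```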