
Getting Started With NLP + Transformers

[Header image: squirrel grading essays, AI-generated using DALL·E]

Introduction

This tutorial covers natural language processing (NLP) using deep learning in Python for people who are completely new to the topic. I’ll be using the Hugging Face Transformers library (docs) with a pre-trained model and applying it to a current Kaggle competition: the English Language Learners competition. In this competition, the English language proficiency of essays written by 8th-12th grade English Language Learners is assessed. Using deep learning to give automated feedback like this can be useful: students get feedback faster and teachers spend less time grading.

This is part of the work I’ve done for Jeremy Howard’s fast.ai 2022 Part 1 course. If you haven’t heard of this course, I recommend checking it out as a great introduction to deep learning.

To follow along with this tutorial, create a Kaggle notebook linked to this competition so that you can access the dataset and online computing resources.

Data Exploration

Load in the training dataset to a pandas DataFrame, and print a list of columns to check out their non-null count and data types:

path = '../input/feedback-prize-english-language-learning/'

import pandas as pd
df = pd.read_csv(path + 'train.csv')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text_id      3911 non-null   object 
 1   full_text    3911 non-null   object 
 2   cohesion     3911 non-null   float64
 3   syntax       3911 non-null   float64
 4   vocabulary   3911 non-null   float64
 5   phraseology  3911 non-null   float64
 6   grammar      3911 non-null   float64
 7   conventions  3911 non-null   float64
dtypes: float64(6), object(2)
memory usage: 244.6+ KB

There are 3911 rows in the training set. Looks like there are no missing values — great! text_id and full_text are strings whereas the 6 essay scoring measures are floats.

Now I’m going to look at the distribution of values for the 6 scoring measures:

df.describe()

Looks like the average is roughly 3 points. Note that the scores range from 1.0 to 5.0 in increments of 0.5.
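
You can verify this with a quick check of the unique values in each column (a minimal sketch, listing the scoring columns by name):

score_cols = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']

# Each measure should only take the values 1.0, 1.5, ..., 5.0
for col in score_cols:
    print(col, sorted(df[col].unique()))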

Print out some sample rows to get an idea of what the dataset looks like:

df.sample(n=5, random_state=1)

You can see that text_id is a unique identifier for each student essay, and full_text is the column that contains the full text of each essay.

One-Hot Encoding

Each scoring measure can take on values from 1.0 to 5.0, in increments of 0.5. Since there is a limited number of possible values (as opposed to being continuous), I’m going to treat this as a classification problem. Accordingly, I need to perform one-hot encoding to make dummy variables representing each outcome (e.g., grammar_1.0, grammar_1.5, grammar_2.0, …). Each dummy variable takes the value 0 or 1 to denote whether the essay belongs to that category.

Make dummy variables:

df_original = df.copy()

target_cols = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']
df = pd.get_dummies(df, columns=target_cols, dtype='float64')

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 56 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   text_id          3911 non-null   object 
 1   full_text        3911 non-null   object 
 2   cohesion_1.0     3911 non-null   float64
 3   cohesion_1.5     3911 non-null   float64
 4   cohesion_2.0     3911 non-null   float64
 5   cohesion_2.5     3911 non-null   float64
 6   cohesion_3.0     3911 non-null   float64
 7   cohesion_3.5     3911 non-null   float64
 8   cohesion_4.0     3911 non-null   float64
 9   cohesion_4.5     3911 non-null   float64
 10  cohesion_5.0     3911 non-null   float64
 11  syntax_1.0       3911 non-null   float64
 12  syntax_1.5       3911 non-null   float64
 13  syntax_2.0       3911 non-null   float64
 14  syntax_2.5       3911 non-null   float64
 15  syntax_3.0       3911 non-null   float64
 16  syntax_3.5       3911 non-null   float64
 17  syntax_4.0       3911 non-null   float64
 18  syntax_4.5       3911 non-null   float64
 19  syntax_5.0       3911 non-null   float64
 20  vocabulary_1.0   3911 non-null   float64
 21  vocabulary_1.5   3911 non-null   float64
 22  vocabulary_2.0   3911 non-null   float64
 23  vocabulary_2.5   3911 non-null   float64
 24  vocabulary_3.0   3911 non-null   float64
 25  vocabulary_3.5   3911 non-null   float64
 26  vocabulary_4.0   3911 non-null   float64
 27  vocabulary_4.5   3911 non-null   float64
 28  vocabulary_5.0   3911 non-null   float64
 29  phraseology_1.0  3911 non-null   float64
 30  phraseology_1.5  3911 non-null   float64
 31  phraseology_2.0  3911 non-null   float64
 32  phraseology_2.5  3911 non-null   float64
 33  phraseology_3.0  3911 non-null   float64
 34  phraseology_3.5  3911 non-null   float64
 35  phraseology_4.0  3911 non-null   float64
 36  phraseology_4.5  3911 non-null   float64
 37  phraseology_5.0  3911 non-null   float64
 38  grammar_1.0      3911 non-null   float64
 39  grammar_1.5      3911 non-null   float64
 40  grammar_2.0      3911 non-null   float64
 41  grammar_2.5      3911 non-null   float64
 42  grammar_3.0      3911 non-null   float64
 43  grammar_3.5      3911 non-null   float64
 44  grammar_4.0      3911 non-null   float64
 45  grammar_4.5      3911 non-null   float64
 46  grammar_5.0      3911 non-null   float64
 47  conventions_1.0  3911 non-null   float64
 48  conventions_1.5  3911 non-null   float64
 49  conventions_2.0  3911 non-null   float64
 50  conventions_2.5  3911 non-null   float64
 51  conventions_3.0  3911 non-null   float64
 52  conventions_3.5  3911 non-null   float64
 53  conventions_4.0  3911 non-null   float64
 54  conventions_4.5  3911 non-null   float64
 55  conventions_5.0  3911 non-null   float64
dtypes: float64(54), object(2)
memory usage: 1.7+ MB

As you can see, all of the dummy variables are now present in the DataFrame df in 54 different columns. However, Transformers assumes that a single column called labels contains all of the “labels” (i.e., the correct answers), so I’m going to create a labels column to hold all 54 numbers for a given essay:

# Gather the 54 dummy-variable values of each essay into a single 'labels' column,
# then drop the individual dummy columns
labels = df.columns[2:]
df['labels'] = [df.iloc[i][labels].to_numpy() for i in df.index]
df = df.drop(labels=labels, axis=1)
df

Looks good! You can see the newly added labels column on the right-hand side of the DataFrame, holding each essay’s one-hot-encoded scores as a single vector. I’m going to print the first row:

df.iloc[0]['labels']
array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
       0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0,
       0.0, 0.0], dtype=object)

As expected, most values are 0.0 since only one category per scoring measure should have a value of 1.0.
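
As a quick sanity check (illustrative), each labels vector should therefore sum to exactly 6, one 1.0 per scoring measure:

# Every essay should have exactly six 1.0s in its labels vector (one per scoring measure)
assert all(sum(row) == 6 for row in df['labels'])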

Next, convert the DataFrame to the Dataset object that Transformers uses:

from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)
ds
Dataset({
    features: ['text_id', 'full_text', 'labels'],
    num_rows: 3911
})

You can see that ds contains all of the DataFrame’s columns and the correct number of rows.

Tokenization & Numericalization

To perform tokenization and numericalization, the next step is to pick a pre-trained model. I want to use something small (deberta-v3-small) so that it runs fast and I can iterate quickly. There are several ways you can access this model. If internet is turned ON in your Kaggle notebook, you can use trained_model = 'microsoft/deberta-v3-small' in lieu of the code below. If internet is OFF, add this public Kaggle dataset to your notebook’s input data and use trained_model = '../input/debertav3small'. If you intend to submit your notebook to the competition’s Leaderboard, you will eventually need to turn internet off, so the latter option is best.

Create a tokenizer for this model:

trained_model = '../input/debertav3small'

from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(trained_model)

Encoding means translating text to numbers. It’s a 2-step process: (1) tokenization, (2) numericalization. Tokenization is when the text is split into tokens, which can be words or sub-words. Numericalization is when each token is converted into a number based on a given vocabulary.

This tokenizer can split the student essays, or any text, into tokens:

tokz.tokenize('Hello everyone! I hope you are having a wonderful day.')
['▁Hello',
 '▁everyone',
 '!',
 '▁I',
 '▁hope',
 '▁you',
 '▁are',
 '▁having',
 '▁a',
 '▁wonderful',
 '▁day',
 '.']

The underscore represents the start of a new word.
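
Numericalization is handled by calling the tokenizer directly, which returns the token IDs. Here’s a minimal sketch (the exact IDs depend on the model’s vocabulary, and special tokens are added at the start and end):

enc = tokz('Hello everyone! I hope you are having a wonderful day.')

# Token IDs for the sentence, including the tokenizer's special tokens
print(enc['input_ids'])

# Map the IDs back to tokens to see the correspondence
print(tokz.convert_ids_to_tokens(enc['input_ids']))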

Now I’m going to tokenize every student essay in the full_text column using batched=True, which speeds things up by applying the tokenization function tok_func to batches of the dataset rather than one element at a time. I’m also going to define tok_func to truncate each essay to a maximum length of 512 tokens (to speed up training):

def tok_func(x):
    return tokz(x['full_text'], max_length=512, truncation=True)

tok_ds = ds.map(tok_func, batched=True)
tok_ds
Dataset({
    features: ['text_id', 'full_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 3911
})

Notice that there are 3 new columns: input_ids, token_type_ids, and attention_mask.

  • input_ids contains a vector of integers corresponding to the tokens in each student’s essay, based on the tokenizer’s vocabulary
  • token_type_ids is mainly used for tasks on sentence pairs (e.g., question answering), where 0s mark the first sentence and 1s mark the second (for models that accept them)
  • attention_mask contains 0s and 1s (think of how you would use a Boolean mask): tokens marked 1 should be attended to and tokens marked 0 should be ignored; the 0s correspond to padding tokens, added when shorter sequences are padded to a common length at batching time (see the quick inspection sketch below)
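
Here’s a quick look at the encoded form of the first essay (illustrative; the exact values depend on the tokenizer and the essay):

row = tok_ds[0]

print(len(row['input_ids']))               # number of tokens (capped at 512 by truncation)
print(row['input_ids'][:10])               # first few token IDs
print(tokz.decode(row['input_ids'][:10]))  # the text they correspond to
print(row['attention_mask'][:10])          # all 1s here; padding 0s only appear once batches are padded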

Training, Validation, and Test Sets

I’m going to split the dataset into 75% and 25% for the training set and validation set, respectively. I’m going to use a DatasetDict to hold these datasets since that’s what Transformers uses. Note that the “test” dataset below actually refers to the validation set, so I’m doing some renaming to make its purpose clear:

train_valid = tok_ds.train_test_split(0.25, seed=42)
ds_dict = DatasetDict({
    'train': train_valid['train'],
    'valid': train_valid['test']
})
ds_dict
DatasetDict({
    train: Dataset({
        features: ['text_id', 'full_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 2933
    })
    valid: Dataset({
        features: ['text_id', 'full_text', 'labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 978
    })
})

I also need to prepare the test set. Repeat the processing steps (performed in the previous section) for the test set:

test_df = pd.read_csv(path + 'test.csv')

test_ds = Dataset.from_pandas(test_df).map(tok_func, batched=True)

Note that the test set only has 3 rows. Since this is a Kaggle code competition, the notebook itself is submitted and it will be internally run with a secret test set. This 3-row test set is just a formality to check that the notebook executes without error and can successfully create a submission.csv file.
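
You can confirm this with a quick look at the file (illustrative):

# The public test set is just a placeholder with a handful of rows
print(test_df.shape)
test_df.head()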

Model Training

Now I’m almost ready to train the model! As described on the competition page, the evaluation metric is the mean columnwise root mean squared error (MCRMSE):

$$\textrm{MCRMSE} = \frac{1}{N_{t}}\sum_{j=1}^{N_{t}}\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_{ij} - \hat{y}_{ij})^2}$$

where $N_t$ is the number of scored ground truth target columns (a total of 6), and $y$ and $\hat{y}$ are the actual and predicted values, respectively.

Since the MCRMSE is based on the actual scores (not the one-hot-encoded categories), I need to write a function that reconstructs the scores from the one-hot-encoded columns by taking a weighted average of the score levels, using the model’s output for each category as the weights:

# Vector of the score level corresponding to each of the 54 dummy columns:
# 1.0, 1.5, ..., 5.0, repeated for each of the 6 scoring measures
import numpy as np
levels_9 = np.arange(2, 11) / 2   # [1.0, 1.5, ..., 5.0]
levels_54 = np.tile(levels_9, 6)

def construct_scores(weights):
    levels_matrix = np.tile(levels_54, (len(weights), 1))
    weights_levels = weights * levels_matrix
    weighted_avg = np.array([np.sum(weights_levels[:, 0:9], axis=1), np.sum(weights_levels[:, 9:18], axis=1), 
                             np.sum(weights_levels[:, 18:27], axis=1), np.sum(weights_levels[:, 27:36], axis=1), 
                             np.sum(weights_levels[:, 36:45], axis=1), np.sum(weights_levels[:, 45:54], axis=1)])
    weighted_avg = np.transpose(weighted_avg)
    return weighted_avg
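
As a quick check (a minimal sketch reusing df_original and target_cols from earlier), passing an essay’s one-hot labels vector through construct_scores should reproduce its original scores exactly:

# One-hot labels put all the weight on a single level per measure,
# so the weighted average recovers the original scores
onehot = np.array([df.iloc[0]['labels']], dtype='float64')   # shape (1, 54)
print(construct_scores(onehot))                              # reconstructed scores, shape (1, 6)
print(df_original.loc[0, target_cols].values)                # original scores for the same essay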

Next, I’m going to define a function called mcrmse_d that returns a dict, which maps strings (name of returned metric) to floats (the metric’s value), since that’s what Transformers expects. This function must take an EvalPrediction object, which is a named tuple with a predictions field and a label_ids field.

Before the MCRMSE metric can be calculated, I need to first convert the logits (raw output from the model; see Hugging Face Transformers course for more info) into probabilities using a sigmoid layer, then reconstruct the scores using the probabilities as weights.

Convert raw outputs to the MCRMSE metric:

import torch

def mcrmse_d(eval_pred):

    # Extract the raw predictions (logits) and the actual labels
    logits = eval_pred.predictions
    labels = eval_pred.label_ids

    # Sigmoid layer to output prediction probabilities between 0 and 1 for the dummy variables
    sigmoid = torch.nn.Sigmoid()
    prob = sigmoid(torch.Tensor(logits)).numpy()

    # (Re)construct scores using a weighted average
    predic_scores = construct_scores(prob)
    actual_scores = construct_scores(labels)

    # Calculate MCRMSE: RMSE per scoring measure (column), then the mean across the 6 measures
    mcrmse = np.mean(np.sqrt(np.mean(np.square(predic_scores - actual_scores), axis=0)))

    return {'mcrmse': mcrmse}

I’m going to define some training hyperparameters: the batch size, the number of epochs, and the learning rate. The batch size should fit the GPU I’m using, and I should select a small number of epochs so that I can iterate quickly. For the learning rate, I used trial and error to find a value that works well. Feel free to experiment with different values. This is what I used:

bs = 16
epochs = 4
lr = 8e-5

Create a TrainingArguments object, which will contain all of the hyperparameters needed for training and evaluation:

from transformers import TrainingArguments, Trainer

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy='epoch', per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

The only required argument is the directory where the trained model will be saved; in this case, 'outputs'.

Create the model, using the trained_model we specified earlier as well as the number of labels:

model = AutoModelForSequenceClassification.from_pretrained(trained_model, problem_type='multi_label_classification', num_labels=54)

Create the Trainer, where train_dataset refers to the training set and eval_dataset refers to the validation set:

trainer = Trainer(model, args, train_dataset=ds_dict['train'], eval_dataset=ds_dict['valid'], tokenizer=tokz, compute_metrics=mcrmse_d)

At this point, you should turn on the GPU setting in Kaggle, if not enabled already. Here’s my GPU info:

!nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    34W / 250W |   1315MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Finally, train the model:

trainer.train()

Training took about 9 minutes, and yielded a validation loss of 0.25 and MCRMSE of 0.53 at the last epoch. At this point, you can save the model if you wish: trainer.save_model("trainer").
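
If you do save it, the model (and the tokenizer, since it was passed to the Trainer) can be reloaded later with from_pretrained. A minimal sketch, using the same 'trainer' directory as above:

# Reload the fine-tuned model and tokenizer from the saved directory
reloaded_model = AutoModelForSequenceClassification.from_pretrained(
    'trainer', problem_type='multi_label_classification', num_labels=54)
reloaded_tokz = AutoTokenizer.from_pretrained('trainer')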

Test Predictions

Now that I’ve concluded training, I’m going to calculate predictions on the test set:

preds = trainer.predict(test_ds).predictions.astype(float)

Recall that these outputs are raw logits over the 54 one-hot-encoded categories, not actual scores.

Convert to actual scores:

# Sigmoid layer to output prediction probabilities between 0 and 1 for dummy variables
sigmoid = torch.nn.Sigmoid()
prob = sigmoid(torch.Tensor(preds))
prob = torch.Tensor.numpy(prob)

# Reconstruct scores using a weighted average
predic_scores = construct_scores(prob)

Almost done!

Create a submission file:

import datasets

submission = datasets.Dataset.from_dict({
    'text_id': test_ds['text_id'],
    'cohesion': predic_scores[:, 0],
    'syntax': predic_scores[:, 1],
    'vocabulary': predic_scores[:, 2],
    'phraseology': predic_scores[:, 3],
    'grammar': predic_scores[:, 4],
    'conventions': predic_scores[:, 5]
})

submission.to_csv('submission.csv', index=False)
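
Before submitting, it doesn’t hurt to read the file back and confirm it has the expected columns (illustrative):

# The submission should have text_id plus the six score columns
pd.read_csv('submission.csv').head()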

And that’s it! Since this is a code competition, we don’t need to upload a CSV file to Kaggle; instead, we submit the notebook and it will be run on a secret test set to determine scoring. The notebook is required to have a submission.csv output. If you run into errors during notebook submission, check out Code Competitions’ Errors & Debugging Tips.

At the time of my submission, the top score on the Leaderboard was 0.43; my submission scored 0.50. Looking at the MCRMSE equation, this metric roughly reflects how far the predicted grades are from the correct ones on average. For example, if every prediction were off by exactly 0.5, the MCRMSE would be 0.5. Getting within about half a grade point is pretty good, considering that two human graders (teachers) may not agree within 0.5 on the same essay.

Resources