Fast.ai Lesson 4: Natural Language Processing (NLP) Fundamentals: The Engine Behind LLMs
Tokenization, Numericalization, and Training Your First Classifier Model with Fastai and Hugging Face
In the previous chapter, we built an image classifier from scratch. This chapter, which corresponds to Lesson 4 of the video course and Chapter 10 of the book, shifts our focus from image classification to Natural Language Processing (NLP). NLP is the bedrock of powerful Large Language Models (LLMs) like ChatGPT and Gemini.
In this tutorial, we will explore:
Text Preprocessing
Tokenization (including using Fastai)
Numericalization
Training a Classifier Model
Handling Sequential Inputs
Unlike the fixed 28×28 pixel inputs of image classifiers, language inputs (words and sentences) have variable lengths. We must devise a strategy for our models to process these sequential inputs.
Converting images to numbers was simple—we used the pixel values. For text, which our models must also process numerically, we need a robust method to convert text into a numerical format.
Self-Supervised Learning
When training models to predict the next word in a sequence, we don't need external labels. By taking the first part of a sentence as the input and predicting the next word, the data labels itself. This technique is known as self-supervised learning.
Similar to the pre-trained models we used in image classification, NLP benefits from pre-trained models. These models are often trained on vast corpora of text, such as all of Wikipedia, allowing us to leverage prior knowledge.
Text Preprocessing
The first step in NLP is preparing the text. We start by concatenating all our documents into one long string and then breaking it down into individual units called "Tokens" (words, subwords, etc.).
In a self-supervised task like next-word prediction, we define our variables as follows:
Independent Variable (X): The sequence of words starting from the first word up to the second-to-last word.
Dependent Variable (Y): The sequence of words starting from the second word up to the very last word.
This allows the model to input the first word to predict the second, then input the first two words to predict the third, and so on.
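To make this concrete, here is a tiny framework-free sketch (the token list is made up purely for illustration):
tokens = ['in', 'this', 'chapter', 'we', 'study', 'nlp']
x = tokens[:-1]  # independent variable: everything except the last token
y = tokens[1:]   # dependent variable: the same sequence shifted one token to the left
for i in range(len(x)):
    print(x[:i+1], '->', y[i])  # the model sees the prefix and must predict the next token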
The main steps we'll take to build our NLP model are:
Tokenization: Converting all documents into a list of words, subwords, or characters.
Numericalization: Mapping each token to its index in a predefined dictionary, or Model Vocabulary, to convert text into numbers.
Language Model Data Loader: Using Fastai's LMDataLoader to automatically handle the text input and create the independent and dependent variables.
Model Creation: Employing a model capable of handling variable-sized inputs, such as a Recurrent Neural Network (RNN).
Tokenization
Tokenization is the process of breaking text down into pieces. The challenge lies in dealing with punctuation, special symbols, and contractions like "don't." Furthermore, languages like Chinese don't have clear word boundaries, and others combine multiple words into one. There is no one-size-fits-all approach.
The three primary tokenization techniques are:
1. Word-Based Tokenization
Sentences are split using spaces as the primary separator. Most punctuation marks (periods, commas, dashes, etc.) are also treated as separators.
2. Sub-Word Based Tokenization
Words are divided into the most frequently occurring substrings. This approach has gained significant traction as it is language-agnostic, working well with languages like Chinese and Japanese that lack spaces.
3. Character-Based Tokenization
The sentence is split into individual characters.
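As a rough, hand-rolled illustration of the difference (real tokenizers are far more careful, and a subword vocabulary has to be learned from a corpus first):
sentence = "Tokenization isn't trivial."
word_tokens = sentence.split(' ')  # word-based (naive): ["Tokenization", "isn't", "trivial."]
char_tokens = list(sentence)       # character-based: ['T', 'o', 'k', 'e', 'n', ...]
subword_tokens = ['Token', 'ization', " isn't", ' trivial', '.']  # subword: made-up pieces for illustration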
Tokenization with Fastai
For word tokenisation, fastai provides the WordTokenizer class. Fastai uses spaCy as its default tokenizer. We could call spaCy directly, but it's better to use WordTokenizer because it will always pick whatever fastai currently considers the best default tokenizer.
from fastai.text.all import *  # provides WordTokenizer, Tokenizer, Numericalize, first, etc.

spacy = WordTokenizer()
first(spacy(['The U.S. dollar $1 is $1.00.']))
# ['The','U.S.','dollar','$','1','is','$','1.00','.']
As you can see, the tokenizer is smart enough to keep the dots in "U.S." and "$1.00" together rather than splitting them apart. Fastai then adds some additional functionality on top of that with the Tokenizer class. For example, fastai adds the token "xxbos" to mark the beginning of a text, "xxmaj" if the next word starts with a capital letter, and "xxunk" if the next word is unknown (not in the vocabulary).
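Here is what that looks like in practice (a minimal sketch; the exact output depends on your fastai version):
tkn = Tokenizer(WordTokenizer())
tkn('This movie was surprisingly good!')
# something like: ['xxbos', 'xxmaj', 'this', 'movie', 'was', 'surprisingly', 'good', '!']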
Choosing the right vocabulary size is crucial: too small, and too many words will be replaced by the unknown token; too large, and it will consume too much memory, making the model harder to train.
A common practice is to replace words with a low frequency (e.g., fewer than 3 occurrences) with the unknown token (xxunk). This keeps the vocabulary size manageable and ensures the model has seen a word often enough to learn its true meaning.
sp = SubwordTokenizer(vocab_sz=10000)  # learn a vocabulary of 10,000 subword pieces
sp.setup(txts)                         # trains the underlying SentencePiece model on our raw texts
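To see how vocab_sz changes the granularity of the pieces, the course notebook wraps this in a small helper. A sketch along those lines, assuming txts is our list of raw texts and txt is one sample document:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)                          # train the subword vocabulary on the corpus
    return ' '.join(first(sp([txt]))[:40])  # tokenize one document and show the first 40 pieces

subword(10000)  # large vocab: most pieces are whole words
subword(200)    # tiny vocab: pieces shrink towards individual characters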
In fact, character-level tokenisation is essentially what you get when you shrink the subword vocabulary far enough, because the only pieces left to the tokenizer are individual characters. Once we have created the tokens, we will need to convert them into numbers.
Numericalisation
Numericalisation is the process of mapping tokens to numbers. This happens in two steps:
First, we create a list of the whole vocabulary.
Then, we replace every token with its index in that vocab list.
tkn = Tokenizer(spacy)  # wrap the word tokenizer with fastai's Tokenizer to get the special tokens
toks = txts.map(tkn)    # tokenize every document in the corpus
num = Numericalize()
num.setup(toks)         # build the vocabulary from the tokenized texts
Fastai's Numericalize() has default values of min_freq=3 and max_vocab=60000: tokens that appear fewer than three times are mapped to xxunk, and the vocabulary is capped at 60,000 tokens.
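With tokens and numbers in hand, fastai's LMDataLoader can build the shifted (independent, dependent) pairs for us. A minimal sketch, continuing from the toks and num objects above:
nums = toks.map(num)     # numericalize every tokenized document
dl = LMDataLoader(nums)  # concatenates the stream and cuts it into fixed-length mini-batches
x, y = first(dl)
x.shape, y.shape         # same shape (batch size x sequence length); y is x shifted by one token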
Training a classifier model
We are going to create a submission for the U.S. Patent Phrase to Phrase Matching Kaggle competition. In this competition, we are provided with a CSV file in which each row contains two phrases, their matching score, and a category. We are tasked with comparing the two phrases and scoring how similar they are, given the category in which they were used. A score of 1 means the two inputs have identical meaning, and 0 means they have totally different meanings. For instance, "abatement" and "eliminating process" have a score of 0.5, meaning they're somewhat similar, but not identical.
We could convert this to a classification problem with three classes (identical, similar, different). In the code below, though, we follow the course notebook and predict the score directly as a single continuous output (num_labels=1), which is why Pearson correlation is used as the metric. We are going to leverage Hugging Face Transformers for this problem.
First, let's load the data from the CSV into a pandas DataFrame. Then we will create a column named "input" that combines both phrases and the category.
import pandas as pd

df = pd.read_csv('/kaggle/input/us-patent-phrase-to-phrase-matching/train.csv')
# Combine the category (context) and the two phrases (target, anchor) into a single string
df['input'] = 'Text1:' + df.context + '; Text2:' + df.target + '; Anc1:' + df.anchor
Next, we will convert it to a Hugging Face Dataset for batch processing and random shuffling.
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
We are going to use the "microsoft/deberta-v3-small" model, which comes pre-trained on a large text corpus and which we will fine-tune for this task. We need to make sure that we use the same vocab the model was trained with, otherwise the results will be garbage.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_nm = 'microsoft/deberta-v3-small'
tokz = AutoTokenizer.from_pretrained(model_nm)  # load the tokenizer that matches the model

def tokenize(x):
    return tokz(x['input'])  # returns input_ids (and attention_mask) for each row

tok_ds = ds.map(tokenize, batched=True)  # tokenize the whole dataset in batches
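It's worth peeking at what this tokenizer actually produces. DeBERTa uses a SentencePiece-style subword vocabulary where a special underscore character marks the start of a word (the exact pieces may differ slightly between tokenizer versions):
tokz.tokenize("Hello, patent phrase matching!")
# something like: ['▁Hello', ',', '▁patent', '▁phrase', '▁matching', '!']

tok_ds[0]['input_ids'][:10]  # the numeric token ids the model actually consumes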
Transformers expects the target column to be named "labels", so we need to rename our "score" column. We will then divide our dataset into two parts, a training set and a validation set.
tok_ds = tok_ds.rename_columns({'score': 'labels'})
dds = tok_ds.train_test_split(0.25, seed=42)
Now, we have our data ready and we can train our model on it.
from transformers import TrainingArguments, Trainer
bs = 128
lr = 8e-5
epochs = 4
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
eval_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
num_train_epochs=epochs, weight_decay=0.01, report_to='none')
We don't need to worry about arguments other than the three we declared above; the rest are quite standard and rarely need changing. We will use the Pearson correlation coefficient as our evaluation metric. It measures the linear relationship between two variables, which suits this similarity-scoring task well.
import numpy as np

def corr(x, y): return np.corrcoef(x, y)[0][1]  # Pearson correlation coefficient between two sequences
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}  # the Trainer passes (predictions, labels) here
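A quick sanity check with made-up numbers shows how these two functions behave:
preds  = np.array([0.1, 0.4, 0.8])
labels = np.array([0.0, 0.5, 0.75])
corr(preds, labels)      # close to 1.0 when predictions track the true scores
corr_d((preds, labels))  # {'pearson': ...} is the shape the Trainer expects from compute_metrics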
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)  # a single output value: the predicted score
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'], tokenizer=tokz, compute_metrics=corr_d)
trainer.train()
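After training, generating a submission follows the same preprocessing path. A rough sketch along the lines of the course notebook (the test.csv path and column handling mirror the training set and are assumptions here):
eval_df = pd.read_csv('/kaggle/input/us-patent-phrase-to-phrase-matching/test.csv')
eval_df['input'] = 'Text1:' + eval_df.context + '; Text2:' + eval_df.target + '; Anc1:' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tokenize, batched=True)

preds = trainer.predict(eval_ds).predictions.astype(float)
preds = np.clip(preds, 0, 1)  # the regression head can stray outside [0, 1], so clamp the scores

submission = pd.DataFrame({'id': eval_df['id'], 'score': preds.squeeze()})
submission.to_csv('submission.csv', index=False)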
As you can see, most of the work is taken care of by the libraries, and we don't have to do it manually. But it's good to understand what's going on behind the scenes. You can access my whole notebook here.
What broke down
The notebook that Jeremy provided with the lecture is quite good and runs almost without any issues. I faced only one issue, with the parameters we pass to TrainingArguments.
Instead of evaluation_strategy, we need to pass eval_strategy.
With this one change, you are good to go.
Wrap up
That's a wrap, guys. In this chapter, we explored how NLP pipelines work: from text preprocessing to tokenisation, numericalisation, and finally training a transformer-based classifier using Hugging Face.
Key takeaways:
Text needs special handling since it’s not fixed-size.
Tokenisation strategies vary (word, subword, char).
Numericalisation is essential to bridge text and models.
Huggingface + Fastai simplify NLP model training.
P.S. If you enjoyed this post, consider subscribing to my mailing list. I'll be writing one blog for each chapter of the Fast.ai course, along with my experiments, as well as posts on career advice and startups.
Don’t worry—I won’t spam you with cat memes… unless you’re into that 🐱. Just practical insights, resources, and maybe a joke or two to keep things fun.