Email Spam Detection using Pre-Trained BERT Model : Part 1 - Introduction and Tokenization
Recently I have been looking into transformer-based machine learning models for natural language tasks. The field of NLP has changed tremendously in the last few years, and I have been fascinated by the new architectures and tools that have come out in that time. The transformer is one such architecture.
As the frameworks and tools for building transformer models keep evolving, documentation often becomes stale and blog posts confusing. For any one topic, you may find multiple approaches, which can confuse a beginner.
So as I learn these models, I plan to document the steps for a few of the important tasks in the simplest way possible. This should help any beginner like me pick up transformer models.
In this two-part series, I will discuss how to train a simple model for email spam classification using the pre-trained transformer model BERT. This is the first post in the series, where I will discuss transformer models and prepare our data. You can read all the posts in the series here.
The transformer is a neural network architecture first introduced by Google in 2017. This architecture has proven extremely effective at learning various tasks. Some popular models based on the transformer architecture are BERT, DistilBERT, GPT-3, and ChatGPT.
You can read more about transformer models at the link below.
Pre-Trained Language Model and Transfer Learning
A pre-trained language model is a transformer model that has been trained on a large amount of language data for specific tasks.
The idea behind using a pre-trained model is that the model already has a really good understanding of language, which we can borrow for our NLP task as is, and then focus on training only the part of the model unique to our task. This is called transfer learning. You can read more about transfer learning at the link below.
Google Colab is a hosted Jupyter Python notebook with access to a GPU runtime. As transformer models perform extremely well on GPUs, we are going to use Google Colab for our examples. You can get the community version by signing in with your Google credentials.
The first step is to install the libraries. These libraries come from Hugging Face, a company that provides tools to simplify building transformer-based models.
- transformers - provides all the pre-trained models and the tools to train a model
- datasets - provides tools to load and use datasets in the form required by the above models
- evaluate - a helper library to calculate metrics during training
Email Spam Data and Preparation
In this section of the post, we will discuss our spam data and its preparation.
1. Spam Data
For our example, we are going to use the email spam data from the link below.
The data has two important fields:
- v2 - content of the email
- v1 - label which indicates whether it is spam or not
Please download the data from Kaggle and upload it to your instance of Google Colab.
2. Loading Data to Dataframe
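Loading can be sketched with pandas as below. The file name `spam.csv` and the latin-1 encoding are assumptions based on the usual Kaggle download; adjust them to match your upload (the fallback sample here only keeps the snippet runnable without the file).

```python
import pandas as pd
from pathlib import Path

# File name assumed to be "spam.csv" (as in the Kaggle download); adjust to your upload
DATA_PATH = "spam.csv"

if Path(DATA_PATH).exists():
    # The Kaggle file is typically latin-1 encoded
    df = pd.read_csv(DATA_PATH, encoding="latin-1")
else:
    # Tiny fallback sample so the snippet runs without the Kaggle file
    df = pd.DataFrame({"v1": ["ham", "spam"],
                       "v2": ["Hi, how are you?", "WINNER!! Claim your prize now"]})

# Keep only the label and text columns
df = df[["v1", "v2"]]
print(df.head())
```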
3. Mapping the Labels
In the data, the labels are “ham” and “spam”. We need to map them to 0 and 1. The code below does this.
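A minimal sketch of the mapping, using a small inline sample in place of the loaded dataframe:

```python
import pandas as pd

# Small sample standing in for the loaded dataframe
df = pd.DataFrame({"v1": ["ham", "spam", "ham"],
                   "v2": ["Hi, how are you?", "WINNER!! Claim your prize now", "See you at 5"]})

# Map the text labels to integers: ham -> 0, spam -> 1
df["label"] = df["v1"].map({"ham": 0, "spam": 1})
print(df[["v2", "label"]])
```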
4. Generating Different Datasets
Once we have mapped the labels, we will create train, test, and validation sets.
In the above code, we use Dataset.from_pandas to create Hugging Face compatible datasets, which we will use in the next steps.
To use any pre-trained model, one of the prerequisites is that we apply the model's own tokenization to our dataset. This ensures that the model can take our data as input.
1. Download Tokenizer
The first step in tokenization is to download the right tokenization model.
The Hugging Face transformers library provides a helper class called AutoTokenizer. This class provides the from_pretrained method, which downloads the tokenization model from the Hugging Face repository. The model we are using is the base BERT model trained on uncased data.
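A minimal sketch of the download, using the `bert-base-uncased` checkpoint (the uncased base BERT model described above):

```python
from transformers import AutoTokenizer

# Download the tokenizer for the uncased base BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Quick check: tokenize a single message
enc = tokenizer("WINNER!! Claim your prize now")
print(enc["input_ids"])
```

Every sequence is wrapped with BERT's special [CLS] and [SEP] tokens, which is why the id list starts with 101.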
2. Tokenize Datasets
Once the tokenizer is downloaded and ready to use, we can tokenize our datasets.
In the above code, we use tokenize_function, which selects the column that has the text data. Then, using the map function, tokenization is applied to each batch.
The complete code for this post is in the Google Colab notebook below.
You can also access the Python notebook on GitHub.
In this post, we learned what transformer models are. We also prepared our dataset with the model's tokenization. In the next post, we will see how to fine-tune the model.