Introduction

Natural Language Processing (NLP) drives many applications and tools that we use every day, such as translation, personal assistants, and chatbots. Recent advances in NLP (BERT, ULMFiT, XLNet, etc.) have made it possible to build models that surpass human baseline performance on widely used NLP benchmarks like GLUE1 for language understanding or SQuAD2 for reading comprehension. In this post, we will focus on the problem of text classification, which is essential to many applications such as sentiment analysis, web search, and information categorization.

Data Preparation

The amount of published news data is massive. News appears on many different websites and on Twitter (500 million tweets per day3), and manually categorizing this stream is not feasible, which makes automatic news classification a challenging and useful problem.

Data collection

To build a text classifier for news headlines, we first start by collecting data: a collection of news items with associated labels (the type of news). News API (https://newsapi.org) allows us to search for articles from 30,000 worldwide sources, with the possibility to specify the news category. I was planning to use this tool to build the dataset, but the limitations of the Developer plan (500 requests per day and 1-month historical data) make it impractical for building a large dataset. An alternative solution is a publicly available dataset from HuffPost4 that contains around 200k news headlines with their corresponding categories.

Data Exploration and cleaning

All steps for data exploration and cleaning are provided in the notebook available at notebooks/data_prepation.ipynb

The initial dataset contains 41 unbalanced categories. The following pie chart provides the percentage of headlines associated with each category.

News categories and associated percentages

Many categories refer to the same underlying topic (like TASTE and FOOD & DRINK). After merging these categories, we end up with 34 categories.

The initial dataset contains other features such as the author of the article and the date of publication. Since we want to build a generic news classifier, we only use the headline concatenated with the short description:

df['text'] = df.headline + " " + df.short_description

After removing non-ASCII characters, we can observe the distribution of sentence lengths.

Distribution of sentence lengths
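As a rough sketch, the cleaning and length computation could look like the following (a simple ASCII filter; the exact steps used in the notebook may differ):

# drop non-ASCII characters with a simple encode/decode round-trip
df['text'] = (df['text'].str.encode('ascii', errors='ignore')
                        .str.decode('ascii'))

# distribution of sentence lengths, in words
df['text'].str.split().str.len().hist(bins=50)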

Text cleaning & Tokenization

To build simple features (such as binary features, TF-IDF, or embeddings) from the text defined above, we have to do some preprocessing. In our case, this preprocessing consists of:

  • Stop word removal: remove the most frequent words that often do not carry much meaning, like ‘the’, ‘a’, ‘for’…
  • Lemmatization: return the base or dictionary form of a word.

This preprocessing can be achieved with the NLTK or spaCy libraries. In our case, I used spaCy to tokenize and lemmatize.

import en_core_web_md

# define the spaCy model; we disable "tagger", "parser" and "ner" to make processing faster
nlp = en_core_web_md.load(disable=["tagger", "parser", "ner"])

def keep_token(t):
    # keep alphabetic tokens; drop stop words, punctuation, numbers and spaces
    return (t.is_alpha and
            not (t.is_space or t.is_punct or
                 t.is_stop or t.like_num))

def lemmatize_doc(doc):
    # return the lemma of each kept token
    return [t.lemma_ for t in doc if keep_token(t)]

# apply on the text field to get clean_text
df['clean_text'] = df.text.apply(lambda x: ' '.join(lemmatize_doc(nlp(x.lower()))))

The dataset at this step consists of 3 columns: text, clean_text and category.

Head of the data

For the language model (discussed later), we will use a different tokenization process adapted to that specific algorithm/model.

Saving data - Feather

We use Feather (https://github.com/wesm/feather) to export the data. Feather provides binary serialization for data frames and is designed for fast reading and writing. Our dataset contains more than 200,000 observations. For faster machine learning experimentation and iteration, we save two versions of the data, as sketched below:

  • A balanced dataset with 1004 observations per class. This dataset will be used to test models quickly and iterate faster while developing the ML solution.
  • The full dataset, which will be used to train the ML model deployed in production.
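A minimal sketch of the export step, assuming pandas with Feather support (file names and the sampling seed are illustrative):

# Feather requires a default integer index
df = df.reset_index(drop=True)
df.to_feather('data/news_full.feather')

# balanced subset: sample the same number of observations per category
df_balanced = (df.groupby('category', group_keys=False)
                 .apply(lambda g: g.sample(1004, random_state=42))
                 .reset_index(drop=True))
df_balanced.to_feather('data/news_balanced.feather')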

Baseline model

Before jumping to complex models, we have to build a simple baseline: a basic machine learning model that performs the required task. For our news classification problem, we build a baseline using term frequency–inverse document frequency (TF-IDF) for feature extraction, combined with a simple classifier such as logistic regression or SVM.

Feature extraction - TF-IDF

Feature extraction from text consists of representing documents (news items) with numerical vectors, and TF-IDF is one way to obtain such a representation. It is the product of a term frequency and an inverse document frequency. The term frequency counts how many times a word occurs in a given document (a news item in our case). The inverse document frequency down-weights words that occur in many documents of the corpus. TF-IDF thus weights words according to how informative they are: words used frequently in many documents get a lower weight, while infrequent ones get a higher weight. This process can be split into two steps:

  • Count Vectorizer5: converts a collection of text documents into a sparse matrix. Each row represents a document (news item) and each column corresponds to a word from the vocabulary built over all documents. $TF_{occurrence}(document_i, word_j) =$ number of occurrences of word $j$ in document $i$. The scikit-learn implementation provides options to control the frequency thresholds for words to be kept in the vocabulary.
  • TF-IDF transformer6: Transforms the sparse matrix into tf-idf representation by weighting the term-frequencies as follows:

$$ w(document_i, word_j) = TF(document_i, word_j) * \log(\frac{1+N}{1+ df_j}) $$

where $df_j$ is the number of documents containing $word_j$ and $TF(document_i, word_j)$ is the term frequency, defined as $TF_{occurrence}$ divided by the total number of tokens in $document_i$. Finally, we normalize the rows of the sparse matrix (the vectors that represent each document).

This procedure (TF-IDF) is applied to the clean_text column obtained in the data preparation step.
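As a minimal sketch, both steps can be done at once with scikit-learn's TfidfVectorizer (the frequency thresholds below are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

# CountVectorizer + TfidfTransformer combined in one estimator;
# min_df/max_df control which words are kept in the vocabulary
tfidf = TfidfVectorizer(min_df=5, max_df=0.5)
X = tfidf.fit_transform(df['clean_text'])  # sparse matrix (n_documents, vocab_size)
y = df['category']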

Machine Learning classifier

  • Algorithms

After building a numerical representation of the text, we can try to train a classifier. We use two traditional and commonly used classifiers: SVM and Logistic regression.

1. Logistic Regression

The goal of logistic regression is to build a linear classifier that predicts probabilities of belonging to one class or another by modeling $p(y|x)$ (discriminative learning algorithm). In our case, we have multiple classes. One commonly used technique is to train a One-vs-All classifier for each class.

Let us imagine we have $K$ classes in our training dataset $(\bold{x _j}, y _j) _{j=1}^{N}$ where $y _j \in \lbrace 1,…,K \rbrace $.

For each class $i$, we will train a binary logistic regression (explained later) by considering the rest of the data as negative samples and all elements in class $i$ as positive samples. This results in $K$ classifiers defined as $h_{\theta _{k}}(\bold{x}) = P(y=k|\bold{x};\theta_k)$ and each classifier is parametrized by a set of parameters $\theta _{k}$.

To make a prediction on $\bold{x}$, we compute $K$ probabilities from $(h _{\theta _{k}}(\bold{x})) _{k=1}^{K}$; the label associated with $\bold{x}$ is then given by

$$ y = \argmax _{i} h _{\theta _{i}}(\bold{x}) $$

Let us consider now the case of binary logistic regression where $y_i \in \lbrace 0,1\rbrace$, the probabilities $p(y|x)$ are given by the following model:

$$ p(y=1|\bold{x} ; \theta) = h _{\theta}(\bold{x}) = \frac{1}{1+e^{-\theta^T \bold{x}}} $$ $$ p(y=0|\bold{x} ; \theta) = 1 - h _{\theta}(\bold{x}) $$

which can be written in a compact way as

$$ p(y|\bold{x};\theta) = h _{\theta}(\bold{x})^y (1-h _{\theta}(\bold{x}))^{(1-y)} $$

To find the optimal $\theta$, we maximize the likelihood of the $N$ observed data points

$$ \theta^* = \argmax _{\theta} \prod _{i=1}^{i=N} p(y _i|\bold{x _i};\theta) $$

which is equivalent to minimizing the negative log-likelihood

$$ \theta^* = \argmin _{\theta} -\sum _{i=1}^{i=N}\log(p(y _i|\bold{x _i};\theta)) $$

and by replacing $p(y|x;\theta) = h _{\theta}(\bold{x})^y (1-h _{\theta}(\bold{x}))^{(1-y)}$ in the negative log-likelihood we find the loss function $l(\theta)$ (logistic loss function) to minimize in order to find the optimal parameter $\theta^*$

$$ l(\theta) = -\sum _{i=1}^{i=N}y _i \log(h _{\theta}(\bold{x _i})) +(1-y _i) \log(1-h _{\theta}(\bold{x _i}) ) $$

One should note that we ignored the bias in the previous computation, but the formulation of the problem remains the same. To include the bias $b$, we can append a constant 1 to $\bold{x}$ and consider a new parameter $\beta = (\theta, b)$ such that $\beta^T \bold{x} = \theta^T \bold{x} + b$, then minimize the loss $l(\beta)$. An L1 or L2 regularization term can also be added to the loss function.

If we consider the case where $y_i \in \lbrace -1, 1 \rbrace$, we obtain a slightly different formulation of the loss function, but the optimization problems remain equivalent. Detailed computations are given in this article.

There exists another approach to multi-class classification, called Multinomial Logistic Regression, which uses a multinomial distribution instead of the binomial distribution of the binary case.
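As a hedged sketch, a One-vs-All logistic regression on the TF-IDF features can be trained with scikit-learn as follows (the split ratio and the value of C are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y)

# multi_class='ovr' trains one binary classifier per class;
# C is the inverse of the regularization strength
clf = LogisticRegression(multi_class='ovr', C=1.0, max_iter=1000)
clf.fit(X_train, y_train)

# predict_proba returns the K probabilities h_theta_k(x);
# the predicted label is the argmax over classes
probs = clf.predict_proba(X_test)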


2. Support Vector Machine 7

For the SVM classifier, we only consider the case of binary classification. A multi-class classifier can be obtained by using One-vs-All classifiers as explained previously.

  • Separable data

Let us consider a training dataset $(\bold{x _i}, y _i) _{i=1}^{N}$ where $y _i \in \lbrace -1, 1 \rbrace$ and the data points are linearly separable (we will discuss the case where data is not separable later).

SVM is a linear classifier that aims to find a hyperplane (a subspace whose dimension is one less than that of the observations $\bold{x}$) that maximizes the distance to the closest data points from both classes: the maximum-margin hyperplane.

A hyperplane is defined by $(w,b)$ as a set of points $H = \lbrace \bold{x} | w^T\bold{x} + b = 0 \rbrace$. The margin $\gamma$ is defined as the distance from the hyperplane $H$ to the closest point from both classes.

Let us consider a point $\bold{x}$ and $\bold{x _p}$ its projection on the hyperplane $H$. Let $\bold{d}$ be a vector of minimum length from $\bold{x _p}$ to $\bold{x}$. $$ \bold{d} = \bold{x} - \bold{x _p} $$

$\bold{d}$ is orthogonal to $H$, therefore $\bold{d}=\alpha w$ for some $\alpha \in \mathbb{R}$.

(Left) Maximum margin hyperplane, $\gamma$ is the margin. (Right) Distance between $x$ and the hyperplane. Ref. [7]

Since $\bold{x _p} \in H$, we have $w^T \bold{x _p} +b = 0$. By replacing $\bold{x _p}$ with $(\bold{x} - \bold{d})$ we can write

$$ w^T (\bold{x} - \bold{d}) +b = w^T (\bold{x}-\alpha w) +b = 0 $$

Which implies $\alpha = \frac{w^T\bold{x} +b}{w^Tw}$ and the distance between $\bold{x}$ and $\bold{x _p}$ is given by the norm of $\bold{d}$ that can be written as

$$ ||\bold{d}|| _2 = (\bold{d}^T \bold{d})^{1 / 2} = (\alpha^2 w^Tw)^{1 / 2} = \frac{|w^T\bold{x} +b|}{||w|| _2} $$

The margin $\gamma$ with respect to some training data points $(\bold{x _i}) _{i=1}^{i=N}$ can be written as

$$ \gamma(w,b) = \min_{\bold{x _i}} \frac{|w^T\bold{x _i} +b|}{||w||_2} $$

Armed with this definition, we can formulate the problem of margin maximization. For a training dataset $(\bold{x _i},y _i) _{i=1}^{i=N}$ such that $y _i \in \lbrace -1, 1 \rbrace$, if $y_i(w^T \bold{x _i}+b) \geq 0$, then the classifier (defined by the hyperplane) makes a correct prediction on $\bold{x _i}$.

The objective is to maximize the margin $\gamma$ under the constraints that all predictions are correct.

$$ \max_{w,b} \gamma(w,b) \text{ such that } \forall {i} \text{ } y_i(w^T \bold{x _i} +b) \geq 0 $$

By plugging the definition of the margin in this optimization problem we obtain

$$ \max _{w,b} \min _{\bold{x _i}} \frac{|w ^T \bold{x _i} +b|}{||w|| _2} \text{ s. t. } \forall {i} \text{ } y _i(w^T \bold{x _i}+b) \geq 0 $$

Since the hyperplane is scale-invariant, in the sense that we can multiply $w$ and $b$ by any real non-null value and still define the same hyperplane, we can rescale $w$ and $b$ such that $\min_{\bold{x _i}} |w ^T\bold{x _i} +b|=1$. Adding this constraint, the optimization problem can be written as

$$ \max _{w,b} \frac{1}{||w|| _2} \text{ s. t. } \forall {i} \text{ } y _i(w^T \bold{x _i}+b) \geq 0 \text{ and } \min _{\bold{x _i}} |w ^T\bold{x _i} +b|=1 $$

Which is equivalent to

$$ \min_{w,b} w^Tw \text{ s. t. } \forall {i} \text{ } y_i(w^T \bold{x _i}+b) \geq 1 $$

This is a quadratic optimization problem with linear constraints, and its solution defines the optimal hyperplane separating the data.

  • Data is not separable

In the case where there is no separating hyperplane between the two classes, we allow the constraints $y_i(w^T \bold{x _i}+b) \geq 1$ to be violated by introducing slack variables $\epsilon_i \geq 0$ and rewriting the constraints as $y_i(w^T \bold{x _i}+b) \geq 1-\epsilon_i$. This means some inputs $\bold{x _i}$ are allowed to be closer to the hyperplane, or even on the wrong side. The optimization problem can be written as

$$ \min _{w,b} w^Tw + C \textstyle\sum _{i=1}^{N} {\epsilon _i } $$ $$ \text{ s. t. } \forall {i} \text{ } y _i(w^T \bold{x _i}+b) \geq 1 - \epsilon _i \; \text {and } \forall {i} \text{ } \epsilon _i \geq 0 $$

In the case where $C \neq 0$: if $y_i(w^T \bold{x _i}+b) \geq 1$, then $\epsilon _i = 0$; and if $y_i(w^T \bold{x _i}+b) < 1$, then at the optimum $\epsilon _i = 1 - y_i(w^T \bold{x _i}+b)$, which means

$$ \epsilon_i = \max(1- y_i(w^T \bold{x _i}+b); 0) $$

If we plug this closed form of $\epsilon _i$ into the optimization problem, we obtain the unconstrained version of the SVM as the sum of a hinge loss and an L2 regularizer

$$ \min _{w,b} w^Tw + C \sum _{i=1}^{N} \max(1- y _i(w^T \bold{x _i}+b); 0) $$
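This unconstrained form is essentially what scikit-learn's LinearSVC minimizes (with loss='hinge'; the default is the squared hinge). A minimal sketch, with an illustrative value of C; since LinearSVC does not output probabilities, we wrap it in CalibratedClassifierCV, which will be needed later to compute the top-3 accuracy:

from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# L2-regularized hinge loss, as in the objective above
svm = LinearSVC(C=1.0, loss='hinge')

# calibration adds predict_proba on top of LinearSVC
svm_proba = CalibratedClassifierCV(svm, cv=3)
svm_proba.fit(X_train, y_train)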


  • Metrics8

To evaluate a Machine Learning model, we need to define one or more metrics. The dataset we are using is unbalanced, with 34 classes, so looking at accuracy alone can be biased and uninformative. In the case of unbalanced classes, the macro-F1 score is a widely used metric that combines precision and recall.

Let us take a simple example with 3 classes to understand precision, recall and F1-score.

Real Label 1 2 3 2 3 3 1 2 2
Prediction 2 2 1 2 1 3 2 3 2

For each class, we define True Positives (TP), False Positives (FP) and False Negatives (FN). We then compute precision, recall and F1-score for each class and take their average to get the macro scores, or a weighted average (by the occurrence ratio of each class) to get the weighted scores. TP, FP and FN are defined as follows:

TP: the samples of a given class that were predicted with the correct label.

FP: the samples predicted as a given class whose real label is different.

FN: the samples of a given class that received a wrong prediction.

Here is a better view of these quantities using the confusion matrix9

Confusion matrix for multiclass classification

From the example above and for the first column, we have a wrong prediction. This counts as an FP for class 2 and an FN for class 1. For the second column, we got the right prediction so that counts as a TP for class 2.

Here is a summary of TP, FN and FP for each class

  • class 1: TP =0 ; FN = 2 ; FP = 2 ;
  • class 2: TP = 3 ; FN = 1 ; FP = 2 ;
  • class 3: TP = 1 ; FN = 2 ; FP =1.

Then we can compute the precision, recall and F1 score for each class as follows, and take the average over classes to get macro-metrics.

  • Precision P = TP / (TP + FP) measures the classifier’s ability to label as positive only the samples that are really positive.
  • Recall R = TP / (TP + FN) measures the fraction of positive samples that were actually classified as positive.
  • F1 = 2*P*R / ( P + R).

In addition to the F1-score, we will use top-3 accuracy (having the true label among the top 3 predictions) to evaluate the performance of the models.
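To make this concrete, here is a small sketch that checks the toy example above with scikit-learn and computes a top-3 accuracy (the helper function is illustrative):

import numpy as np
from sklearn.metrics import classification_report

y_true = [1, 2, 3, 2, 3, 3, 1, 2, 2]
y_pred = [2, 2, 1, 2, 1, 3, 2, 3, 2]

# class 1: P = 0/2 = 0.00, R = 0/2 = 0.00
# class 2: P = 3/5 = 0.60, R = 3/4 = 0.75
# class 3: P = 1/2 = 0.50, R = 1/3 = 0.33
print(classification_report(y_true, y_pred))

def top3_accuracy(probs, y_true, classes):
    # probs: (n_samples, K) class probabilities; classes: label of each column
    top3 = classes[np.argsort(probs, axis=1)[:, -3:]]
    return np.mean([y in row for y, row in zip(y_true, top3)])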


  • Parameters search and training process

After defining the algorithms and the evaluation metrics, we can start the training and parameter-search process. The goal is to find the hyper-parameters that allow the model to generalize well on the test dataset.

For the SVM, the hyper-parameter is the penalty $C$ of the error term, while for logistic regression it is the regularization strength. We use cross-validation10 with a grid-search approach to find the best parameters. A test set is held out for final evaluation, and the training set is split into K folds. For each model, defined by a value of the hyper-parameter, we apply the following procedure:

  • Train the model on k−1 folds;
  • Compute performances (metrics) on the remaining fold.

This procedure is repeated for each split and the average of performances is reported.

5 folds for parameter search. Ref. [10]

For each value from the grid of hyper-parameters, we will obtain one average metric computed using the cross-validation procedure. The best hyper-parameter is then the one that gives the highest metric. Finally, we evaluate the best model on the testing dataset.

The script sklearn_models/params_search.py implements this procedure with the ability to choose hyper-parameters space and save the logs of the search process.
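A rough sketch of what such a search looks like with scikit-learn's GridSearchCV (the parameter grid below is illustrative; the actual search space is defined in the script):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

search = GridSearchCV(
    LogisticRegression(multi_class='ovr', max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    scoring='f1_macro',  # our main metric
    cv=5)                # 5-fold cross-validation
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)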

#############################################################
# Logistic Regression - Performance of best model on test dataset
            precision    recall  f1-score   support
accuracy                           0.67     20086
macro avg      0.62      0.52      0.55     20086
weighted avg   0.65      0.67      0.65     20086

Top 3 accuracy on test dataset:
0.8648

#############################################################
# SVM - Performance of best model on test dataset
# NB: For SVM, we used CalibratedClassifierCV(LinearSVC(C=***))
# in order to obtain probabilities and compute top3-accuracy

            precision    recall  f1-score   support
accuracy                            0.67     20086
macro avg       0.62      0.53      0.56     20086
weighted avg    0.66      0.67      0.66     20086

Top 3 accuracy on test dataset:
0.8640

The approach described above is a simple way of building a baseline. Other solutions could be considered to improve performance, such as balancing the classes, weighting TF-IDF by embedding vectors, or ensembling different classifiers.

Language Model

After building a baseline, we try to improve performance by using a modern, deep-learning-based approach to text classification. In 2018, J. Howard and S. Ruder introduced a new transfer learning approach for NLP problems called Universal Language Model Fine-tuning for Text Classification (ULMFit)11. ULMFit allows us to build models for a specific NLP task while reusing a language model (LM) pre-trained on a large text corpus.

ULMFit consists of three stages:

ULMFit Overview

1. LM pre-training

The first step of ULMFit consists of training a deep learning language model to predict the next word in a sequence on a large text dataset. In this case, the WikiText-103 dataset, containing 103 million words, was used to train the state-of-the-art language model AWD-LSTM (https://arxiv.org/pdf/1708.02182.pdf). Training the language model is performed only once, and the resulting LM can be reused for different tasks. At the end of this stage, the model has learned the general characteristics of the language and the structure of English sentences as represented in the training dataset.

2. LM fine-tuning

This step is specific to the NLP task we want to solve. Starting from the pre-trained language model, we fine-tune the LM on our target data. The authors suggest discriminative fine-tuning, which uses a specific learning rate for the parameters of each layer, combined with slanted triangular learning rates (STLR), where the learning rate first increases linearly and then decays linearly according to the update schedule given in equation (3) of the original paper11. The architecture of the language model is the following:

[AWD_LSTM(
(encoder): Embedding(vocab_size, 400, padding_idx=1)
(encoder_dp): EmbeddingDropout( (emb): Embedding(vocab_size, 400, padding_idx=1) )
(rnns):
ModuleList(
(0): WeightDropout( (module): LSTM(400, 1152, batch_first=True) )
(1): WeightDropout( (module): LSTM(1152, 1152, batch_first=True) )
(2): WeightDropout( (module): LSTM(1152, 400, batch_first=True) ) )
(input_dp): RNNDropout()
(hidden_dps):
ModuleList(
(0): RNNDropout()
(1): RNNDropout()
(2): RNNDropout() ) ),
LinearDecoder(
(decoder): Linear(in_features=400, out_features=vocab_size, bias=True)
(output_dp): RNNDropout() )
]
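As a rough sketch, LM fine-tuning can be expressed with the fast.ai API as follows (this uses the fastai v2 interface, which may differ from the notebook; column names and hyper-parameters are illustrative):

from fastai.text.all import *

# language-model dataloaders built from the raw text column
dls_lm = TextDataLoaders.from_df(df, text_col='text', is_lm=True, valid_pct=0.1)

# AWD-LSTM pre-trained on WikiText-103
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3)

# the one-cycle schedule plays a role similar to STLR
learn.fit_one_cycle(1, 1e-2)
learn.unfreeze()
learn.fit_one_cycle(3, 1e-3)
learn.save_encoder('ft_enc')  # keep the fine-tuned encoder for the classifier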

3. Classifier fine-tuning

After fine-tuning the LM on our target data, we extend the model with two linear blocks, with a softmax on the last layer in order to obtain a probability distribution over the classes of the dataset. Each of these blocks uses batch normalization, dropout, and a ReLU activation. The resulting classifier is fine-tuned on the target task using gradual unfreezing and STLR. The authors explain in their paper that fine-tuning all layers of the classifier at the same time could result in “catastrophic forgetting”; instead, they suggest gradually unfreezing the layers of the model, starting from the last one.

In the authors’ words: “We first unfreeze the last layer and fine-tune all unfrozen layers for one epoch. We then unfreeze the next lower frozen layer and repeat, until we fine-tune all layers until convergence at the last iteration.”
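A sketch of this gradual unfreezing with the fast.ai API (same caveats as above; the learning-rate values and the 2.6 factor for discriminative learning rates follow the paper's recipe but are illustrative here):

# classifier dataloaders reuse the LM vocabulary
dls_clas = TextDataLoaders.from_df(df, text_col='text', label_col='category',
                                   text_vocab=dls_lm.vocab, valid_pct=0.1)
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('ft_enc')  # encoder fine-tuned in the previous stage

learn.fit_one_cycle(1, 2e-2)                             # last layer only
learn.freeze_to(-2)                                      # unfreeze one more layer
learn.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3 / (2.6**4), 5e-3))
learn.unfreeze()                                         # all layers
learn.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3))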

Detailed overview of ULMFit from [11]

For the implementation, we use the fast.ai12 library, as it provides the necessary building blocks and tools to fine-tune ULMFit. Code for training and saving the model is provided in ulmfit/01_ulmfit_balanced_dataset.ipynb. The model trained on the full dataset can be downloaded from here.

# ULMFit - Performance on test dataset
            precision    recall  f1-score   support
micro avg                           0.73     20086
macro avg       0.66      0.61      0.63     20086
weighted avg    0.72      0.73      0.72     20086

Top 3 accuracy on test dataset:
0.9044

Deployment

To bring the trained models to the user, we use Flask alongside Gunicorn to build a simple API that takes a headline (sampled from the test dataset or entered by the user) and outputs the top 3 predictions. Inference is done on the CPU.

  • For packaging the sklearn models, we use Joblib to export the trained models. For inference, we create a model package, app/ml_models/model_package_sklearn.py, that takes the raw text, performs all the preprocessing (as done during training), and then generates the predictions.

  • For ULMFit, we also create a model package, app/ml_models/model_package_ulmfit.py, that loads the pre-trained model, performs the necessary preprocessing on the raw text, and then returns the associated top 3 predictions.
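A minimal sketch of the Flask API (the ModelPackage interface and its predict_top3 method are hypothetical stand-ins for the packages above):

from flask import Flask, jsonify, request

# hypothetical interface; the real packages live in app/ml_models/
from app.ml_models.model_package_sklearn import ModelPackage

app = Flask(__name__)
model = ModelPackage.load('app/ml_models/model.joblib')  # hypothetical loader

@app.route('/predict', methods=['POST'])
def predict():
    text = request.get_json()['headline']
    # preprocessing and top-3 prediction happen inside the package
    return jsonify(model.predict_top3(text))

# served with, e.g.: gunicorn -w 2 -b 0.0.0.0:8000 app:app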

We use Docker to create a container for the application and then deploy it on a cloud instance or a serverless container service such as Google Cloud Run.

Details of deployment are given in the Github repository https://github.com/imadelh/NLP-news-classification

Conclusion

In this post, we built a Machine Learning model for text classification using a classic approach, TF-IDF with a linear classifier, and then illustrated how to improve performance with a more modern model, ULMFit. The solutions suggested in this post can be seen as the first iteration of a Machine Learning project; a more elaborate study of the problem, as well as other algorithms, could help achieve better performance. Finally, here is a list of other problems that NLP can address: text classification and categorization (this post), named entity recognition, part-of-speech tagging, semantic parsing and question answering, paraphrase detection, language generation and multi-document summarization, machine translation, speech recognition, character recognition, and spell checking.

References