Christophe Pere is a senior NLP researcher and a Deepflow advisor. His post was originally published on Medium.


At the beginning, there was a simple problem. My manager came to me to ask if we could classify emails and their associated documents with NLP methods.

It doesn't sound very funky, but I had thousands of samples to start with. The first thing I was asked was to use “XGBoost” because: “We can do everything with XGBoost”. An interesting job, if data science comes down to XGBoost…

After implementing a notebook with different algorithms, metrics and visualizations, something was still on my mind. I couldn't choose between the different models in one run, simply because you can get lucky with one model and not know how to reproduce the good accuracy, precision, etc.

So at that moment I asked myself: how do you do model selection? I looked on the web, read posts, articles and so on. It was very interesting to see the different ways of implementing this kind of thing. But it was blurry when it came to neural networks. At that point I had one thing in mind: how to compare classical methods (multinomial Naïve Bayes, SVM, logistic regression, boosting…) and neural networks (shallow, deep, LSTM, RNN, CNN…).

I present here a short explanation of the notebook. Comments are welcome.

The notebook is available on GitHub: here
The notebook is available on Colab: here

How to start?

Every project starts with an exploratory data analysis (EDA for short), followed directly by preprocessing (the texts were very dirty: signatures in emails, URLs, email headers, etc.). The different functions are available in the GitHub repository.

A quick way to see if the preprocessing is correct is to determine the most common n-grams (unigrams, bigrams, trigrams…). Another post will guide you through this.


We will apply the model selection method to the IMDB dataset. If you are not familiar with it, the IMDB dataset contains movie reviews (text) for sentiment analysis (binary: positive or negative).
More details can be found here. To download it:

$ wget

$ tar -xzf aclImdb_v1.tar.gz

Vectorizing methods

One-Hot encoding (Countvectorizing):
This method replaces words with vectors of 0s and 1s. The goal is to take a corpus (a large volume of words) and build a vector with one entry for each unique word contained in the corpus. Each word is then projected onto this vector, where 0 indicates absence and 1 indicates presence.

       | bird | cat | dog | monkey |
bird   |  1   |  0  |  0  |    0   |
cat    |  0   |  1  |  0  |    0   |
dog    |  0   |  0  |  1  |    0   |
monkey |  0   |  0  |  0  |    1   |

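The corresponding Python code:

from sklearn.feature_extraction.text import CountVectorizer

# create a count vectorizer object and fit it on the text column
# (`df` is an assumed name for the DataFrame holding the corpus; TEXT is the column name)
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(df[TEXT])  # text without stopwords

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)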
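
To make the encoding concrete, here is a minimal sketch on a toy corpus (the two sentences are invented for illustration). Note that CountVectorizer stores word counts, so a value can exceed 1; passing binary=True would give the strict 0/1 encoding described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only
corpus = ["the cat chased the bird", "the dog chased the cat"]

vectorizer = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}")
counts = vectorizer.fit_transform(corpus)  # learn the vocabulary, then count words

vocabulary = sorted(vectorizer.vocabulary_)  # columns are in alphabetical order
matrix = counts.toarray().tolist()           # one row per document

print(vocabulary)  # ['bird', 'cat', 'chased', 'dog', 'the']
print(matrix)      # [[1, 1, 1, 0, 2], [0, 1, 1, 1, 2]]
```

Each row is a document and each column a word of the vocabulary; "the" appears twice in both sentences, hence the 2 in the last column.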

TF-IDF:
Term frequency-inverse document frequency is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus (source: tfidf).
This method is powerful when dealing with a large number of stopwords (words that carry little information, such as I, me, my, myself, we, our, ours, ourselves, you… in English). The idf term brings out the important and rare words.

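from sklearn.feature_extraction.text import TfidfVectorizer

# word level tf-idf (`df` is an assumed name for the DataFrame holding the corpus)
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=10000)
tfidf_vect.fit(df[TEXT])
xtrain_tfidf = tfidf_vect.transform(train_x_sw)
xvalid_tfidf = tfidf_vect.transform(valid_x_sw)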
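
As a small illustration of how the idf term favours rare words, the sketch below fits a TfidfVectorizer on three invented sentences and reads the learned idf weights. With scikit-learn's default smoothing the weight is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the word.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only
corpus = ["the movie was great", "the movie was awful", "the plot was thin"]

tfidf = TfidfVectorizer(analyzer="word", token_pattern=r"\w{1,}")
tfidf.fit(corpus)

# Map every word to its learned inverse document frequency
idf = {term: float(tfidf.idf_[index]) for term, index in tfidf.vocabulary_.items()}

# "the" is in all 3 documents, "movie" in 2, "great" in only 1
for term in ("the", "movie", "great"):
    print(term, round(idf[term], 3))
```

The word present in every document gets the lowest weight, and the word present in a single document gets the highest.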

TF-IDF n-grams:
Unlike the previous tf-idf, which is based on single words, tf-idf n-grams take n successive words into account.

# ngram level tf-idf (`df` is an assumed name for the DataFrame holding the corpus)
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=10000)
tfidf_vect_ngram.fit(df[TEXT])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x_sw)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x_sw)

TF-IDF chars n-grams:
Same as the previous method, but at the character level: the method works on n successive characters.

# characters level tf-idf (`df` is an assumed name for the DataFrame holding the corpus)
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=10000)
tfidf_vect_ngram_chars.fit(df[TEXT])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x_sw)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x_sw)
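
To see which features ngram_range=(2,3) actually produces, here is a short sketch on invented inputs. It uses CountVectorizer only to list the extracted n-grams; TfidfVectorizer extracts the same n-grams and then weights them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Word-level bigrams and trigrams of a toy sentence
word_ngrams = CountVectorizer(analyzer="word", token_pattern=r"\w{1,}", ngram_range=(2, 3))
word_ngrams.fit(["not a good movie"])
word_features = sorted(word_ngrams.vocabulary_)
print(word_features)
# ['a good', 'a good movie', 'good movie', 'not a', 'not a good']

# Character-level bigrams and trigrams of a toy word
char_ngrams = CountVectorizer(analyzer="char", ngram_range=(2, 3))
char_ngrams.fit(["cat"])
char_features = sorted(char_ngrams.vocabulary_)
print(char_features)
# ['at', 'ca', 'cat']
```

The word-level n-grams keep local context such as "not a good", which single words lose; the character-level n-grams are more tolerant of typos and unseen word forms.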

Pre-trained model — FastText

FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. It works on standard, generic hardware. Models can later be reduced in size to even fit on mobile devices. (source: here)

How to load fastText? From the official documentation:

$ git clone
$ cd fastText
$ sudo pip install .
$ # or :
$ sudo python setup.py install

Download the right model. You have models for 157 languages here. To download the English model:

$ wget

& unzip

When the download is complete, load it in Python:

pretrained = fasttext.FastText.load_model('crawl-300d-2M-subword.bin')

Word Embeddings or Word vectors (WE):

Another popular and powerful way to associate a vector with a word is the use of dense “word vectors”, also called “word embeddings”. While the vectors obtained through one-hot encoding are binary, sparse (mostly made of zeros) and very high-dimensional (same dimensionality as the number of words in the vocabulary), “word embeddings” are low-dimensional floating point vectors (i.e. “dense” vectors, as opposed to sparse vectors). Unlike word vectors obtained via one-hot encoding, word embeddings are learned from data. It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when dealing with very large vocabularies. On the other hand, one-hot encoding words generally leads to vectors that are 20,000-dimensional or higher (capturing a vocabulary of 20,000 tokens in this case). So, word embeddings pack more information into far fewer dimensions. (source: Deep Learning with Python, François Chollet 2017)

How to map a sentence to integers:

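# import paths assume TensorFlow 2.x
import numpy as np
from tqdm import tqdm
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.text import Tokenizer

# create a tokenizer and learn the vocabulary
# (the fit step was missing; fitting on train_x is an assumption)
token = Tokenizer()
token.fit_on_texts(train_x)
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=300)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=300)

# create token-embedding mapping: row i holds the 300-d fastText vector of word i
embedding_matrix = np.zeros((len(word_index) + 1, 300))
for word, i in tqdm(word_index.items()):
    embedding_vector = pretrained.get_word_vector(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector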
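
For readers without Keras at hand, the pure-Python sketch below mimics what the tokenizer and the padding step do by default: the most frequent word gets index 1, index 0 is reserved for padding, and sequences are padded and truncated on the left. The helper names build_word_index and to_padded_sequence are hypothetical, and the sketch only lowercases and splits on whitespace, whereas the Keras Tokenizer also strips punctuation.

```python
from collections import Counter


def build_word_index(texts):
    """Assign an integer to every word, most frequent first, starting at 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequence(text, word_index, maxlen):
    """Turn a sentence into a fixed-length list of integers."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-maxlen:]                     # too long: keep the last maxlen tokens
    return [0] * (maxlen - len(ids)) + ids  # too short: left-pad with zeros


# Toy corpus, invented for illustration only
word_index = build_word_index(["good movie", "good plot twist"])
short = to_padded_sequence("good twist", word_index, maxlen=4)
long_ = to_padded_sequence("a good plot twist good movie", word_index, maxlen=4)

print(word_index)  # {'good': 1, 'movie': 2, 'plot': 3, 'twist': 4}
print(short)       # [0, 0, 1, 4]
print(long_)       # [3, 4, 1, 2]
```

Unknown words (here "a") are simply skipped, and row i of the embedding matrix is then the vector of the word whose index is i.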

Model Selection

What is model selection in computer science, and specifically in the field of AI? Model selection is the process of choosing between different machine learning approaches. In short, between different models.

But how can we compare them? To do that we need metrics (see this link for more details). The dataset will be split into train and test parts (the validation set will be determined in the deep learning models).

What sort of metrics do we use in this binary classification model selection?

For classification we will use the terms:
- tp: True positive prediction
- tn: True negative prediction
- fp: False positive prediction
- fn: False negative prediction
Here is a link for more details.

  • Accuracy: All correct predictions over all predictions
    (tp + tn) / (tp + tn + fp + fn)
  • Balanced Accuracy: The average of the recall obtained on each class, useful for imbalanced datasets
  • Precision: The precision is intuitively the ability of the classifier not to label as positive a sample that is negative -> tp / (tp + fp)
  • Recall (or sensitivity): The recall is intuitively the ability of the classifier to find all the positive samples -> tp / (tp + fn)
  • f1-score: The f1 score is the harmonic mean of the precision and recall -> 2 * (precision * recall) / (precision + recall)
  • Cohen kappa: It’s a score that expresses the level of agreement between two annotators on a classification problem. As a rule of thumb, a value below 0.4 is pretty bad, between 0.4 and 0.6 it’s equivalent to human level, 0.6 to 0.8 is a great value, and above 0.8 is exceptional.
  • Matthews Correlation Coefficient: The Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications […] The coefficient takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation. (source: Wikipedia)
  • Area Under the Receiver Operating Characteristic Curve (ROC AUC)
  • Time fit: The time needed to train the model
  • Time Score: The time needed to predict results
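
The formulas above can be checked by hand with a few lines of plain Python. This is a minimal sketch on an invented confusion matrix; the function name binary_metrics is hypothetical, and it does not guard against empty classes (division by zero).

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute the classification metrics listed above from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # recall of the positive class
    specificity = tn / (tn + fp)     # recall of the negative class
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Cohen kappa: observed agreement corrected by the agreement expected by chance
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "cohen_kappa": kappa,
        "mcc": mcc,
    }


# Invented counts: 100 reviews, 50 positive and 50 negative
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy and balanced accuracy are both 0.85, the f1-score is about 0.842, Cohen kappa is 0.70 and the MCC is about 0.704.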

Great!! We have our metrics. What’s next?


To make a robust comparison between models, we need to validate the robustness of each model.

The figure below shows that the global dataset needs to be separated into train and test data: train data to train the model and test data to test the model. Cross-validation is the process of separating the dataset into k folds, where k is the number of parts the data is divided into.

Source: sklearn

Generally, k is 5 or 10; it depends on the size of the dataset (small dataset, small k; big dataset, big k).

The goal is to compute each metric on each fold and then compute their average (mean) and standard deviation (std).

In Python, this process is done with the cross_validate function in scikit-learn.
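
To make the fold mechanics concrete, here is a small sketch that splits ten toy samples into k = 5 folds with scikit-learn's KFold. It only shows which indices land in each fold; no model is trained.

```python
from sklearn.model_selection import KFold

samples = list(range(10))   # ten toy samples, identified by their index
kfold = KFold(n_splits=5)   # no shuffling, so the folds are contiguous

test_folds = []
for train_idx, test_idx in kfold.split(samples):
    test_folds.append(test_idx.tolist())
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())

# Every sample is used for testing exactly once across the 5 folds
print(test_folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Each fold is used once for testing while the other four are used for training, so every metric is measured five times.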

What models will we compare?

Models in the running

We will test machine learning models, deep learning models and NLP specialised models.

Machine Learning Models

  • Multinomial Naïve Bayes (NB)
  • Logistic Regression (LR)
  • SVM (SVM)
  • Stochastic Gradient Descent (SGD)
  • k-Nearest-Neighbors (kNN)
  • RandomForest (RF)
  • Gradient Boosting (GB)
  • XGBoost (the famous) (XGB)

Deep Learning Models

  • Shallow Neural Network
  • Deep neural network (and 2 variations)
  • Recurrent Neural Network (RNN)
  • Long Short Term Memory (LSTM)
  • Convolutional Neural Network (CNN)
  • Gated Recurrent Unit (GRU)
  • Bidirectional RNN
  • Bidirectional LSTM
  • Bidirectional GRU
  • Recurrent Convolutional Neural Network (RCNN) (and 3 variations)

That’s all, but 25 models is not bad.

Let’s look at a little code

Machine Learning

I will use the cross_validate() function in sklearn (version 0.23) for the classic algorithms, to take multiple metrics into account. The function, called report, takes a classifier, the X, y data and a custom list of metrics, and computes the cross-validation on them. It returns a dataframe containing the values of all the metrics, plus the mean and the standard deviation (std) of each of them.

How to use it?

Here is an example for multinomial Naive Bayes:

The condition if multinomial_naive_bayes is present because this code is part of the notebook, which has (boolean) parameters at the beginning. All the code is available on GitHub and Colab.
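
The notebook holds the actual implementation; the sketch below is only a guess at what such a report function could look like, built on cross_validate and pandas. The signature, the fold labels and the toy corpus are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB


def report(clf, X, y, metrics, cv=5):
    """Cross-validate clf and return per-fold scores plus their mean and std."""
    scores = cross_validate(clf, X, y, scoring=metrics, cv=cv)
    folds = pd.DataFrame(scores)  # one row per fold, one column per metric
    folds.index = [f"fold_{i + 1}" for i in range(len(folds))]
    # std here is the sample standard deviation (pandas default, ddof=1)
    summary = pd.DataFrame([folds.mean(), folds.std()], index=["mean", "std"])
    return pd.concat([folds, summary])


# Toy corpus, invented for illustration only: 10 positive and 10 negative reviews
texts = ["great fun movie", "awful boring movie"] * 10
y = [1, 0] * 10
X = CountVectorizer().fit_transform(texts)

multinomial_naive_bayes = True  # notebook-style boolean switch
if multinomial_naive_bayes:
    nb_report = report(MultinomialNB(), X, y, ["accuracy", "f1"])
    print(nb_report)
```

The returned dataframe has one row per fold followed by a mean row and a std row, with the fit and score times alongside the requested metrics.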

Deep Learning

I haven’t found a function like cross_validate for deep learning, only posts about using k-fold cross-validation for neural networks. Here I will share a custom cross_validate function for deep learning with the same inputs and outputs as the report function. It makes it possible to have the same metrics and to compare all the models together.

The goal is to take a function that builds a neural network and call it inside the cross_validate_NN function. All the deep learning models receive the same treatment and can therefore be compared. For the full implementation of the different models, go to the notebook.
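
As a rough sketch of the idea (not the notebook's cross_validate_NN), the function below takes a model-building function, trains a fresh model on every fold and collects the same kind of table as for the classic algorithms. To keep the example light, scikit-learn's MLPClassifier stands in for a Keras network; the names cross_validate_nn and build_shallow_network are hypothetical. A Keras model would also need its predicted probabilities thresholded into class labels before scoring.

```python
import time

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.neural_network import MLPClassifier


def cross_validate_nn(build_fn, X, y, metrics, n_splits=5, seed=0):
    """k-fold cross-validation for any model returned by build_fn()."""
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for train_idx, test_idx in folds.split(X, y):
        model = build_fn()  # fresh, untrained model for every fold
        start = time.perf_counter()
        model.fit(X[train_idx], y[train_idx])
        row = {"fit_time": time.perf_counter() - start}
        y_pred = model.predict(X[test_idx])
        for name, metric in metrics.items():
            row["test_" + name] = metric(y[test_idx], y_pred)
        rows.append(row)
    per_fold = pd.DataFrame(rows)
    summary = pd.DataFrame([per_fold.mean(), per_fold.std()], index=["mean", "std"])
    return pd.concat([per_fold, summary])


def build_shallow_network():
    # Stand-in for a function that builds and compiles a Keras model
    return MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)


# Synthetic data, used only to exercise the function
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
nn_report = cross_validate_nn(
    build_shallow_network,
    X,
    y,
    {"accuracy": accuracy_score, "precision": precision_score, "recall": recall_score},
)
print(nn_report)
```

Building the model inside the loop matters: reusing one network across folds would let weights learned on earlier folds leak into later ones.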


Once all the models have computed the different folds and metrics, we can easily compare them with the dataframe. On the IMDB dataset, the best performing model is:

Here I only show the results for models with accuracy > 80%, and only for the accuracy, precision and recall metrics.

How to improve?

  • Build a function that takes a dictionary of models as argument and runs all the work in a loop
  • Distributed Deep Learning
  • Use TensorNetwork to accelerate Neural Networks
  • Use GridSearch for hyperparameter tuning
  • Implement Transformers with HuggingFace
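
The first item of the list could be sketched as follows. This is one possible shape for such a loop, not code from the notebook; the function name compare_models, the choice of models and the synthetic data are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier


def compare_models(models, X, y, metrics, cv=5):
    """Cross-validate every model in the dictionary and keep (mean, std) per metric."""
    results = {}
    for name, clf in models.items():
        scores = cross_validate(clf, X, y, scoring=metrics, cv=cv)
        results[name] = {
            key: (float(values.mean()), float(values.std()))
            for key, values in scores.items()
        }
    return results


# Synthetic data, used only to exercise the loop
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
models = {
    "LR": LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
}
results = compare_models(models, X, y, ["accuracy", "precision", "recall"])

for name, scores in results.items():
    mean, std = scores["test_accuracy"]
    print(f"{name}: accuracy = {mean:.3f} +/- {std:.3f}")
```

Adding a model to the comparison then only means adding an entry to the dictionary.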