Christophe Pere is a senior NLP researcher and a Deepflow advisor. His post was originally published on Medium.


In my previous article (Model Selection in Text Classification), I presented a way to select a model by comparing classical machine learning and deep learning on a binary text classification problem.

That notebook is structured to run all the algorithms automatically with cross-validation and to show the results for the different metrics, leaving the user free to select the algorithm that fits their needs.

Here, the notebook was created to show the different metrics and the learning curves for a binary or multiclass problem. The notebook automatically determines which of the two configurations the dataset is in.

The notebook and scripts are available here: GitHub


The goal is to use different models and see how they behave during training and testing. When possible, the algorithms implement early stopping to avoid overfitting. For deep learning, a pretrained model can be used during training to reduce training time.
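To make the early stopping idea concrete, here is a minimal sketch of a patience-based rule in plain Python. The function name early_stopping_epoch and the min_delta parameter are assumptions made for illustration; they are not taken from the notebook, which relies on the early stopping built into the libraries it uses.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs. If that never
    happens, the index of the last epoch is returned.
    """
    best = float("inf")
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical validation losses: the loss stops improving after epoch 2.
losses = [0.72, 0.65, 0.60, 0.61, 0.62, 0.63, 0.50]
stop_at = early_stopping_epoch(losses, patience=3)
```

With these made-up losses, the rule stops before the late improvement at the last epoch, which is the usual trade-off of a small patience value.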

Two things are important at the beginning:
- The name of the text column to classify
- The name of the label column

The pipeline was built to handle binary or multiclass classification without a human in the loop. It extracts the number of labels and determines whether the problem is binary or multiclass. All the algorithms and metrics switch from one setting to the other automatically.
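The detection step can be sketched in a few lines. This is only an illustration of the idea, assuming the labels are available as a plain list; detect_problem_type is a hypothetical helper name, not a function from the notebook.

```python
def detect_problem_type(labels):
    """Return 'binary' or 'multiclass' depending on the number of distinct labels."""
    n_classes = len(set(labels))
    if n_classes < 2:
        raise ValueError("Need at least two distinct labels to classify.")
    return "binary" if n_classes == 2 else "multiclass"


# Labels as they might be read from the label column (hypothetical values).
problem = detect_problem_type(["positive", "negative", "positive"])
```

The returned string can then drive the choices that differ between the two settings, such as the loss function or the averaging mode of the metrics.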

The notebook begins with a list of parameters used to select the models you want to test. The gist below shows this list. Lang determines whether the notebook needs to call the Translator() API from Google to detect the language of the dataset (default: English).

The corresponding Python function:

I — Data cleaning, text processing

The data used in this post are the same as in the previous article: the IMDB dataset (because it’s an open source dataset, more details here).

The dataset is not perfectly clean and, in a Natural Language Processing task, you need to clean your data for the problem you have. This step will influence the performance of your algorithms.

Concerning the IMDB dataset, what have I done?

  • Lowercase every letter of a word except the first (important for NER extraction)
  • Remove URLs (if present)
  • Remove HTML tags
  • Remove emojis
  • Replace the contraction ’ve with have
  • Replace the contraction n’t with not
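The steps above can be sketched with the standard re module. This is a minimal approximation and not the notebook's own cleaning function: the name clean_text, the regular expressions, and the emoji ranges are all assumptions, and the emoji pattern only covers the most common Unicode blocks.

```python
import re

HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "]+"
)


def clean_text(text):
    """Apply the cleaning steps listed above to a single review."""
    text = HTML_TAG_RE.sub(" ", text)   # strip HTML tags such as <br />
    text = URL_RE.sub(" ", text)        # strip URLs
    text = EMOJI_RE.sub(" ", text)      # strip emojis
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    # Naive contraction handling: "don't" -> "do not", but "can't" -> "ca not".
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"'ve\b", " have", text)
    # Keep the first letter of each word as is and lowercase the rest.
    text = re.sub(r"\b(\w)(\w*)", lambda m: m.group(1) + m.group(2).lower(), text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("I LOVED it<br />but I don't think you've seen http://example.com 😀")
```

HTML tags are removed before URLs so that a link inside a tag does not swallow the text that follows it.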

II — Polarity

The polarity of the text is estimated with the TextBlob library and plotted in diagrams (computed on a sample of 5,000 rows).

III — Text information

The reviews do not all have the same length or the same number of words. It’s interesting to study this information:

  • Extract the number of words
  • Extract the number of characters
  • Compute the density (number of characters / number of words)
  • Count the words that start with a capital letter
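As an illustration, the four statistics can be computed with plain string operations. The function name text_features and the dictionary keys are assumptions for this sketch; the notebook may compute them differently (for example column by column on a DataFrame).

```python
def text_features(text):
    """Compute simple length statistics for one review."""
    words = text.split()
    n_words = len(words)
    n_chars = len(text)  # counts spaces and punctuation as characters
    return {
        "word_count": n_words,
        "char_count": n_chars,
        # Density as defined above: characters per word (0.0 for empty text).
        "density": n_chars / n_words if n_words else 0.0,
        "capitalized_count": sum(1 for w in words if w[0].isupper()),
    }


features = text_features("The Movie was great")
```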

We can easily plot the number of characters:

The reviews are mostly made of 550–600 characters.

It’s also interesting to look at the distribution of classes (labels).

The dataset is balanced, with ~2,500 reviews per label (binary).

IV — N-grams

N-grams are a technique that cuts sentences into tokens of n consecutive words. For example:

I learn machine learning to become a data scientist

  • Unigrams: [I, learn, machine, learning, to, become, a, data, scientist]
  • Bigrams: [(I learn), (learn machine), (machine learning), (learning to), (to become), (become a), (a data), (data scientist)]
  • Trigrams: [(I learn machine), (learn machine learning), (machine learning to), (learning to become), (to become a), (become a data), (a data scientist)]
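The lists above can be reproduced with a short sliding-window function. This is a minimal sketch using whitespace tokenization; ngrams and top_ngrams are hypothetical helper names, not functions from the notebook.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k=3, stopwords=frozenset()):
    """Count n-grams over several texts, optionally dropping stopwords first."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in stopwords]
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


tokens = "I learn machine learning to become a data scientist".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
```

A sentence of 9 words yields 8 bigrams and 7 trigrams, which matches the lists above.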

I’m a fan of n-grams because they show me whether the preprocessing was correct. In the notebook, you’ll find the word importance for unigrams, bigrams, and trigrams, with and without stopwords (very common words that carry little meaning), and a test on 5-grams.

For trigrams with stopwords:

And without stopwords:

The difference is significant. Text generally contains a lot of stopwords. This matters little for methods like TF-IDF or word embeddings, but it does matter for One-Hot encoding, because every word then has the same importance in a sentence.
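To see why TF-IDF is less sensitive to stopwords, here is a small sketch that computes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, by hand. The three toy documents and the function name idf_weights are assumptions for illustration; the notebook uses a library vectorizer rather than this code.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Return a smoothed IDF weight for every token seen in `docs`."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc.lower().split()))
    return {
        tok: math.log((1 + n_docs) / (1 + df)) + 1
        for tok, df in doc_freq.items()
    }


docs = ["the movie was great", "the plot was dull", "the acting was great"]
weights = idf_weights(docs)
```

A word such as "the", which appears in every document, gets the lowest possible weight, while a rarer word such as "dull" gets a higher one. With One-Hot encoding, both would count the same.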

What models and metrics do we use?

The pipeline is configured to use different models. Table 1 presents the machine learning and deep learning algorithms and the metrics used.

How it looks

I will show you two examples: first Stochastic Gradient Descent, then a Shallow Neural Network.

The code to train a classical classifier is as follows:

The function takes a classifier and the data used to fit the model as input.
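A training helper of this kind might look like the sketch below. It is not the author's classifier_model(): the signature and return values are assumptions, and MajorityClassifier is a toy stand-in used only to make the example self-contained. Any object with fit and predict methods would work in its place.

```python
import time
from collections import Counter


def classifier_model(classifier, train_x, train_y, valid_x, valid_y):
    """Fit `classifier`, predict on the validation set, and report accuracy."""
    start = time.perf_counter()
    classifier.fit(train_x, train_y)
    elapsed = time.perf_counter() - start
    predictions = list(classifier.predict(valid_x))
    correct = sum(p == t for p, t in zip(predictions, valid_y))
    accuracy = correct / len(valid_y)
    print(f"Execution time : {elapsed:.3f} s")
    print(f"Score : {accuracy * 100:.1f} %")
    return predictions, accuracy


class MajorityClassifier:
    """Toy stand-in that always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_] * len(X)


predictions, accuracy = classifier_model(
    MajorityClassifier(), ["a", "b", "c"], [1, 1, 0], ["d", "e"], [1, 0]
)
```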

For the metrics, the same kind of function has been created:

This function shows the different curves (precision-recall, true/false positive rate, ROC AUC), the confusion matrix, Cohen’s kappa (the agreement between the model’s predictions and the true labels, corrected for chance, in the same way agreement between two annotators is measured) and the accuracy.

Stochastic Gradient Descent:

The best result for the SGD algorithm (Function 1) was obtained with the TF-IDF method.
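Two of these metrics are easy to compute by hand, which helps when reading the reports further down. The sketch below uses made-up labels and hypothetical function names; it follows the standard definition kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
def accuracy_score(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between two labelings, corrected for chance."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_observed = accuracy_score(y_true, y_pred)
    # Chance agreement: probability that both labelings pick the same class
    # if each one assigned classes independently at its own observed rates.
    p_expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n) for label in labels
    )
    # Undefined (division by zero) when both labelings use a single class.
    return (p_observed - p_expected) / (1 - p_expected)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
kappa = cohen_kappa(y_true, y_pred)
```

Here the accuracy is 0.75, but half of that agreement would be expected by chance on a balanced binary problem, so kappa is only 0.5.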

from sklearn.linear_model import SGDClassifier

if sgd:  # run only when the sgd flag is True
    print("\nStochastic Gradient Descent with early stopping for TF-IDF\n")
    print("Early Stopping : 10 iterations without change")
    sgd_clf = SGDClassifier(loss='modified_huber', max_iter=1000, tol=1e-3,
                            n_iter_no_change=10, early_stopping=True, n_jobs=-1)
    metrics_ML(sgd_clf, xtrain_tfidf, train_y, xvalid_tfidf, valid_y, gb=True)

The function metrics_ML() calls the function classifier_model() to train the model and then computes the metrics. This is the easiest way to train a classifier and get its metrics.

The results are:

Stochastic Gradient Descent with early stopping for TF-IDF

Early Stopping : 10 iterations without change
Execution time : 0.060 s
Score : 84.7 %

Classification Report

              precision    recall  f1-score   support

    negative       0.87      0.81      0.84       490
    positive       0.83      0.88      0.85       510

    accuracy                           0.85      1000
   macro avg       0.85      0.85      0.85      1000
weighted avg       0.85      0.85      0.85      1000
Model: f1-score=0.855 AUC=0.923
ROC AUC=0.919

Not bad, with a score of 84.7%. Can we do better?

Shallow Neural Network:

The code for the Shallow Neural Network was presented in the previous article. Here it is again:

Each of the deep learning algorithms has been implemented in the same manner. All the code can be found in the notebook and the corresponding GitHub repository.

How to use it:

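import tensorflow as tf

# Stop training after 3 epochs without improvement of the validation loss
es = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='auto', patience=3)
if shallow_network:
    model_shallow = shallow_neural_networks(word_index, pre_trained=pre_trained)
    # train_seq_x: training sequences (name assumed by analogy with valid_seq_x)
    history = model_shallow.fit(train_seq_x, train_y,
                    epochs=1000, callbacks=[es],
                    validation_split=0.2, verbose=True)
    results = model_shallow.evaluate(valid_seq_x, valid_y)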


Train on 3200 samples, validate on 800 samples
Epoch 1/1000
3200/3200 [==============================] - 2s 578us/sample - loss: 0.7117 - accuracy: 0.4837 - val_loss: 0.7212 - val_accuracy: 0.5175
Epoch 98/1000
3200/3200 [==============================] - 1s 407us/sample - loss: 0.4991 - accuracy: 0.9991 - val_loss: 0.5808 - val_accuracy: 0.8850
1000/1000 [==============================] - 0s 383us/sample - loss: 0.5748 - accuracy: 0.8590

The metrics:

History of the Shallow Neural Network
                precision    recall  f1-score   support

    negative       0.86      0.85      0.86       490
    positive       0.86      0.87      0.86       510

    accuracy                           0.86      1000
   macro avg       0.86      0.86      0.86      1000
weighted avg       0.86      0.86      0.86      1000

The balanced accuracy is : 85.88%

The Zero-one Loss is : 14.1%

Explained variance score: 0.436

ROC AUC=0.931
Model: f1-score=0.863 AUC=0.932
Cohen's kappa: 71.78%

So, the result is better with the neural network. However, the comparison cannot be made on a single run of each method. To correctly select models, you must use evaluation methods such as cross-validation (see: Model Selection in Text Classification).
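For reference, the core of k-fold cross-validation is just a way of splitting the sample indices so that every sample is used for validation exactly once. The sketch below shows that splitting and a summary of the per-fold scores; the function names are assumptions, the indices are not shuffled, and the scores are placeholders rather than results from the notebook.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, valid_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample.
        size = base + (1 if fold < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def summarize(scores):
    """Return the mean and (population) standard deviation of fold scores."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance ** 0.5


folds = list(k_fold_indices(10, 3))
mean_score, std_score = summarize([0.8, 0.9])  # placeholder scores
```

Comparing two models on the mean and spread of their fold scores is more reliable than comparing a single run of each, which is the point made above.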


The pipeline is automatic: you just have to configure the parameters at the beginning, choose the algorithms you want to test, and wait for the results.

The goal is to show the different metrics by algorithm and method (One-Hot encoding, TF-IDF, TF-IDF n-grams, TF-IDF char n-grams and word embeddings) so that you can select the class of algorithms to use for your problem. The next step will be to tune the hyperparameters and enjoy the results.

This work can help you quickly test NLP use cases for text classification, binary or multiclass, without prior knowledge of the classes. The pipeline can handle French or English texts.

The notebook and the classes are available on GitHub.

Next steps

  • Implement methods for imbalanced data to automatically balance a dataset
  • Implement a Transformer classification model
  • Implement pre-trained Transformers
  • Test NLP with Reinforcement Learning
  • Knowledge Graph
  • Use distributed Deep Learning
  • Use TensorNetwork to accelerate Neural Networks
  • Select a class of models with the right method and tune the hyperparameters
  • Use Quantum NLP (QNLP)