Christophe Pere is a senior NLP researcher and a Deepflow advisor. His post was originally published on Medium. Cover picture: Miriam Espacio — Pexels

A notebook containing all the code is available on GitHub; there you'll find code to generate different types of datasets and neural networks to test the loss functions.

After the post about activation functions, we will dive into the second building block: the loss, or objective, function for neural networks.

To understand what a loss function is, here is a quote about the learning process:

A way to measure whether the algorithm is doing a good job — This is necessary to determine the distance between the algorithm’s current output and its expected output. The measurement is used as a feedback signal to adjust the way the algorithm works. This adjustment step is what we call learning.
François Chollet, Deep learning with Python (2017), Manning, chapter 1 p.6

The loss function is the function that computes the distance between the current output of the algorithm and the expected output. It's a method to evaluate how well your algorithm models the data. Loss functions fall into two groups: one for classification (discrete values: 0, 1, 2…) and one for regression (continuous values).

What are the commonly used loss functions to train a Neural Network?


  • Cross-entropy
  • Log loss
  • Exponential Loss
  • Hinge Loss
  • Kullback Leibler Divergence Loss
  • Mean Square Error (MSE — L2)
  • Mean Absolute Error (MAE — L1)
  • Huber Loss

All the code needs these packages:

%matplotlib inline
import keras.backend as K
import numpy as np
import matplotlib.pyplot as plt



Cross-Entropy

This function comes from information theory, where the goal is to measure the difference between two probability distributions in terms of the average number of bits needed to encode their events. Cross-entropy, like the Log Loss function (not identical, but they measure the same thing), computes the difference between two probability distribution functions.

Entropy is the number of bits required to transmit a randomly selected event from a probability distribution. A skewed distribution has low entropy, whereas a distribution where events have equal probability has larger entropy.
In information theory, we like to describe the “surprise” of an event. Low-probability events are more surprising and therefore carry a larger amount of information, while distributions where the events are equally likely keep us maximally uncertain and have larger entropy.
- Skewed probability distribution (unsurprising): low entropy.
- Balanced probability distribution (surprising): high entropy.
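These two cases can be checked numerically. A minimal sketch (the two distributions below are made-up examples):

```python
import numpy as np

def entropy(p):
    # Shannon entropy in bits; zero-probability events contribute nothing
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

balanced = [0.25, 0.25, 0.25, 0.25]  # all events equally likely
skewed = [0.97, 0.01, 0.01, 0.01]    # one event dominates

print(entropy(balanced))  # 2.0 bits, the maximum for 4 events
print(entropy(skewed))    # ~0.24 bits
```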

Cross-entropy is the class of loss functions most used in machine learning for classification, because it leads to models that generalize better and train faster.

Cross-entropy can be used with binary and multiclass classification problems (many classes with a single label per example, as opposed to multilabel classification, where one example can carry several labels).

Types of cross-entropy:

  • Binary cross-entropy: for binary classification problem
  • Categorical cross-entropy: binary and multiclass problem, the label needs to be encoded as categorical, one-hot encoding representation (for 3 classes: [0, 1, 0], [1,0,0]…)
  • Sparse cross-entropy: binary and multiclass problem (the label is an integer — 0 or 1 or … n, depends on the number of labels)
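As an illustration, here is a minimal NumPy sketch of the three variants (the function names and the toy predictions are my own; in Keras you would use the built-in losses binary_crossentropy, categorical_crossentropy and sparse_categorical_crossentropy):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # y_true in {0, 1}, y_pred = predicted probability of class 1
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred):
    # y_true is one-hot encoded, each row of y_pred is a probability distribution
    return -np.mean(np.sum(np.asarray(y_true) * np.log(y_pred), axis=-1))

def sparse_cross_entropy(labels, y_pred):
    # labels are integer class indices, no one-hot encoding needed
    y_pred = np.asarray(y_pred)
    return -np.mean(np.log(y_pred[np.arange(len(labels)), labels]))

# one sample, true class 1 out of 3: both encodings give the same loss
print(categorical_cross_entropy([[0, 1, 0]], [[0.1, 0.8, 0.1]]))  # -log(0.8)
print(sparse_cross_entropy([1], [[0.1, 0.8, 0.1]]))               # -log(0.8)
```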

Range of values for this class of Loss function:

  • 0.00: Perfect probabilities
  • < 0.02: Great probabilities
  • < 0.05: On the right track
  • < 0.20: Fine
  • > 0.30: Not great
  • > 1.00: Hell
  • > 2.00: Something is not working


Log Loss

The Log Loss is the binary cross-entropy up to a factor 1/log(2). This loss function is convex and grows linearly for negative values (which makes it less sensitive to outliers). The most common algorithm that uses the Log Loss is logistic regression.

Negative log-likelihood for binary classification problems is often shortened to simply “log loss”, as the loss function derived for logistic regression.
- log loss = negative log-likelihood, under a Bernoulli probability distribution
More generally, for classification models the terms “log loss”, “cross-entropy” and “negative log-likelihood” are used interchangeably.
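The 1/log(2) factor can be checked numerically: measuring the binary cross-entropy in bits (base-2 logarithm) is the same as dividing the natural-log version by log(2). A small sketch with made-up labels and probabilities:

```python
import numpy as np

y_true = np.array([1., 0., 1., 1.])      # made-up binary labels
y_pred = np.array([0.9, 0.2, 0.7, 0.6])  # made-up predicted probabilities

def log_loss(y_true, y_pred, log=np.log):
    # negative log-likelihood under a Bernoulli distribution
    return -np.mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))

nats = log_loss(y_true, y_pred)               # natural logarithm
bits = log_loss(y_true, y_pred, log=np.log2)  # base-2 logarithm
print(np.isclose(bits, nats / np.log(2)))     # True
```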

Exponential Loss

The exponential loss was designed for the AdaBoost algorithm, which greedily optimizes it. Its mathematical form is:
exp_loss = 1/m * sum(exp(-y*f(x)))

And can be coded like this:

def exponential_loss(y_pred, y_true):
    return np.mean(np.exp(- y_pred * y_true))

The result can be shown below:

Exponential Loss vs misclassification (1 if y<0 else 0)

Hinge Loss

The Hinge loss function was developed to correct the hyperplane of the SVM algorithm in classification tasks. The goal is to apply different penalties to points that are misclassified or too close to the hyperplane.

Its mathematical formula is Hinge = max(0, 1 - y*f(x)) and the corresponding code:

def Hinge(y_pred, y_true):
    # per-sample hinge; np.maximum also works element-wise on arrays
    return np.maximum(0., 1. - y_pred * y_true)

This function is convex but non-differentiable at the hinge point; it needs a sub-gradient algorithm to be optimized.
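A valid sub-gradient is easy to write down: the loss is flat (gradient 0) when the margin y*f(x) exceeds 1, and has slope -y otherwise. A minimal sketch:

```python
import numpy as np

def hinge_subgradient(y_pred, y_true):
    # sub-gradient of max(0, 1 - y*f) with respect to f;
    # at the kink (y*f == 1) any value in [-y, 0] is valid, we pick 0
    return np.where(y_true * y_pred < 1., -y_true, 0.)

print(hinge_subgradient(0.5, 1.))  # -1.0: inside the margin, push f up
print(hinge_subgradient(2.0, 1.))  # 0.0: correctly classified with margin
```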

The result is shown below:

Hinge Loss vs misclassification (1 if y<0 else 0)

Kullback Leibler Divergence Loss

The KL divergence scores how one probability distribution differs from another. The KL divergence between a PDF Q and a PDF P is noted KL(Q||P), where || means divergence.

KL(Q||P) = -sum(Q(x) * log(P(x)/Q(x))) = sum(Q(x) * log(Q(x)/P(x)))

This means that the divergence increases if the PDF of Q is large and the PDF of P is small for the same data. In machine learning, you can represent this as the difference between the prediction and the ground truth.

The code below shows how to use the KL divergence with predictions and ground truth:

def kl_divergence(y_true, y_pred):
    # y_true and y_pred are probability vectors; sum over the distribution
    return np.sum(y_true * np.log(y_true / y_pred))

A simple visualization can be shown here:

Source: Wikipedia
As such, the KL divergence is often referred to as the “relative entropy.”
- Cross-Entropy: Average number of total bits to represent an event from Q instead of P.
- Relative Entropy (KL Divergence): Average number of extra bits to represent an event from Q instead of P.
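The “extra bits” reading can be verified numerically: cross-entropy decomposes as entropy plus KL divergence. A small sketch with two made-up distributions:

```python
import numpy as np

P = np.array([0.10, 0.40, 0.50])  # made-up "true" distribution
Q = np.array([0.80, 0.15, 0.05])  # made-up "approximating" distribution

def entropy(p):
    return -np.sum(p * np.log2(p))     # bits to encode events from p

def cross_entropy(p, q):
    return -np.sum(p * np.log2(q))     # bits when encoding p with q's code

def kl_divergence(p, q):
    return np.sum(p * np.log2(p / q))  # the extra bits

print(np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q)))  # True
```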


Mean Square Error Loss (also called L2 loss)

It's the squared difference between the current output y_pred and the expected output y_true, divided by the number of outputs. The MSE function is very sensitive to outliers because the squared difference gives more weight to large errors. If we had to predict one value for all targets, the prediction should be the mean.

This is expressed like this:

def mean_square_error(y_true, y_pred):
    return K.mean(K.square(y_true-y_pred), axis=-1)

We can visualize the behavior of the MSE function comparing a range of values (here -10000 to 10000) with a constant value (here 100):

The behavior of Mean Square Error Loss
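The curve can be reproduced numerically; the minimum of the quadratic sits exactly at the target value (here the constant 100):

```python
import numpy as np

y_true = 100.                                # the constant target
y_preds = np.linspace(-10000, 10000, 20001)  # candidate outputs, step 1
mse = (y_true - y_preds) ** 2                # squared error for each candidate

best = y_preds[np.argmin(mse)]
print(best)  # 100.0: the loss is minimized when the prediction hits the target
```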

The behavior is a quadratic curve, which is especially useful for gradient descent algorithms: the gradient gets smaller close to the minimum. MSE is very useful if outliers are important for the problem; if outliers are noisy, bad data, or bad measurements, you should use the MAE loss function instead.

Mean Absolute Error Loss (also called L1 loss)

In contrast with the previous loss function, the square is replaced by an absolute value. This difference has a big impact on the behavior of the loss function, which has a “V” shape. The gradient has the same magnitude at every point, even close to the minimum (which can make the optimizer jump over it), so the learning rate needs to be reduced dynamically as the values approach the minimum. The MAE function is more robust to outliers because it is based on the absolute value rather than the square, as in MSE. It behaves like a median: outliers can't really affect it.

You can implement it easily like this:

def mean_absolute_error(y_true, y_pred):
    return K.mean(K.abs(y_true-y_pred), axis=-1)

We can visualize the behavior of the MAE function comparing a range of values (here -10000 to 10000) with a constant value (here 100):

The behavior of Mean Absolute Error Loss
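The “mean vs. median” remark from the two sections above can be checked with a brute-force search over candidate constant predictions (the data below is a made-up sample with one outlier):

```python
import numpy as np

y = np.array([1., 2., 3., 4., 100.])        # one large outlier
candidates = np.linspace(0., 100., 100001)  # candidate constant predictions

mse = np.mean((y[:, None] - candidates) ** 2, axis=0)
mae = np.mean(np.abs(y[:, None] - candidates), axis=0)

print(candidates[np.argmin(mse)])  # ~22.0, the mean: dragged up by the outlier
print(candidates[np.argmin(mae)])  # ~3.0, the median: unaffected by the outlier
```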

Huber Loss

Huber Loss is a combination of MAE and MSE (L1 and L2), but it depends on an additional parameter called delta that influences the shape of the loss function. This parameter needs to be fine-tuned. When the values are large (far from the minimum), the function behaves like the MAE; close to the minimum, it behaves like the MSE. So the delta parameter is your sensitivity to outliers. The mathematical form of the Huber Loss is:

huber = 0.5 * (y_true - y_pred)**2                 if |y_true - y_pred| <= delta
huber = delta * (|y_true - y_pred| - 0.5 * delta)  otherwise

We can implement this function in two ways; here I present a version where all the blocks are explicit.

# custom Huber loss function
def huber_loss_error(y_true, y_pred, delta=0.1):
    res = []
    for t, p in zip(y_true, y_pred):
        a = abs(t - p)
        if a <= delta:
            res.append(0.5 * a**2)                 # quadratic (MSE-like) region
        else:
            res.append(delta * (a - 0.5 * delta))  # linear (MAE-like) region
    return res # np.sum(res)
# vectorized equivalent:
# np.where(np.abs(y_true-y_pred) <= delta, 0.5*(y_true-y_pred)**2, delta*(np.abs(y_true-y_pred)-0.5*delta))

We can visualize the behavior of the Huber loss function comparing a range of values (here -10 to 10) with a constant value (here 0):

Behavior of Huber Loss

Huber loss allows a large gradient for large errors and a decreasing gradient as the error becomes smaller. However, it requires fine-tuning delta, which is computationally expensive. To avoid this you can use the Log-Cosh loss (not explained in this article, but you can see in the next plot the difference between the two).

Comparison between the Huber loss function (delta = 1) and the Log-Cosh loss function
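For completeness, here is a small sketch of the two functions from the plot; log-cosh behaves roughly like a**2 / 2 for small errors a and like |a| - log(2) for large ones, so it is Huber-like without a delta to tune:

```python
import numpy as np

def huber(a, delta=1.0):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(a) <= delta,
                    0.5 * a ** 2,
                    delta * (np.abs(a) - 0.5 * delta))

def log_cosh(a):
    # smooth everywhere, no delta parameter to tune
    return np.log(np.cosh(a))

print(float(log_cosh(0.)))  # 0.0: no error, no loss
print(float(huber(5.)))     # 4.5 with delta=1: linear regime
```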