Christophe Pere is a senior NLP researcher and a Deepflow advisor. His post was originally published on Medium. Cover picture: Markus Spiske on Unsplash.

A notebook containing all the relevant code is available on GitHub.

I — Exploratory Data Analysis (EDA)

Yes, this is yet another post on the much-discussed subject of EDA. This step is the most important of a data science project. Why? Because it gives you the knowledge about your data, and the ideas and intuitions, that you need to model it later.

EDA is the art of making your data speak: checking its quality (missing data, wrong types, wrong content …), determining the correlations between variables, and knowing the cardinality of each feature.

But EDA is not just about exploring data. When you have a target, a column containing labels (supervised learning), you can also do feature selection and compute feature importance. Without a target (unsupervised learning), you can do feature extraction.

For years, the best way was to tirelessly code the same functions to calculate correlations, plot variables, manually explore columns to compute interesting statistics, and so on.

But now there are simpler, faster, and more efficient ways to do all of this:

Ia. Pandas-profiling

The first, pandas-profiling, creates an HTML report of the content of a dataframe with a very nice interface. Based on pandas, it performs a complete exploration of the data with excellent performance (up to about one million rows, a recommendation worth keeping in mind). The report can be embedded as a widget in JupyterLab or a notebook, or presented as a frame.

As the authors indicate, you'll get the following information:

  • Type inference: detect the types of columns in a dataframe
  • Essentials: type, unique values, missing values
  • Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histograms
  • Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap, and dendrogram of missing values
  • Duplicate rows: list of the most frequently occurring duplicate rows
  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
source: pandas-profiling

You can find examples on the GitHub page of the library, such as:

  • NASA Meteorites landings: this report is the output of the profile_report() function and shows how powerful this library is.

How to use it? In a few lines of code, let me show you.
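The sketch below is a minimal example; the file name data.csv and the report title are placeholders, and I assume the data fits comfortably in a pandas DataFrame:

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder: load your own data here
profile = ProfileReport(df, title="EDA Report")

profile.to_widgets()            # render the report as a Jupyter widget
profile.to_notebook_iframe()    # or render it as a frame inside the notebook
profile.to_file("report.html")  # or export it as a standalone HTML file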

It takes only a few seconds to compute, compared to hand-coding the same analyses, and you get an impressive result.

The result when you show the report in a widget:

Pandas-profiling profile report in widget (rendering)

The result when you show the report in a frame inside the notebook:

Pandas-profiling HTML in a frame (rendering)

Ib. Dataprep.eda

Another great library is dataprep, with its eda module. What does it do?

You have three main functions (a combined usage sketch follows the figures below):

  • plot
plot function of the dataprep.eda package on the Boston House Prices data set

This function shows a histogram for each feature. Each plot is interactive, based on the bokeh library, and different parameters let you focus on the information you want.

  • plot_correlation

The function computes three kinds of correlation matrices (Pearson, Spearman, and KendallTau). The advantage is that the plot is also interactive: you can see the values simply by hovering the cursor over them.

plot_correlation on Boston House Prices data set
  • plot_missing

This last function is very interesting, as the picture below shows. It lets you visualize where the missing values are in each column and what percentage of each column they represent.

plot_missing function
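Putting the three functions together, a minimal sketch could look like this; I assume a pandas DataFrame df (for instance the Boston House Prices data) is already loaded, and the column name "age" is only an illustration:

from dataprep.eda import plot, plot_correlation, plot_missing

plot(df)              # interactive distribution plot for every column
plot(df, "age")       # zoom in on a single column (illustrative column name)
plot_correlation(df)  # Pearson, Spearman and KendallTau correlation matrices
plot_missing(df)      # location and percentage of missing values per column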

Ic. Sweetviz

The last interesting library is sweetviz. Based on pandas-profiling, it lets you compare different columns, or the train and test parts of your data, to determine whether the test set is representative of the train set. Like pandas-profiling, you get tons of information per column. The picture below shows the dashboard of the HTML report generated by the library.

Comparison between train and test with Sweetviz
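A minimal sketch, assuming your data is already split into two DataFrames named train and test:

import sweetviz as sv

# compare the train and test sets feature by feature
report = sv.compare([train, "Train"], [test, "Test"])
report.show_html("sweetviz_report.html")  # writes and opens the HTML dashboard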

II — Feature Selection

EDA is not only about looking at what is inside the data. You can go deeper into the analysis with the following parts. Feature selection is a way to reduce the number of features present in your dataset.

Here, I present four ways to do it. The sklearn library has powerful modules for selecting or extracting features.

IIa. Removing Features with Low Variance

This technique removes all features whose variance doesn't meet a given threshold. By default, it removes all zero-variance features, i.e. features that have the same value in all samples. In the code below, the threshold is set so that (for boolean features) any feature keeping the same value in more than 80% of the samples is removed. This method doesn't look at the prediction variable y, so it can be used in an unsupervised way.

from sklearn.feature_selection import VarianceThreshold

threshold = 0.8  # remove features that keep the same value in more than 80% of the samples

fe_low_variance = VarianceThreshold(threshold=(threshold * (1 - threshold)))
X_variance = fe_low_variance.fit_transform(X)

IIb. Univariate Selection

In supervised learning, you have a target feature (commonly named y). The goal of univariate selection is simple: apply a univariate statistical test between each feature and the target, and keep the features with the best scores.

With sklearn, you have 4 methods to do it.

  • SelectKBest: selects the best k features of your dataset (k is chosen manually by the user) and removes the others. This function needs a scorer, a metric function used to rank the features. A commonly used scorer is chi2.
  • SelectPercentile: same as SelectKBest, you need to pass a scorer, but instead of a number k of features, you pass a percentile value (see the sketch after this list).
  • SelectFpr/SelectFdr/SelectFwe: selection by the p-values based on the false positive rate, the false discovery rate, and the family-wise error.
  • GenericUnivariateSelect: here you can customize your estimator with a configurable strategy.
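For example, a minimal SelectPercentile sketch (assuming X and y are already defined and that X contains non-negative values, as chi2 requires):

from sklearn.feature_selection import SelectPercentile, chi2

# keep only the 20% best-scoring features according to the chi2 test
select_percentile = SelectPercentile(score_func=chi2, percentile=20)
X_reduced = select_percentile.fit_transform(X, y)
print(f"Kept {X_reduced.shape[1]} of {X.shape[1]} features")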

In the code below I use SelectKBest with the chi2 scorer:

import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# apply the SelectKBest class to extract the top 10 best features
select_best_features = SelectKBest(score_func=chi2, k=10)  # where k is the number of features you want
fit = select_best_features.fit(X, y)
df_scores = pd.DataFrame(fit.scores_)
df_columns = pd.DataFrame(X.columns)  # where X is your data
# concatenate the two dataframes for better visualization
feature_scores = pd.concat([df_columns, df_scores], axis=1)
feature_scores.columns = ['Specs', 'Score']  # naming the dataframe columns
print(feature_scores.nlargest(10, 'Score'))  # print the 10 best features

IIc. Recursive Feature Elimination

As the picture below shows, the principle of RFE is simple. The estimator fits the data and computes the feature importances, i.e. the weight of each feature on the target. At each iteration, the model removes the feature with the lowest importance, until the requested number of features k is reached.

Schema of the RFE

How can we code this? Here is an implementation for SVM (using RFECV) and one for Logistic Regression (using RFE).

SVM:

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

# SVM implementation
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct

rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(5),            scoring='accuracy')rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)

Logistic Regression:

# Feature Extraction with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='lbfgs', max_iter=5000)
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

IId. SelectFromModel

This last method is a generalization of the previous ones: SelectFromModel takes any estimator that exposes feature importances or coefficients after fitting and returns a new matrix containing only the selected features.

The code below shows how to implement it:

from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

lsvc = LinearSVC(C=0.01, penalty="l1", dual=False)  # estimator
lsvc.fit(X, y)
model = SelectFromModel(lsvc, prefit=True)
X_new = model.transform(X)
print(f"The new number of feature is {X_new.shape[1]}")

III — Feature Extraction

IIIa. Principal Component Analysis (PCA)

Principal Component Analysis is a method used to reduce the dimension of a dataset. The principle is simple: PCA fits a line or a plane (more generally, a low-dimensional subspace) through the points, along the directions of greatest variance, and projects the data onto it to create another representation of the data.

PCA projection

The code is simple to use. You just have to set n_components (here the variable N_var), which is the number of dimensions you want.

import pandas as pd
from sklearn.decomposition import PCA

N_var = 2
pca = PCA(n_components=N_var)
X_pca = pca.fit_transform(X)
df_pca = pd.DataFrame(data=X_pca, columns=['PC1', 'PC2'])

IIIb. Independent Component Analysis (ICA)

ICA is a powerful technique for separating independent multivariate signals that have been linearly mixed. In signal processing, it allows us to recover the original source signals from their mixtures.

The code below shows an implementation of a FastICA:

from sklearn.decomposition import FastICA
N_var = 2
ica = FastICA(n_components=N_var)
X_ica = ica.fit_transform(X)

IIIc. Linear Discriminant Analysis (LDA)

Note that two very different techniques share the acronym LDA. The one used in the code below is Linear Discriminant Analysis from sklearn, a supervised method that projects the data onto the directions that best separate the classes, which makes it usable for dimensionality reduction. The other is Latent Dirichlet Allocation, a topic model for text; I share here the abstract of its original paper:

We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model. (source: David M. Blei, Andrew Y. Ng and Michael I. Jordan, 2003[1])

A simple way to use Linear Discriminant Analysis with sklearn:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
N_var = 2
lda = LinearDiscriminantAnalysis(n_components=N_var)

# run an LDA and use it to transform the features
X_lda = lda.fit(X, y).transform(X)

IIId. Locally Linear Embedding (LLE)

Like all the previous feature extraction methods, LLE is an unsupervised technique to reduce the dimension (from n dimensions to k dimensions, where k is chosen). The goal is to preserve the geometric structure of the original non-linear feature space. LLE is based on the k-Nearest-Neighbors (k-NN) technique: for each point, the algorithm finds its k nearest neighbors and expresses the point as a weighted linear combination of them, then searches for a low-dimensional embedding that best preserves these local weights.

The code to use it is like this:

from sklearn.manifold import locally_linear_embedding
N_var = 2
lle, error = locally_linear_embedding(X, n_neighbors=5, n_components=N_var, random_state=42, n_jobs=-1)

IIIe. t-distributed Stochastic Neighbor Embedding (t-SNE)

T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization developed by Laurens van der Maaten and Geoffrey Hinton.[2] It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. (source: Wikipedia)

t-SNE is commonly compared with PCA because, for visualization, t-SNE usually separates groups of similar points more clearly than a linear projection such as PCA.

A simple implementation is provided in sklearn:

from sklearn.manifold import TSNE
N_var = 2
X_embedded = TSNE(n_components=N_var).fit_transform(X)

IV — Feature Importance

IVa. Tree Method

Each node of a tree-based algorithm splits the data on a single feature at a given value. The more a feature's splits help predict the target y (the feature you want to predict), the higher its weight, so the algorithm can compute the importance of each feature on the target. You can use the feature_importances_ attribute of a fitted tree model to get these values and, for an ensemble, compute their standard deviation across trees.

Here I provide two ways to compute this feature importance using tree-based methods: the first uses ExtraTreesClassifier (I have used it a lot in the past to determine the importance of my features), the second RandomForest.

ExtraTreesClassifier:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

RandomForest:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True, bootstrap=True, random_state=42)
rf.fit(X, y)

print('R^2 Training Score: {:.2f} \nOOB Score: {:.2f}'.format(rf.score(X, y), rf.oob_score_))

results = pd.DataFrame(data=rf.feature_importances_, index=X.columns)
results.columns = ["Importance"]
results = results.sort_values(by=["Importance"], ascending=False)
importances = rf.feature_importances_

IVb. Permutation Method

The last thing we will see in this post is the permutation method. The goal is simple: take a feature, shuffle its values, and measure how much the model's score degrades. The bigger the degradation, the more the estimator depends on this feature.

This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature. (source: sklearn)

Here, I provide three implementations to use the permutation method. The first package is provided by rfpimp:

from sklearn.metrics import r2_score
from rfpimp import permutation_importances

def r2(rf, X_train, y_train):
    return r2_score(y_train, rf.predict(X_train))

perm_imp_rfpimp = permutation_importances(rf, X, y, r2)
importances = perm_imp_rfpimp.Importance

The eli5 library provides a version of PermutationImportance:

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(rf, cv=None, refit=False, n_iter=50).fit(X, y)
results = pd.DataFrame(data=perm.feature_importances_, index=X.columns)
results.columns = ["Importance"]
results = results.sort_values(by=["Importance"], ascending=False)
importances = perm.feature_importances_

Finally, sklearn provides permutation_importance, shown here with a Ridge estimator scored by R²:

from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

model = Ridge(alpha=1e-2).fit(X_train, y_train)
r = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=42)

for i in r.importances_mean.argsort()[::-1]:
    if r.importances_mean[i] - 2 * r.importances_std[i] > 0:
        print(f"{r.importances_mean[i]:.3f}" f" +/- {r.importances_std[i]:.3f}")

Conclusion

You finally did it: you reached the end of this post about EDA and the related techniques. You now know how to do an EDA with different libraries, and you know the different methods to select the best features of your dataset, reduce its dimension, and compute feature importance. You are armed to explore your data more deeply and to represent it with visualizations. I hope this post, and the accompanying notebook, will help you.

References

[1] David M. Blei, Andrew Y. Ng and Michael I. Jordan, Latent Dirichlet Allocation (2003), Journal of Machine Learning Research 3: 993–1022.

[2] L.J.P. van der Maaten and G.E. Hinton, Visualizing Data Using t-SNE (2008), Journal of Machine Learning Research 9: 2579–2605.

Note

The images provided in the post have been generated by the author or drawn by him.