Exploratory Data Analysis
- The second step, after loading the data, is to carry out Exploratory Data Analysis (EDA). By doing this, we get to know the data we are supposed to work with. Some insights we try to gather are:
- What kind of data do we actually have, and how should we treat different types?
- What is the distribution of the variables?
- Are there outliers in the data, and how can we treat them?
- Are any transformations required? For example, some models work better with (or require) normally distributed variables, so we might want to use techniques such as log transformation.
- Does the distribution vary per group (for example, gender or education level)?
- Do we have cases of missing data? How frequent are these, and in which variables?
- Is there a linear relationship between some variables (correlation)? Can we create new features using the existing set of variables? An example might be deriving hour/minute from a timestamp, day of the week from a date, and so on; a short sketch of this (and of the log transformation mentioned earlier) follows this list.
- Are there any variables that we can remove as they are not relevant for the analysis? An example might be a randomly generated customer identifier.
- EDA is extremely important in all data science projects, as it enables analysts to develop an understanding of the data, facilitates asking better questions, and makes it easier to pick modeling approaches suitable for the type of data being dealt with.
- In real-life cases, it makes sense to carry out univariate analysis (one feature at a time) for all relevant features to get a good understanding of them, and then proceed to multivariate analysis (comparing distributions per group, correlations, and so on). For brevity, we only show some popular approaches on selected features, but a deeper analysis is highly encouraged.
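- As a minimal sketch of two of the transformations mentioned in the list above (a log transformation and deriving features from a timestamp). The column names balance and created_at are hypothetical and used purely for illustration, and the snippet assumes pandas and numpy are imported as in the steps below:
# hypothetical columns, used only to illustrate the idea
# log transformation of a non-negative, right-skewed variable
df['balance_log'] = np.log1p(df['balance'])
# derive calendar features from a timestamp column
df['created_at'] = pd.to_datetime(df['created_at'])
df['hour'] = df['created_at'].dt.hour
df['day_of_week'] = df['created_at'].dt.day_name()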
How to do it...
Execute the following steps to carry out EDA:
- Import the libraries:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
- Get summary statistics for numeric variables:
df.describe().transpose().round(2)
- Get summary statistics for categorical variables:
df.describe(include='object').transpose()
- Define and run a function for plotting the correlation heatmap:
def plot_correlation_matrix(corr_mat):
    # use a plain white background for the heatmap
    sns.set(style='white')
    # mask the upper triangle and the diagonal to avoid showing duplicated information
    mask = np.zeros_like(corr_mat, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    fig, ax = plt.subplots()
    # diverging palette centered at 0
    cmap = sns.diverging_palette(240, 10, n=9, as_cmap=True)
    sns.heatmap(corr_mat, mask=mask, cmap=cmap, vmax=0.3, center=0,
                square=True, linewidths=.5, cbar_kws={'shrink': .5}, ax=ax)
    ax.set_title('Correlation Matrix', fontsize=16)
    # restore seaborn's default style
    sns.set(style='darkgrid')

corr_mat = df.select_dtypes(include='number').corr()
plot_correlation_matrix(corr_mat)
- In the plot, we can see that age seems to be uncorrelated with any of the other features.
- Investigate the distribution of the target variable per current_rating and region_or_state:
ax = df.groupby('current_rating')['target'].value_counts(normalize=True).unstack().plot(kind='barh', stacked=True)
ax.set_title('Percentage of default per current rating', fontsize=16)
ax.legend(title='Default', bbox_to_anchor=(1, 1))

ax = df.groupby('region_or_state')['target'].value_counts(normalize=True).unstack().plot(kind='barh', stacked=True)
ax.set_title('Percentage of default per region or state', fontsize=16)
ax.legend(title='Default', bbox_to_anchor=(1, 1))
How it works...
- We started the analysis by using describe, a simple yet powerful method of a pandas DataFrame. It printed summary statistics, such as the count, mean, min/max, and quartiles, for all the numeric variables in the DataFrame. By inspecting these metrics, we could infer the value range of a given feature, or whether its distribution was skewed (by looking at the difference between the mean and the median). We could also easily spot values outside the plausible range, for example, a negative or very small age.
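- If the gap between the mean and the median is not conclusive, skewness can also be quantified directly. A minimal sketch, reusing the DataFrame from the steps above:
# skewness per numeric column; values far from 0 indicate asymmetric distributions
df.select_dtypes(include='number').skew().round(2)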
- The count metric represents the number of non-null observations, so it is also a way to determine which numeric features contain missing values. Another way of investigating the presence of missing values is by running df.isnull().sum(). For more information on missing values, please see the Dealing with missing values recipe.
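- For instance, the following short snippet (using the same DataFrame as in the steps above) keeps only the columns that actually contain missing values:
# number of missing values per column, restricted to affected columns
missing = df.isnull().sum()
missing[missing > 0]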
- In the third step, we added the include='object' argument to the describe method to inspect the categorical features separately. The output differed from the one for numeric features: we could see the count, the number of unique categories, the most frequent category, and how many times it appeared in the dataset.
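- To drill into a single categorical feature beyond what describe shows, value_counts is handy. A small sketch, using one of the grouping columns from the steps above:
# frequency and share of each category
df['current_rating'].value_counts()
df['current_rating'].value_counts(normalize=True)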
- In Step 4, we defined a function for plotting a heatmap representing the correlation matrix. In the function, we used a couple of operations to mask the upper triangle of the matrix, together with the diagonal (on which all correlations are equal to 1). This way, the output was easier to interpret.
- To calculate the correlations, we used the corr method of a DataFrame, which by default calculated Pearson's correlation coefficient. We did this only for numeric features. There are also methods for calculating the correlation of categorical features, but they are beyond the scope of this chapter. Inspecting correlations is crucial, especially when using machine learning algorithms that are sensitive to strongly correlated features (multicollinearity), such as linear regression.
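- As a follow-up sketch (not part of the original steps): switching to Spearman's rank correlation is a one-argument change, and stacking the masked correlation matrix is one way to flag the most strongly correlated feature pairs, that is, candidate multicollinearity. It reuses corr_mat from Step 4:
# rank-based correlation, less sensitive to outliers and to non-linear (but monotonic) relationships
corr_spearman = df.select_dtypes(include='number').corr(method='spearman')
# most strongly correlated feature pairs (upper triangle only, excluding the diagonal)
upper = corr_mat.where(np.triu(np.ones(corr_mat.shape), k=1).astype(bool))
upper.stack().abs().sort_values(ascending=False).head(10)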
- In the last step, we investigated the distribution of the target variable (default) per current_rating and region_or_state.
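- The same percentages can also be obtained without plotting, for instance with a crosstab. A minimal sketch, assuming the column names used in the steps above:
# fraction of each target value within every rating category
pd.crosstab(df['current_rating'], df['target'], normalize='index').round(4)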