Dealing with missing values

  • In most real-life cases, we do not work with clean, complete data. One of the potential problems we are bound to encounter is that of missing values. We can categorize missing values by the reason they occur:
    • Missing completely at random (MCAR)—The reason for the missing data is unrelated to the rest of the data. An example could be a respondent accidentally missing a question in a survey.
    • Missing at random (MAR)—The missingness of the data can be inferred from the data in another column (or columns). For example, whether a response to a certain survey question is missing can, to some extent, be determined conditionally by other observed factors such as gender, age, lifestyle, and so on.
    • Missing not at random (MNAR)—When there is some underlying reason for the missing values, related to the values themselves. For example, people with very high incomes tend to be hesitant about revealing them.
    • Structurally missing data—Often a subset of MNAR, the data is missing because of a logical reason. For example, when a variable representing the age of a spouse is missing, we can infer that a given person has no spouse.
  • While some machine learning algorithms can account for missing data (for example, XGBoost can treat missing values as a separate and unique value), many algorithms cannot, or their popular implementations (such as the ones in scikit-learn) do not incorporate this functionality.
  • Some popular solutions include:
    • Drop observations with one or more missing values—While this is the easiest approach, it is not always a good one, especially in the case of small datasets. Even if there is only a small fraction of missing values per feature, they do not necessarily occur for the same observations (rows), so the actual number of rows to remove can be much higher. Additionally, in the case of data missing not at random, removing such observations from the analysis can introduce bias into the results.
    • Replace the missing values with a value far outside the possible range, so that algorithms such as decision trees can treat it as a special value, indicating missing data.
    • In the case of dealing with time series, we can use forward-filling (take the last-known observation before the missing one), backward-filling (take the first-known observation after the missing one), or interpolation (linear or more advanced).
    • Replace the missing values with an aggregate metric—for continuous data, we can use the mean (when there are no clear outliers in the data) or median (when there are outliers). In the case of categorical variables, we can use mode (the most common value in the set). A potential disadvantage of mean/median imputation is the reduction of variance in the dataset.
    • Replace the missing values with aggregate metrics calculated per group—for example, when dealing with body-related metrics, we can calculate the mean or median per gender to replace the missing data more accurately (both this and the time-series fills are sketched after this list).
    • ML-based approaches—We can treat the considered feature as a target, and use complete cases to train a model and predict values for the missing observations.
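  • For illustration, below is a minimal pandas sketch of two of the approaches above: forward-/backward-filling and interpolating a time series, and group-wise median imputation. The prices series and the people DataFrame (with gender and weight columns) are hypothetical examples made up for this sketch.
import numpy as np
import pandas as pd

# hypothetical daily time series with gaps
prices = pd.Series(
    [100.0, np.nan, 102.5, np.nan, 101.0],
    index=pd.date_range('2020-01-01', periods=5, freq='D')
)
prices_ffill = prices.ffill()          # carry the last known observation forward
prices_bfill = prices.bfill()          # use the next known observation
prices_interp = prices.interpolate()   # linear interpolation between known points

# hypothetical cross-sectional data: impute weight with the per-gender median
people = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M'],
    'weight': [58.0, np.nan, 80.0, np.nan, 92.0]
})
people['weight'] = people.groupby('gender')['weight'].transform(
    lambda s: s.fillna(s.median())
)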

How to do it...

Execute the following steps to investigate and deal with missing values in the dataset.

  1. Import the libraries:
import numpy as np
import pandas as pd
import missingno
from sklearn.impute import SimpleImputer
  2. Inspect the information about the DataFrame:
X.info()
  3. Visualize the nullity of the DataFrame:
missingno.matrix(X)
  • The white bars visible in the columns represent missing values. The sparkline on the right side of the plot summarizes the completeness of each row; the two numbers next to it indicate the maximum and minimum number of non-missing values per row. A quick numeric cross-check of the plot is sketched below.
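  • As a complementary view (not part of the recipe's steps), we can count the missing values per column and plot the column completeness with missingno's bar chart:
X.isnull().sum().sort_values(ascending=False)  # missing values per column
missingno.bar(X)  # bar chart of non-missing counts per column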
  4. Define the columns with missing values, per data type:
num_cols = X.select_dtypes(include=[np.number])
cat_cols = X.select_dtypes(include='object')
NUM_FEATURES = num_cols.columns[num_cols.isnull().any()].tolist()
CAT_FEATURES = cat_cols.columns[cat_cols.isnull().any()].tolist()
  5. Impute the numerical features:
for col in NUM_FEATURES:
    num_imputer = SimpleImputer(strategy='median')
    num_imputer.fit(X_train[[col]])
    X_train.loc[:, col] = num_imputer.transform(X_train[[col]])
    X_test.loc[:, col] = num_imputer.transform(X_test[[col]])
  6. Impute the categorical features:
for col in CAT_FEATURES:
    cat_imputer = SimpleImputer(strategy='most_frequent')
    cat_imputer.fit(X_train[[col]])
    X_train.loc[:, col] = cat_imputer.transform(X_train[[col]])
    X_test.loc[:, col] = cat_imputer.transform(X_test[[col]])
  7. Verify that there are no missing values:
X_train.info()
  • We can inspect the output to confirm that there are no missing values in X_train. An explicit programmatic check is sketched below.
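  • If we prefer an explicit check over reading the info output, a minimal sketch (using the same X_train and X_test as above):
assert X_train.isnull().sum().sum() == 0
assert X_test.isnull().sum().sum() == 0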

How it works...

  • In Step 1, we imported the required libraries.
  • Then, we used the info method of a pandas DataFrame to view information about the columns, such as their type and the number of non-null observations. Another possible way of inspecting the number of missing values per column is to run X.isnull().sum().
  • In Step 3, we visualized the nullity of the DataFrame, with the help of the missingno library.
  • In Step 4, we defined lists containing the features we wanted to impute, one list per data type. We did so because numerical features are imputed using different strategies than categorical ones. For basic imputation, we used the SimpleImputer class from scikit-learn.
  • In Step 5, we iterated over the numerical features (in this case, only the age feature), and used the median to replace the missing values. Inside the loop, we defined the imputer object with the correct strategy (median), fitted it to the given column of the training data, and transformed both the training and test data. This way, the median was estimated by only using the training data, preventing potential data leakage.
  • Step 6 is analogous to Step 5, where we used the same approach to iterate over the categorical columns. The difference lies in the selected strategy—we used the most frequent value (most_frequent) in the given column. This strategy can be used for both categorical and numerical features; in the latter case, it corresponds to the mode. An alternative that applies both imputers in a single object is sketched below.
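  • The per-column loops are easy to follow, but we could also bundle both imputers into a single scikit-learn object. Below is a minimal sketch using ColumnTransformer with the NUM_FEATURES, CAT_FEATURES, X_train, and X_test defined earlier; this is an alternative to the loops, not the approach used in the recipe:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='median'), NUM_FEATURES),
        ('cat', SimpleImputer(strategy='most_frequent'), CAT_FEATURES)
    ],
    remainder='passthrough'  # keep the columns that had no missing values
)
preprocessor.fit(X_train)  # fit on the training data only, to avoid data leakage
X_train_imputed = preprocessor.transform(X_train)  # returns a NumPy array
X_test_imputed = preprocessor.transform(X_test)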

There's more...

  • In this recipe, we showed how to impute missing values. Approaches such as replacing the missing value with one large value or the mean/median/mode are called single imputation approaches, as they replace missing values with one specific value. However, there are also multiple imputation approaches, and one of those is Multiple Imputation by Chained Equations (MICE). In short, the algorithm runs multiple regression models, and each missing value is determined conditionally on the basis of the non-missing data points. A potential benefit of using an ML-based approach to imputation is the reduction of the bias introduced by single imputation.
  • An imputer inspired by the MICE algorithm is available in scikit-learn's impute module, under the name IterativeImputer. Note that it is still considered experimental, so it has to be explicitly enabled before it can be imported, as sketched below.
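  • A minimal sketch of applying IterativeImputer to the numerical features, using the X_train, X_test, and NUM_FEATURES defined earlier:
# IterativeImputer is experimental, so this import is required to enable it
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

iter_imputer = IterativeImputer(max_iter=10, random_state=42)
iter_imputer.fit(X_train[NUM_FEATURES])
X_train.loc[:, NUM_FEATURES] = iter_imputer.transform(X_train[NUM_FEATURES])
X_test.loc[:, NUM_FEATURES] = iter_imputer.transform(X_test[NUM_FEATURES])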