Encoding categorical variables

  • In the previous recipes, we have seen that some features are categorical variables (originally represented as either object or category data types). However, most machine learning algorithms work exclusively with numeric data. That is why we need to encode categorical features into a representation compatible with the models.
  • In this recipe, we use a popular encoding approach: one-hot encoding.
  • In this approach, for each category of a feature, we create a new column (sometimes called a dummy variable) with a binary encoding denoting whether a particular row belongs to this category (a small illustration follows this list).
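As a quick illustration (a minimal sketch, not part of the recipe's code; the color column and its values are made up), one-hot encoding a single feature with pandas could look like this:

import pandas as pd

df = pd.DataFrame({'color': ['green', 'red', 'blue']})
# each category becomes its own binary (dummy) column:
# color_blue, color_green, color_red
print(pd.get_dummies(df, columns=['color']))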

How to do it...

Execute the following steps to encode categorical variables.

  1. Import the libraries:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
  2. Use the OneHotEncoder (wrapped in a ColumnTransformer) to encode the selected columns:
# CAT_FEATURES: a list with the names of the categorical columns (defined earlier)
# note: scikit-learn >= 1.2 renamed the sparse argument to sparse_output
one_hot_encoder = OneHotEncoder(sparse=False,
                                handle_unknown='error',
                                drop='first')
one_hot_transformer = ColumnTransformer([('one_hot', one_hot_encoder, CAT_FEATURES)])
one_hot_transformer.fit(X_train)
X_train_ohe = one_hot_transformer.transform(X_train)
X_test_ohe = one_hot_transformer.transform(X_test)
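As a quick sanity check (a sketch, assuming the code above has run), we can inspect the categories the encoder learned while fitting:

# the fitted OneHotEncoder is reachable via named_transformers_
fitted_encoder = one_hot_transformer.named_transformers_['one_hot']
print(fitted_encoder.categories_)  # one array of learned categories per encoded column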

How it works...

  • First, we imported the necessary libraries.
  • In the second step, we instantiated the OneHotEncoder (returning a dense array, raising an error on categories unseen during fitting, and dropping the first category of each feature), wrapped it in a ColumnTransformer applied to the selected categorical columns, fitted the transformer to the training data, and transformed both the training and the test data (a short sketch of recovering readable column names follows this list).
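The transformer returns a NumPy array, so the original column names are lost. If we want to inspect the result as a DataFrame, something along these lines should work (a sketch; get_feature_names_out is available on ColumnTransformer from scikit-learn 1.0, older versions expose get_feature_names):

import pandas as pd

# rebuild a DataFrame with the generated dummy column names
encoded_cols = one_hot_transformer.get_feature_names_out()
X_train_ohe_df = pd.DataFrame(X_train_ohe, columns=encoded_cols, index=X_train.index)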

There's more...

  • Summing up, we should avoid label encoding when it introduces a false order to the data, which can, in turn, lead to incorrect conclusions. Tree-based methods (decision trees, Random Forest, and so on) can work with label-encoded categorical data. However, for algorithms such as linear regression, models that compute distances between observations (k-means clustering, k-Nearest Neighbors, and so on), or Artificial Neural Networks (ANNs), the natural representation is one-hot encoding (the sketch below makes the false-order problem concrete).
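To see where the false order comes from, here is a minimal sketch (the color feature is made up): label encoding maps categories to arbitrary integers, which a distance-based model would interpret as a meaningful ordering.

from sklearn.preprocessing import LabelEncoder

colors = ['blue', 'green', 'red']
# LabelEncoder assigns integers alphabetically: blue=0, green=1, red=2
labels = LabelEncoder().fit_transform(colors)
print(labels)  # [0 1 2]
# a distance-based model would now treat 'red' as closer to 'green'
# than to 'blue', although no such ordering exists in the data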