Loading Data and Managing Data Types
In this recipe, we show how to load a dataset into Python. In order to demonstrate the entire pipeline, including working with messy data, we apply some transformations to the original dataset. For more information on the applied changes, please refer to the accompanying GitHub repository.
How to do it...
Execute the following steps to load a dataset into Python:
- Import the libraries:
import pandas as pd
- Preview a CSV file:
! head -n 5 'data.csv'
- Load the data from the CSV file:
df = pd.read_csv('data.csv', index_col=0, na_values='', low_memory=False)
The DataFrame has 705,695 rows and 45 columns.
- Separate the features from the target:
X = df.copy()
y = X.pop('target')
After running this block of code, X no longer contains the target column.
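- As a quick sanity check (a minimal sketch), we can confirm that only X lost the column, while the original DataFrame still contains it (X is a copy):
print(df.shape)               # (705695, 45) -- df still contains the target
print(X.shape)                # (705695, 44) -- one column fewer
print(y.shape)                # (705695,)
print('target' in X.columns)  # False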
How it works...
- In Step 1, we imported the pandas library.
- In Step 2, we used the bash command head (with the additional argument -n 5) to preview the first five rows of the CSV file. This can come in handy when we want to determine what kind of data we are dealing with, without opening a potentially large text file. Inspecting a few rows of the dataset raised the following questions:
- What is the separator?
- Are there any special characters that need to be escaped?
- Are there any missing values and, if so, what scheme (NA, empty string, nan and so on) is used for them?
- What variable types (floats, integers, strings) are in the file? Based on that information, we can try to optimize memory usage while loading the file, as shown in the sketch after this list.
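- For example, the information gathered during the preview can be translated into explicit pd.read_csv arguments. The snippet below is only an illustrative sketch; the separator, missing-value markers, and the some_flag column used here are hypothetical and should be replaced with whatever the preview actually revealed:
pd.read_csv(
    'data.csv',
    sep=',',                       # separator observed in the preview
    na_values=['NA', ''],          # markers to treat as missing values
    dtype={'some_flag': 'int8'},   # hypothetical column loaded with a smaller dtype
    low_memory=False,
)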
- In Step 3, we loaded the CSV file by using the pd.read_csv function. When doing so, we indicated that the first column (zero-indexed) contains the index and that empty strings should be interpreted as missing values. We also passed low_memory=False, so that pandas processes the entire file at once instead of in chunks, which prevents inconsistent (mixed) data type inference.
- In the last step, we separated the features from the target by using the pop method. It assigns the given column to a new variable while removing it from the source DataFrame.
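- To make the behavior of pop more concrete, here is a minimal sketch on a toy DataFrame (the column names and values are made up for illustration):
toy = pd.DataFrame({'feature': [1, 2, 3], 'target': [0, 1, 0]})
label = toy.pop('target')    # returns the 'target' column as a Series
print(toy.columns.tolist())  # ['feature'] -- the column is gone from the DataFrame
print(label.tolist())        # [0, 1, 0]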
- In the following, we provide a simplified description of the variables:
- The target variable indicates whether the customer defaulted on the payment in the following month.
There's more...
- In general, pandas tries to load and store data efficiently. It automatically assigns data types (which we can inspect via the dtypes attribute of a pandas DataFrame). However, there are some tricks that can lead to much better memory allocation, which makes working with larger tables (in the hundreds of MBs, or even GBs) considerably easier.
- We start by inspecting the data types in our DataFrame:
df.dtypes
- In the preceding output, we see a few distinct data types: floats (floating-point numbers, such as 3.42), integers, and objects. The last ones are the pandas representation of string variables. The number next to float and int indicates how many bits this type uses to represent a particular value. The default types use 64 bits of memory.
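- The effect of the bit width is easy to verify on a toy Series; the following is a minimal sketch comparing one million values stored as int64 versus int8:
import numpy as np

s64 = pd.Series(np.zeros(1_000_000, dtype='int64'))
s8 = s64.astype('int8')
print(s64.memory_usage(deep=True) / 1024 ** 2)  # ~7.6 MB (8 bytes per value)
print(s8.memory_usage(deep=True) / 1024 ** 2)   # ~1.0 MB (1 byte per value)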
- We also leverage a special type called category. The underlying idea is that string variables are encoded as integers, and pandas uses a special mapping dictionary to decode them back into their original form. This is especially useful when dealing with a limited number of distinct string values (for example, current_rating, region_or_state, and so on).
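- The sketch below illustrates the encoding behind the category type (the values are made up for illustration): the distinct strings are stored once as categories, while each row only holds a small integer code.
ratings = pd.Series(['good', 'bad', 'good', 'average'], dtype='category')
print(ratings.cat.categories)      # Index(['average', 'bad', 'good'], dtype='object')
print(ratings.cat.codes.tolist())  # [2, 1, 2, 0] -- integer code per row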
- We define a function that we use to inspect how much memory (in MB) a DataFrame actually uses:
def get_df_memory_usage(df, top_columns=5):
    # Report the largest columns by memory usage and the total size of the DataFrame (in MB)
    print('Memory usage ----')
    memory_per_column = df.memory_usage(deep=True) / 1024 ** 2
    print(f'Top {top_columns} columns by memory (MB):')
    print(memory_per_column.sort_values(ascending=False)
                           .head(top_columns))
    print(f'Total size: {memory_per_column.sum():.4f} MB')
- Inspect the size of the initial DataFrame (just as we loaded it from the source CSV file):
get_df_memory_usage(df)
- In total, the DataFrame uses ~311 MB of memory, with ~25% of that used by columns with the object data type. This is still very small in terms of current machines' capabilities; however, the memory-saving principles we show here apply just as well to DataFrames measured in gigabytes.
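- The per-dtype breakdown mentioned above can be computed with a short snippet such as the following (a sketch, assuming df is the DataFrame we just loaded):
memory_by_dtype = (
    df.memory_usage(deep=True, index=False)
      .groupby(df.dtypes.astype(str))
      .sum()
)
print(memory_by_dtype / 1024 ** 2)  # memory in MB per data type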
- We create a copy of the original DataFrame and change the types of the object columns to category. We first select the names of the columns with variables of the object type (using the select_dtypes method), and then convert them to a categorical type using the astype method.
- Convert object columns to categorical:
df_cat = df.copy()
object_columns = df_cat.select_dtypes(include='object').columns
df_cat[object_columns] = df_cat[object_columns].astype('category')
- Inspect the size of the DataFrame:
get_df_memory_usage(df_cat)
- With this simple transformation, we managed to reduce the size (memory-wise) of the DataFrame by ~80%. We could also downcast the integer columns (currently using the int64 type) to something much smaller memory-wise, but for that, we would need to inspect the min and max values for each column and determine which type would be the best fit. We do not do this here; however, the process of using astype is the same.
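- A minimal sketch of such downcasting, using pd.to_numeric with the downcast argument instead of manually inspecting the ranges, could look as follows (shown here on the category-converted DataFrame):
# Downcast each int64 column to the smallest integer type that fits its values
int_columns = df_cat.select_dtypes(include='int64').columns
df_cat[int_columns] = df_cat[int_columns].apply(pd.to_numeric, downcast='integer')
get_df_memory_usage(df_cat)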