Select columns using Pandas read_csv usecols parameter

The usecols parameter in pandas.read_csv() method allows you to selectively load only the columns of interest, reducing memory usage and speeding up the loading process.

Let’s see how we can use usecols parameter to select specific columns when importing data from a CSV file.

 

 

Using usecols with Column Indexes

You can use the column indexes with the usecols parameter to select specific columns:

import pandas as pd
data = """
Subscriber_ID,Plan_Type,Usage_Data,Monthly_Cost,Join_Date
1001,Prepaid,2.5GB,$30,2021-05-01
1002,Postpaid,Unlimited,$50,2021-03-12
1003,Prepaid,1GB,$20,2021-08-15
"""

# Using StringIO to simulate reading from a csv file
from io import StringIO
df = pd.read_csv(StringIO(data), usecols=[0, 2, 4])
print(df)

Output:

   Subscriber_ID Usage_Data    Join_Date
0           1001      2.5GB  2021-05-01
1           1002  Unlimited  2021-03-12
2           1003       1GB  2021-08-15

By specifying the column indexes [0, 2, 4], we’ve selectively loaded the Subscriber_ID, Usage_Data, and Join_Date columns from our sample dataset.

 

Using usecols with Column Names

You can use the column names to select specific columns from the CSV file by specifying them as an array to usecols parameter:

# Loading specific columns by their names
df_named = pd.read_csv(StringIO(data), usecols=['Subscriber_ID', 'Monthly_Cost'])
print(df_named)

Output:

   Subscriber_ID Monthly_Cost
0           1001         $30
1           1002         $50
2           1003         $20

 

Using usecols with Callable Functions

An incredibly useful feature of the usecols parameter is its compatibility with callable functions.

Suppose we want to load only columns that have the term ‘Date’ or ‘Cost’ in their names:

def filter_columns(col_name):
    return 'Date' in col_name or 'Cost' in col_name

# Using callable function with usecols
df_callable = pd.read_csv(StringIO(data), usecols=filter_columns)
print(df_callable)

Output:

  Monthly_Cost    Join_Date
0         $30  2021-05-01
1         $50  2021-03-12
2         $20  2021-08-15

The filter_columns function checks each column name for the presence of the terms ‘Date’ or ‘Cost’.

 

Selecting Columns by Data Type

You can use callable functions with usecols to filter columns based on their data type.

Let’s say we want to load only numeric columns only:

def filter_by_dtype(col_name):
    dtype_df = pd.read_csv(StringIO(data), nrows=1)
    return pd.api.types.is_numeric_dtype(dtype_df[col_name])

# Using callable function with usecols to select only numeric columns
df_dtype = pd.read_csv(StringIO(data), usecols=filter_by_dtype)
print(df_dtype)

Output:

   Subscriber_ID
0           1001
1           1002
2           1003

 

Excluding Columns

You can use the usecols parameter combined with callable functions to exclude specific columns.

Let’s say we want to exclude any column that has the word ‘Usage’ in its name:

def exclude_columns(col_name):
    return 'Usage' not in col_name

# Using callable function with usecols to exclude specific columns
df_exclude = pd.read_csv(StringIO(data), usecols=exclude_columns)
print(df_exclude)

Output:

   Subscriber_ID Plan_Type Monthly_Cost    Join_Date
0           1001   Prepaid         $30  2021-05-01
1           1002  Postpaid         $50  2021-03-12
2           1003   Prepaid         $20  2021-08-15

 

Combining Selection Criteria

Let’s say we want to load numeric columns and any column with the word ‘Plan’ in its name:

def combined_criteria(col_name):
    # Determine datatype of each column using a smaller dataframe
    dtype_df = pd.read_csv(StringIO(data), nrows=1)
    is_numeric = pd.api.types.is_numeric_dtype(dtype_df[col_name])
    return is_numeric or 'Plan' in col_name

# Using callable function with usecols for combined criteria
df_combined = pd.read_csv(StringIO(data), usecols=combined_criteria)
print(df_combined)

Output:

   Subscriber_ID Plan_Type
0           1001   Prepaid
1           1002  Postpaid
2           1003   Prepaid

The combined_criteria function evaluates both the datatype and the presence of the term ‘Plan’ in the column name.

 

Memory Usage Benefits of Using usecols

First, we’ll load the entire dataset and then measure the memory usage. Next, we’ll load only a subset using usecols and compare the memory footprints.

Loading the Entire Dataset

df_full = pd.read_csv(StringIO(data))
memory_full = df_full.memory_usage(deep=True).sum()

Output:

914

Loading Subset of Columns

df_subset = pd.read_csv(StringIO(data), usecols=['Subscriber_ID', 'Plan_Type'])
memory_subset = df_subset.memory_usage(deep=True).sum()

Output:

345

Comparing the two memory usage figures:

reduction_percentage = ((memory_full - memory_subset) / memory_full) * 100

Output:

Memory reduction: 62.65%

By loading only a subset of columns, we’ve reduced the memory footprint by approximately 62%.

For larger datasets, this difference can translate to saving gigabytes of memory, making your data processing tasks faster and more efficient.

 

Common Errors When Using usecols

Using the usecols parameter can sometimes lead to errors, let’s explore some of the most common errors and their causes.

Specifying a non-existent column name

When you provide a column name that isn’t present in the CSV file, pandas will throw an error.

# Trying to load a non-existent column
try:
    df_error = pd.read_csv(StringIO(data), usecols=['NonExistentColumn'])
except Exception as e:
    error_message = str(e)

print(error_message)

Output:

"Usecols do not match columns, columns expected but not found: ['NonExistentColumn']"

Solution:

Always ensure that the column names provided in usecols exist in your dataset.

Mixing column indexes and names

Combining column indexes with column names in the usecols parameter will result in an error.

# Mixing column indexes and names
try:
    df_mixed = pd.read_csv(StringIO(data), usecols=[0, 'Plan_Type'])
except Exception as e:
    mixed_error = str(e)
print(mixed_error)

Output:

'usecols' must either be list-like of all strings, all unicode, all integers or a callable.

Solution:

Either use column names or indexes, but not both together.

Leave a Reply

Your email address will not be published. Required fields are marked *