Select columns using Pandas read_csv usecols parameter

The usecols parameter in pandas.read_csv() method allows you to selectively load only the columns of interest, reducing memory usage and speeding up the loading process.

Let’s see how we can use usecols parameter to select specific columns when importing data from a CSV file.

Table of Contents hide

1 Using usecols with Column Indexes
2 Using usecols with Column Names
3 Using usecols with Callable Functions
4 Selecting Columns by Data Type
5 Excluding Columns
6 Combining Selection Criteria
7 Memory Usage Benefits of Using usecols
- 7.1 Loading the Entire Dataset
- 7.2 Loading Subset of Columns
8 Common Errors When Using usecols
- 8.1 Specifying a non-existent column name
- 8.2 Mixing column indexes and names

Using usecols with Column Indexes

You can use the column indexes with the usecols parameter to select specific columns:

import pandas as pd
data = """
Subscriber_ID,Plan_Type,Usage_Data,Monthly_Cost,Join_Date
1001,Prepaid,2.5GB,$30,2021-05-01
1002,Postpaid,Unlimited,$50,2021-03-12
1003,Prepaid,1GB,$20,2021-08-15
"""

# Using StringIO to simulate reading from a csv file
from io import StringIO
df = pd.read_csv(StringIO(data), usecols=[0, 2, 4])
print(df)

Output:

   Subscriber_ID Usage_Data    Join_Date
0           1001      2.5GB  2021-05-01
1           1002  Unlimited  2021-03-12
2           1003       1GB  2021-08-15

By specifying the column indexes [0, 2, 4], we’ve selectively loaded the Subscriber_ID, Usage_Data, and Join_Date columns from our sample dataset.

Using usecols with Column Names

You can use the column names to select specific columns from the CSV file by specifying them as an array to usecols parameter:

# Loading specific columns by their names
df_named = pd.read_csv(StringIO(data), usecols=['Subscriber_ID', 'Monthly_Cost'])
print(df_named)

Output:

   Subscriber_ID Monthly_Cost
0           1001         $30
1           1002         $50
2           1003         $20

Using usecols with Callable Functions

An incredibly useful feature of the usecols parameter is its compatibility with callable functions.

Suppose we want to load only columns that have the term ‘Date’ or ‘Cost’ in their names:

def filter_columns(col_name):
    return 'Date' in col_name or 'Cost' in col_name

# Using callable function with usecols
df_callable = pd.read_csv(StringIO(data), usecols=filter_columns)
print(df_callable)

Output:

  Monthly_Cost    Join_Date
0         $30  2021-05-01
1         $50  2021-03-12
2         $20  2021-08-15

The filter_columns function checks each column name for the presence of the terms ‘Date’ or ‘Cost’.

Selecting Columns by Data Type

You can use callable functions with usecols to filter columns based on their data type.

Let’s say we want to load only numeric columns only:

def filter_by_dtype(col_name):
    dtype_df = pd.read_csv(StringIO(data), nrows=1)
    return pd.api.types.is_numeric_dtype(dtype_df[col_name])

# Using callable function with usecols to select only numeric columns
df_dtype = pd.read_csv(StringIO(data), usecols=filter_by_dtype)
print(df_dtype)

Output:

   Subscriber_ID
0           1001
1           1002
2           1003

Excluding Columns

You can use the usecols parameter combined with callable functions to exclude specific columns.

Let’s say we want to exclude any column that has the word ‘Usage’ in its name:

def exclude_columns(col_name):
    return 'Usage' not in col_name

# Using callable function with usecols to exclude specific columns
df_exclude = pd.read_csv(StringIO(data), usecols=exclude_columns)
print(df_exclude)

Output:

   Subscriber_ID Plan_Type Monthly_Cost    Join_Date
0           1001   Prepaid         $30  2021-05-01
1           1002  Postpaid         $50  2021-03-12
2           1003   Prepaid         $20  2021-08-15

Combining Selection Criteria

Let’s say we want to load numeric columns and any column with the word ‘Plan’ in its name:

def combined_criteria(col_name):
    # Determine datatype of each column using a smaller dataframe
    dtype_df = pd.read_csv(StringIO(data), nrows=1)
    is_numeric = pd.api.types.is_numeric_dtype(dtype_df[col_name])
    return is_numeric or 'Plan' in col_name

# Using callable function with usecols for combined criteria
df_combined = pd.read_csv(StringIO(data), usecols=combined_criteria)
print(df_combined)

Output:

   Subscriber_ID Plan_Type
0           1001   Prepaid
1           1002  Postpaid
2           1003   Prepaid

The combined_criteria function evaluates both the datatype and the presence of the term ‘Plan’ in the column name.

Memory Usage Benefits of Using usecols

First, we’ll load the entire dataset and then measure the memory usage. Next, we’ll load only a subset using usecols and compare the memory footprints.

Loading the Entire Dataset

df_full = pd.read_csv(StringIO(data))
memory_full = df_full.memory_usage(deep=True).sum()

Output:

Loading Subset of Columns

df_subset = pd.read_csv(StringIO(data), usecols=['Subscriber_ID', 'Plan_Type'])
memory_subset = df_subset.memory_usage(deep=True).sum()

Output:

Comparing the two memory usage figures:

reduction_percentage = ((memory_full - memory_subset) / memory_full) * 100

Output:

Memory reduction: 62.65%

By loading only a subset of columns, we’ve reduced the memory footprint by approximately 62%.

For larger datasets, this difference can translate to saving gigabytes of memory, making your data processing tasks faster and more efficient.

Common Errors When Using usecols

Using the usecols parameter can sometimes lead to errors, let’s explore some of the most common errors and their causes.

Specifying a non-existent column name

When you provide a column name that isn’t present in the CSV file, pandas will throw an error.

# Trying to load a non-existent column
try:
    df_error = pd.read_csv(StringIO(data), usecols=['NonExistentColumn'])
except Exception as e:
    error_message = str(e)

print(error_message)

Output:

"Usecols do not match columns, columns expected but not found: ['NonExistentColumn']"

Solution:

Always ensure that the column names provided in usecols exist in your dataset.

Mixing column indexes and names

Combining column indexes with column names in the usecols parameter will result in an error.

# Mixing column indexes and names
try:
    df_mixed = pd.read_csv(StringIO(data), usecols=[0, 'Plan_Type'])
except Exception as e:
    mixed_error = str(e)
print(mixed_error)

Output:

'usecols' must either be list-like of all strings, all unicode, all integers or a callable.

Solution:

Either use column names or indexes, but not both together.

Mokhtar Ebrahim

Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.

Using usecols with Column Indexes

Using usecols with Column Names

Using usecols with Callable Functions

Selecting Columns by Data Type

Excluding Columns

Combining Selection Criteria

Memory Usage Benefits of Using usecols

Loading the Entire Dataset

Loading Subset of Columns

Common Errors When Using usecols

Specifying a non-existent column name

Mixing column indexes and names

Related posts

Leave a Reply Cancel reply