Select columns using Pandas read_csv usecols parameter
The usecols
parameter in pandas.read_csv()
method allows you to selectively load only the columns of interest, reducing memory usage and speeding up the loading process.
Let’s see how we can use usecols
parameter to select specific columns when importing data from a CSV file.
Using usecols with Column Indexes
You can use the column indexes with the usecols
parameter to select specific columns:
import pandas as pd data = """ Subscriber_ID,Plan_Type,Usage_Data,Monthly_Cost,Join_Date 1001,Prepaid,2.5GB,$30,2021-05-01 1002,Postpaid,Unlimited,$50,2021-03-12 1003,Prepaid,1GB,$20,2021-08-15 """ # Using StringIO to simulate reading from a csv file from io import StringIO df = pd.read_csv(StringIO(data), usecols=[0, 2, 4]) print(df)
Output:
Subscriber_ID Usage_Data Join_Date 0 1001 2.5GB 2021-05-01 1 1002 Unlimited 2021-03-12 2 1003 1GB 2021-08-15
By specifying the column indexes [0, 2, 4], we’ve selectively loaded the Subscriber_ID
, Usage_Data
, and Join_Date
columns from our sample dataset.
Using usecols with Column Names
You can use the column names to select specific columns from the CSV file by specifying them as an array to usecols
parameter:
# Loading specific columns by their names df_named = pd.read_csv(StringIO(data), usecols=['Subscriber_ID', 'Monthly_Cost']) print(df_named)
Output:
Subscriber_ID Monthly_Cost 0 1001 $30 1 1002 $50 2 1003 $20
Using usecols with Callable Functions
An incredibly useful feature of the usecols
parameter is its compatibility with callable functions.
Suppose we want to load only columns that have the term ‘Date’ or ‘Cost’ in their names:
def filter_columns(col_name): return 'Date' in col_name or 'Cost' in col_name # Using callable function with usecols df_callable = pd.read_csv(StringIO(data), usecols=filter_columns) print(df_callable)
Output:
Monthly_Cost Join_Date 0 $30 2021-05-01 1 $50 2021-03-12 2 $20 2021-08-15
The filter_columns
function checks each column name for the presence of the terms ‘Date’ or ‘Cost’.
Selecting Columns by Data Type
You can use callable functions with usecols
to filter columns based on their data type.
Let’s say we want to load only numeric columns only:
def filter_by_dtype(col_name): dtype_df = pd.read_csv(StringIO(data), nrows=1) return pd.api.types.is_numeric_dtype(dtype_df[col_name]) # Using callable function with usecols to select only numeric columns df_dtype = pd.read_csv(StringIO(data), usecols=filter_by_dtype) print(df_dtype)
Output:
Subscriber_ID 0 1001 1 1002 2 1003
Excluding Columns
You can use the usecols
parameter combined with callable functions to exclude specific columns.
Let’s say we want to exclude any column that has the word ‘Usage’ in its name:
def exclude_columns(col_name): return 'Usage' not in col_name # Using callable function with usecols to exclude specific columns df_exclude = pd.read_csv(StringIO(data), usecols=exclude_columns) print(df_exclude)
Output:
Subscriber_ID Plan_Type Monthly_Cost Join_Date 0 1001 Prepaid $30 2021-05-01 1 1002 Postpaid $50 2021-03-12 2 1003 Prepaid $20 2021-08-15
Combining Selection Criteria
Let’s say we want to load numeric columns and any column with the word ‘Plan’ in its name:
def combined_criteria(col_name): # Determine datatype of each column using a smaller dataframe dtype_df = pd.read_csv(StringIO(data), nrows=1) is_numeric = pd.api.types.is_numeric_dtype(dtype_df[col_name]) return is_numeric or 'Plan' in col_name # Using callable function with usecols for combined criteria df_combined = pd.read_csv(StringIO(data), usecols=combined_criteria) print(df_combined)
Output:
Subscriber_ID Plan_Type 0 1001 Prepaid 1 1002 Postpaid 2 1003 Prepaid
The combined_criteria
function evaluates both the datatype and the presence of the term ‘Plan’ in the column name.
Memory Usage Benefits of Using usecols
First, we’ll load the entire dataset and then measure the memory usage. Next, we’ll load only a subset using usecols
and compare the memory footprints.
Loading the Entire Dataset
df_full = pd.read_csv(StringIO(data)) memory_full = df_full.memory_usage(deep=True).sum()
Output:
914
Loading Subset of Columns
df_subset = pd.read_csv(StringIO(data), usecols=['Subscriber_ID', 'Plan_Type']) memory_subset = df_subset.memory_usage(deep=True).sum()
Output:
345
Comparing the two memory usage figures:
reduction_percentage = ((memory_full - memory_subset) / memory_full) * 100
Output:
Memory reduction: 62.65%
By loading only a subset of columns, we’ve reduced the memory footprint by approximately 62%.
For larger datasets, this difference can translate to saving gigabytes of memory, making your data processing tasks faster and more efficient.
Common Errors When Using usecols
Using the usecols
parameter can sometimes lead to errors, let’s explore some of the most common errors and their causes.
Specifying a non-existent column name
When you provide a column name that isn’t present in the CSV file, pandas
will throw an error.
# Trying to load a non-existent column try: df_error = pd.read_csv(StringIO(data), usecols=['NonExistentColumn']) except Exception as e: error_message = str(e) print(error_message)
Output:
"Usecols do not match columns, columns expected but not found: ['NonExistentColumn']"
Solution:
Always ensure that the column names provided in usecols
exist in your dataset.
Mixing column indexes and names
Combining column indexes with column names in the usecols
parameter will result in an error.
# Mixing column indexes and names try: df_mixed = pd.read_csv(StringIO(data), usecols=[0, 'Plan_Type']) except Exception as e: mixed_error = str(e) print(mixed_error)
Output:
'usecols' must either be list-like of all strings, all unicode, all integers or a callable.
Solution:
Either use column names or indexes, but not both together.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.