Pandas where() method: Filtering with Conditions
The where
method in Pandas allows you to filter DataFrame or Series based on conditions, akin to SQL’s WHERE clause.
Have you ever found yourself needing to replace certain values in a DataFrame based on a specific condition, or perhaps wanting to mask data that doesn’t meet certain criteria?
The where
method is used to do such transformation tasks.
- 1 Pandas where Syntax and Parameters
- 2 Filter DataFrame using where method
- 3 Return Type of the where Method
- 4 Creating Boolean Masks with Conditions
- 5 Using equal and not equal
- 6 Combining Multiple Conditions
- 7 Using Functions as Conditions
- 8 In-place Modification
- 9 Handling NaN Values
- 10 Performance of where vs. query or boolean indexing
- 11 Nested where Calls
- 12 Resource
Pandas where
Syntax and Parameters
The where
method syntax looks like this:
DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None)
Let’s break down these parameters:
cond
: This is a condition that, when satisfied, retains the original value. If not satisfied, the value will be replaced by the one specified in theother
parameter.other
: Values to replace when the condition is false. By default, it is NaN.inplace
: Whether to modify the calling object directly or to return a new one. Default is False.axis
: An optional parameter for alignment purposes. Can be either the index (0) or the columns (1).level
: For DataFrames with multi-level index, level where the operation should be applied.
Filter DataFrame using where
method
Let’s perform basic filtering based on scalar conditions using where
method:
import pandas as pd data = {'Scores': [85, 90, 78, 88, 76, 95, 89]} df = pd.DataFrame(data) high_scores = df.where(df['Scores'] > 80) print(high_scores)
Output:
Scores 0 85.0 1 90.0 2 NaN 3 88.0 4 NaN 5 95.0 6 89.0
From the output, you can see that the scores less than or equal to 80 are replaced by NaN.
In a coming section, we’ll learn how to handle these NaN values.
Only scores above 80 are retained in the high_scores
DataFrame.
Return Type of the where
Method
When where
method is applied on a DataFrame, it returns a DataFrame; when used on a Series, it returns a Series.
Let’s explore this concept with some examples:
Working with a DataFrame
import pandas as pd data = { 'Age': [25, 30, 35, 40], 'Salary': [50000, 60000, 70000, 80000] } df = pd.DataFrame(data) young_employees = df.where(df['Age'] < 35) print(young_employees)
Output:
Age Salary 0 25.0 50000.0 1 30.0 60000.0 2 NaN NaN 3 NaN NaN
Notice how the entire row is masked (replaced by NaNs) if the condition isn’t met, even if only one column’s condition was specified.
Working with a Series
If we apply the where
method to a single column (Series) of the DataFrame:
young_ages = df['Age'].where(df['Age'] < 35) print(young_ages)
Output:
0 25.0 1 30.0 2 NaN 3 NaN Name: Age, dtype: float64
Here, the output is a Series, not a DataFrame, because we applied the method to a single column.
Creating Boolean Masks with Conditions
The foundation of the where
method’s functionality lies in creating Boolean masks based on specified conditions.
These masks are essentially arrays or series of True
and False
values, representing whether each element satisfies the given condition.
Given a DataFrame:
import pandas as pd data = { 'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 28, 30, 23] } df = pd.DataFrame(data)
To create a Boolean mask for ages less than 28:
age_mask = df['Age'] < 28 print(age_mask)
Output:
0 True 1 False 2 False 3 True Name: Age, dtype: bool
This mask indicates that the first and last entries satisfy the condition.
Applying a Boolean Mask using where
The mask can then be used directly with the where
method:
younger_ages = df['Age'].where(age_mask) print(younger_ages)
Output:
0 25.0 1 NaN 2 NaN 3 23.0 Name: Age, dtype: float64
Using equal and not equal
Comparison operators allow us to create conditions to filter our data as needed.
Let’s see how these operators function within the realm of Pandas.
For a sample DataFrame:
import pandas as pd data = { 'Grade': [85, 90, 78, 88, 76, 95, 89] } df = pd.DataFrame(data)
Using ==
(Equality):
To filter only the grades that are exactly 88:
exact_grade = df['Grade'].where(df['Grade'] == 88) print(exact_grade)
Output:
0 NaN 1 NaN 2 NaN 3 88.0 4 NaN 5 NaN 6 NaN Name: Grade, dtype: float64
Inequality and Not Equal
The !=
(Not Equal) operator is crucial when you want to exclude specific values:
# All grades that are not 88 not_eighty_eight = df['Grade'].where(df['Grade'] != 88) print(not_eighty_eight)
Output:
0 85.0 1 90.0 2 78.0 3 NaN 4 76.0 5 95.0 6 89.0 Name: Grade, dtype: float64
Combining Multiple Conditions
By chaining multiple conditions using &
, |
, and ~
operators, you can define more sophisticated data filters in Pandas.
For this, let’s continue with our grades example:
import pandas as pd data = { 'Grade': [85, 90, 78, 88, 76, 95, 89], 'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass', 'Pass'] } df = pd.DataFrame(data)
Using & (AND) Operator
To filter grades greater than 80 AND with a status of ‘Pass’:
high_passing_grades = df.where((df['Grade'] > 80) & (df['Status'] == 'Pass')) print(high_passing_grades)
Output:
Grade Status 0 85.0 Pass 1 90.0 Pass 2 NaN NaN 3 88.0 Pass 4 NaN NaN 5 95.0 Pass 6 89.0 Pass
Using | (OR) Operator
To filter grades greater than 90 OR with a status of ‘Fail’:
high_or_fail = df.where((df['Grade'] > 90) | (df['Status'] == 'Fail')) print(high_or_fail)
Output:
Grade Status 0 NaN NaN 1 NaN NaN 2 78.0 Fail 3 NaN NaN 4 76.0 Fail 5 95.0 Pass 6 NaN NaN
Using ~ (NOT) Operator
To filter grades NOT greater than 90:
not_high_grades = df.where(~(df['Grade'] > 90)) print(not_high_grades)
Output:
Grade Status 0 85.0 Pass 1 90.0 Pass 2 78.0 Fail 3 88.0 Pass 4 76.0 Fail 5 NaN NaN 6 89.0 Pass
Remember that when combining conditions, it’s crucial to wrap each condition in parentheses.
Using Functions as Conditions
A powerful feature of the where
method is the ability to use callables (like functions) as conditions for filtering.
Here’s how to use callables within the where
method:
Let’s take the following DataFrame as an example:
import pandas as pd data = { 'Grade': [85, 90, 78, 88, 76, 95, 89] } df = pd.DataFrame(data)
Let’s say we want to retain grades that are odd numbers:
def is_odd(num): return num % 2 == 1 odd_grades = df['Grade'].where(is_odd) print(odd_grades)
Output:
0 85.0 1 NaN 2 NaN 3 NaN 4 NaN 5 95.0 6 89.0 Name: Grade, dtype: float64
In-place Modification
By default, the where
method doesn’t alter the original DataFrame or Series. Instead, it returns a new object with the modifications:
By setting the inplace
parameter to True
, the original object is modified:
df.where(df['Grade'] > 85, inplace=True)
Now, df
directly reflects the changes, with grades above 85 retained and others replaced by NaN. There’s no new object returned; all modifications are done on df
.
Note: When working with massive DataFrames, modifying in-place can be more memory-efficient since it doesn’t create a new object.
Handling NaN Values
As we saw in the previous examples, entries that don’t meet the specified condition get replaced with NaN
values.
Let’s say you want entries not meeting your condition to be replaced by a static value, like -1 instead of NaN.
import pandas as pd data = { 'Grade': [85, 90, 78, 88, 76, 95, 89] } df = pd.DataFrame(data) graded_df = df['Grade'].where(df['Grade'] > 85, other=-1)
Output:
0 -1 1 90 2 -1 3 88 4 -1 5 95 6 89 Name: Grade, dtype: int64
Using Different Values for Different Entries
What if you wish to replace each non-conforming entry with a different value? The other
parameter can handle arrays or Series:
replacement_values = [10, 20, 30, 40, 50, 60, 70] graded_df = df['Grade'].where(df['Grade'] > 85, other=replacement_values)
Output:
0 10 1 90 2 30 3 88 4 50 5 95 6 89 Name: Grade, dtype: int64
Using Functions for Dynamic Replacement
For more intricate replacement logic, you can also use functions:
def custom_replacement(grade): return grade / 2 graded_df = df['Grade'].where(df['Grade'] > 85, other=custom_replacement)
Output:
0 42.5 1 90.0 2 39.0 3 88.0 4 38.0 5 95.0 6 89.0 Name: Grade, dtype: float64
The other
parameter in the where
method provides an added layer of customization.
Performance of where vs. query or boolean indexing
When filtering data, three popular choices arise: the where
method, the query
method, and direct boolean indexing.
Let’s measure the performance of these methods.
First, we need to create a sample DataFrame:
import pandas as pd import numpy as np np.random.seed(42) df = pd.DataFrame({ 'A': np.random.randint(1, 1000, 1000000), 'B': np.random.randint(1, 1000, 1000000) })
Timing the where Method
def using_where(df): return df.where(df['A'] > 50)
Timing Boolean Indexing
def using_boolean_indexing(df): return df[df['A'] > 50]
Timing the query Method
def using_query(df): return df.query('A > 50')
Now, let’s time these functions using timeit:
import timeit n_repeat = 3 n_iter = 10 where_time = timeit.timeit('using_where(df)', globals=globals(), number=n_iter) / n_iter bool_idx_time = timeit.timeit('using_boolean_indexing(df)', globals=globals(), number=n_iter) / n_iter query_time = timeit.timeit('using_query(df)', globals=globals(), number=n_iter) / n_iter print(f"Average time using 'where' method over {n_iter} iterations: {where_time:.6f} seconds") print(f"Average time using boolean indexing over {n_iter} iterations: {bool_idx_time:.6f} seconds") print(f"Average time using 'query' method over {n_iter} iterations: {query_time:.6f} seconds")
Output:
Average time using 'where' method over 10 iterations: 0.061809 seconds Average time using boolean indexing over 10 iterations: 0.025589 seconds Average time using 'query' method over 10 iterations: 0.029168 seconds
From the above result, boolean indexing is definitely the faster method.
Nested where
Calls
If you want to apply multiple conditions sequentially. One way to achieve this is by chaining or nesting where
calls.
Consider a simple dataset:
import pandas as pd data = { 'Grade': [85, 90, 78, 88, 76, 95, 89], 'Subject': ['Math', 'English', 'History', 'Math', 'English', 'History', 'Math'] } df = pd.DataFrame(data)
Imagine you want to filter data for entries with a grade above 85 and the subject being ‘Math’.
You can chain where
calls:
filtered_df = df.where(df['Grade'] > 85).where(df['Subject'] == 'Math') print(filtered_df)
Output:
Grade Subject 0 NaN NaN 1 NaN NaN 2 NaN NaN 3 88.0 Math 4 NaN NaN 5 NaN NaN 6 89.0 Math
Nested where
with the other
Parameter
You can combine the other
parameter for intricate replacements:
replaced_df = df['Grade'].where(df['Grade'] > 85, other=-1).where(df['Subject'] == 'Math', other=-2) print(replaced_df)
Output:
0 -1 1 -2 2 -2 3 88 4 -2 5 -2 6 89 Name: Grade, dtype: int64
Here, grades below 85 are replaced with -1, and any row not being ‘Math’ is then further replaced with -2.
Resource
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.