# Pandas where() Method: Filtering with Conditions

The `where` method in Pandas allows you to filter a DataFrame or Series based on conditions, akin to SQL's WHERE clause.

Have you ever found yourself needing to replace certain values in a DataFrame based on a specific condition, or perhaps wanting to mask data that doesn’t meet certain criteria?

The `where` method handles exactly these transformation tasks.

**Table of Contents**

- Pandas where Syntax and Parameters
- Filter DataFrame using where method
- Return Type of the where Method
- Creating Boolean Masks with Conditions
- Using equal and not equal
- Combining Multiple Conditions
- Using Functions as Conditions
- In-place Modification
- Handling NaN Values
- Performance of where vs. query or boolean indexing
- Nested where Calls
- Resource

## Pandas `where` Syntax and Parameters

The `where` method syntax looks like this:

```python
DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None)
```

Let’s break down these parameters:

- `cond`: the condition to test. Where it is satisfied, the original value is retained; where it is not, the value is replaced by the one specified in the `other` parameter.
- `other`: the values to use where the condition is false. By default, it is NaN.
- `inplace`: whether to modify the calling object directly or to return a new one. Default is `False`.
- `axis`: an optional parameter for alignment purposes. Can be either the index (0) or the columns (1).
- `level`: for DataFrames with a multi-level index, the level at which the operation should be applied.
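To see `cond` and `other` working together before moving on, here's a minimal sketch on a small Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Where the condition holds, the original value is kept;
# elsewhere it becomes NaN by default...
print(s.where(s > 2))

# ...or whatever `other` specifies.
print(s.where(s > 2, other=0))
```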

## Filter DataFrame using the `where` method

Let's perform basic filtering based on a scalar condition using the `where` method:

```python
import pandas as pd

data = {'Scores': [85, 90, 78, 88, 76, 95, 89]}
df = pd.DataFrame(data)
high_scores = df.where(df['Scores'] > 80)
print(high_scores)
```

**Output:**

```
   Scores
0    85.0
1    90.0
2     NaN
3    88.0
4     NaN
5    95.0
6    89.0
```

From the output, you can see that the scores less than or equal to 80 are replaced by NaN.

In a coming section, we’ll learn how to handle these NaN values.

Only scores above 80 are retained in the `high_scores` DataFrame.

## Return Type of the `where` Method

When the `where` method is applied to a DataFrame, it returns a DataFrame; when used on a Series, it returns a Series.

Let’s explore this concept with some examples:

**Working with a DataFrame**

```python
import pandas as pd

data = {
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
young_employees = df.where(df['Age'] < 35)
print(young_employees)
```

**Output:**

```
    Age   Salary
0  25.0  50000.0
1  30.0  60000.0
2   NaN      NaN
3   NaN      NaN
```

Notice how the entire row is masked (replaced by NaNs) if the condition isn’t met, even if only one column’s condition was specified.
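If you want cell-level rather than row-level masking, you can pass a condition with the same shape as the DataFrame, so each cell is evaluated independently. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})

# An elementwise condition (same shape as df) masks individual
# cells instead of entire rows:
per_cell = df.where(df < 60000)
print(per_cell)
```

Here every Age value passes the test, so only the Salary cells of 60000 and above are replaced by NaN.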

**Working with a Series**

If we apply the `where` method to a single column (Series) of the DataFrame:

```python
young_ages = df['Age'].where(df['Age'] < 35)
print(young_ages)
```

**Output:**

```
0    25.0
1    30.0
2     NaN
3     NaN
Name: Age, dtype: float64
```

Here, the output is a Series, not a DataFrame, because we applied the method to a single column.
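Worth knowing alongside `where` is its logical inverse, `mask`, which replaces values where the condition is `True` and keeps them where it is `False`:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})

# mask() is the inverse of where(): it hides the matching values
older_ages = df['Age'].mask(df['Age'] < 35)
print(older_ages)  # only 35.0 and 40.0 remain
```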

## Creating Boolean Masks with Conditions

The foundation of the `where` method's functionality lies in creating Boolean masks based on specified conditions.

These masks are essentially arrays or Series of `True` and `False` values, representing whether each element satisfies the given condition.

Given a DataFrame:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 28, 30, 23]
}
df = pd.DataFrame(data)
```

To create a Boolean mask for ages less than 28:

```python
age_mask = df['Age'] < 28
print(age_mask)
```

**Output:**

```
0     True
1    False
2    False
3     True
Name: Age, dtype: bool
```

This mask indicates that the first and last entries satisfy the condition.
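Because a Boolean mask is just a Series of `True`/`False` values, you can also summarize it directly; `True` counts as 1, so `sum()` gives the number of matching rows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 28, 30, 23]})
age_mask = df['Age'] < 28

print(age_mask.sum())   # 2 rows satisfy the condition
print(age_mask.mean())  # 0.5 — the fraction of matching rows
```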

**Applying a Boolean Mask using where**

The mask can then be used directly with the `where` method:

```python
younger_ages = df['Age'].where(age_mask)
print(younger_ages)
```

**Output:**

```
0    25.0
1     NaN
2     NaN
3    23.0
Name: Age, dtype: float64
```

## Using equal and not equal

Comparison operators allow us to create conditions to filter our data as needed.

Let’s see how these operators function within the realm of Pandas.

For a sample DataFrame:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89]
}
df = pd.DataFrame(data)
```

**Using == (Equality):**

To filter only the grades that are exactly 88:

```python
exact_grade = df['Grade'].where(df['Grade'] == 88)
print(exact_grade)
```

**Output:**

```
0     NaN
1     NaN
2     NaN
3    88.0
4     NaN
5     NaN
6     NaN
Name: Grade, dtype: float64
```

**Inequality and Not Equal**

The `!=` (not equal) operator is crucial when you want to exclude specific values:

```python
# All grades that are not 88
not_eighty_eight = df['Grade'].where(df['Grade'] != 88)
print(not_eighty_eight)
```

**Output:**

```
0    85.0
1    90.0
2    78.0
3     NaN
4    76.0
5    95.0
6    89.0
Name: Grade, dtype: float64
```
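When you need to exclude (or keep) several values at once, stringing together many `==`/`!=` comparisons gets verbose; `isin` builds the same kind of Boolean mask from a list. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

# Keep only grades that are NOT 88 or 90
not_those = df['Grade'].where(~df['Grade'].isin([88, 90]))
print(not_those)
```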

## Combining Multiple Conditions

By chaining multiple conditions with the `&`, `|`, and `~` operators, you can define more sophisticated data filters in Pandas.

For this, let’s continue with our grades example:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89],
    'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass', 'Pass']
}
df = pd.DataFrame(data)
```

**Using & (AND) Operator**

To filter grades greater than 80 AND with a status of ‘Pass’:

```python
high_passing_grades = df.where((df['Grade'] > 80) & (df['Status'] == 'Pass'))
print(high_passing_grades)
```

**Output:**

```
   Grade Status
0   85.0   Pass
1   90.0   Pass
2    NaN    NaN
3   88.0   Pass
4    NaN    NaN
5   95.0   Pass
6   89.0   Pass
```

**Using | (OR) Operator**

To filter grades greater than 90 OR with a status of ‘Fail’:

```python
high_or_fail = df.where((df['Grade'] > 90) | (df['Status'] == 'Fail'))
print(high_or_fail)
```

**Output:**

```
   Grade Status
0    NaN    NaN
1    NaN    NaN
2   78.0   Fail
3    NaN    NaN
4   76.0   Fail
5   95.0   Pass
6    NaN    NaN
```

**Using ~ (NOT) Operator**

To filter grades NOT greater than 90:

```python
not_high_grades = df.where(~(df['Grade'] > 90))
print(not_high_grades)
```

**Output:**

```
   Grade Status
0   85.0   Pass
1   90.0   Pass
2   78.0   Fail
3   88.0   Pass
4   76.0   Fail
5    NaN    NaN
6   89.0   Pass
```

Remember that when combining conditions, it’s crucial to wrap each condition in parentheses.
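When a combined condition grows long, one readable option is to assign each comparison to a named variable first; each piece is already a Boolean Series, so the final expression stays short. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89],
                   'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass', 'Pass']})

# Name each condition for readability
high = df['Grade'] > 80
passed = df['Status'] == 'Pass'

result = df.where(high & passed)
print(result)
```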

## Using Functions as Conditions

A powerful feature of the `where` method is the ability to use callables (such as functions) as conditions for filtering. Here's how to use a callable within the `where` method.

Let’s take the following DataFrame as an example:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89]
}
df = pd.DataFrame(data)
```

Let’s say we want to retain grades that are odd numbers:

```python
def is_odd(num):
    return num % 2 == 1

odd_grades = df['Grade'].where(is_odd)
print(odd_grades)
```

**Output:**

```
0    85.0
1     NaN
2     NaN
3     NaN
4     NaN
5    95.0
6    89.0
Name: Grade, dtype: float64
```
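For one-off conditions, an inline `lambda` works the same way as a named function; the callable receives the Series itself and should return a Boolean result of the same shape:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

# Same odd-number filter as above, written inline
odd_grades = df['Grade'].where(lambda s: s % 2 == 1)
print(odd_grades)
```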

## In-place Modification

By default, the `where` method doesn't alter the original DataFrame or Series. Instead, it returns a new object with the modifications.

By setting the `inplace` parameter to `True`, the original object is modified instead:

```python
df.where(df['Grade'] > 85, inplace=True)
```

Now, `df` directly reflects the changes, with grades above 85 retained and the others replaced by NaN. There's no new object returned; all modifications are made on `df`.

**Note:** When working with massive DataFrames, modifying in place can be more memory-efficient since it doesn't create a new object.
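One point worth checking when using `inplace=True`: the call returns `None`, so don't assign its result back to a variable. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

result = df.where(df['Grade'] > 85, inplace=True)
print(result)                    # None — nothing is returned
print(df['Grade'].isna().sum())  # the original df was modified instead
```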

## Handling NaN Values

As we saw in the previous examples, entries that don't meet the specified condition are replaced with `NaN` values.

Let's say you want entries not meeting your condition to be replaced by a static value, like -1, instead of NaN:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89]
}
df = pd.DataFrame(data)
graded_df = df['Grade'].where(df['Grade'] > 85, other=-1)
print(graded_df)
```

**Output:**

```
0    -1
1    90
2    -1
3    88
4    -1
5    95
6    89
Name: Grade, dtype: int64
```

**Using Different Values for Different Entries**

What if you wish to replace each non-conforming entry with a different value? The `other` parameter can also accept arrays or Series:

```python
replacement_values = [10, 20, 30, 40, 50, 60, 70]
graded_df = df['Grade'].where(df['Grade'] > 85, other=replacement_values)
print(graded_df)
```

**Output:**

```
0    10
1    90
2    30
3    88
4    50
5    95
6    89
Name: Grade, dtype: int64
```

### Using Functions for Dynamic Replacement

For more intricate replacement logic, you can also use functions:

```python
def custom_replacement(grade):
    return grade / 2

graded_df = df['Grade'].where(df['Grade'] > 85, other=custom_replacement)
print(graded_df)
```

**Output:**

```
0    42.5
1    90.0
2    39.0
3    88.0
4    38.0
5    95.0
6    89.0
Name: Grade, dtype: float64
```

The `other` parameter in the `where` method provides an added layer of customization.
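If you'd rather remove the NaN entries entirely than substitute values for them, you can follow `where` with `dropna`, or patch them afterwards with `fillna`. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

# Drop the masked entries instead of keeping NaN placeholders
passing = df['Grade'].where(df['Grade'] > 85).dropna()
print(passing)

# Or fill them in after the fact
patched = df['Grade'].where(df['Grade'] > 85).fillna(0)
print(patched)
```

Note that `dropna` changes the length of the result, while `fillna` preserves the original shape.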

## Performance of where vs. query or boolean indexing

When filtering data, three popular choices arise: the `where` method, the `query` method, and direct boolean indexing.

Let’s measure the performance of these methods.

First, we need to create a sample DataFrame:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'A': np.random.randint(1, 1000, 1000000),
    'B': np.random.randint(1, 1000, 1000000)
})
```

**Timing the where Method**

```python
def using_where(df):
    return df.where(df['A'] > 50)
```

**Timing Boolean Indexing**

```python
def using_boolean_indexing(df):
    return df[df['A'] > 50]
```

**Timing the query Method**

```python
def using_query(df):
    return df.query('A > 50')
```

Now, let's time these functions using `timeit`:

```python
import timeit

n_iter = 10
where_time = timeit.timeit('using_where(df)', globals=globals(), number=n_iter) / n_iter
bool_idx_time = timeit.timeit('using_boolean_indexing(df)', globals=globals(), number=n_iter) / n_iter
query_time = timeit.timeit('using_query(df)', globals=globals(), number=n_iter) / n_iter

print(f"Average time using 'where' method over {n_iter} iterations: {where_time:.6f} seconds")
print(f"Average time using boolean indexing over {n_iter} iterations: {bool_idx_time:.6f} seconds")
print(f"Average time using 'query' method over {n_iter} iterations: {query_time:.6f} seconds")
```

**Output:**

```
Average time using 'where' method over 10 iterations: 0.061809 seconds
Average time using boolean indexing over 10 iterations: 0.025589 seconds
Average time using 'query' method over 10 iterations: 0.029168 seconds
```

From these results, boolean indexing is the fastest of the three on this dataset, with `query` close behind and `where` noticeably slower.
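Part of the gap comes from what each approach returns: `where` produces a result with the same shape as the input (with NaNs filling the gaps), while boolean indexing and `query` return only the matching rows. A quick sketch on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 60, 30, 90]})

print(df.where(df['A'] > 50).shape)  # (4, 1) — same shape, NaNs fill the gaps
print(df[df['A'] > 50].shape)        # (2, 1) — only the matching rows
```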

## Nested `where` Calls

If you want to apply multiple conditions sequentially, one way to achieve this is by chaining or nesting `where` calls.

Consider a simple dataset:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89],
    'Subject': ['Math', 'English', 'History', 'Math', 'English', 'History', 'Math']
}
df = pd.DataFrame(data)
```

Imagine you want to filter data for entries with a grade above 85 and the subject being ‘Math’.

You can chain `where` calls:

```python
filtered_df = df.where(df['Grade'] > 85).where(df['Subject'] == 'Math')
print(filtered_df)
```

**Output:**

```
   Grade Subject
0    NaN     NaN
1    NaN     NaN
2    NaN     NaN
3   88.0    Math
4    NaN     NaN
5    NaN     NaN
6   89.0    Math
```

**Nested where with the other Parameter**

You can combine this with the `other` parameter for intricate replacements:

```python
replaced_df = df['Grade'].where(df['Grade'] > 85, other=-1).where(df['Subject'] == 'Math', other=-2)
print(replaced_df)
```

**Output:**

```
0    -1
1    -2
2    -2
3    88
4    -2
5    -2
6    89
Name: Grade, dtype: int64
```

Here, grades of 85 or below are first replaced with -1, and any row whose subject isn't 'Math' is then further replaced with -2.
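When chained `where` calls with different `other` values get hard to follow, the same result can often be expressed with NumPy's `np.select`, which picks the first matching condition per element. Note the conditions are listed in the reverse order of the chained calls, because in a chain the *last* `where` wins. A sketch reproducing the output above:

```python
import numpy as np
import pandas as pd

data = {'Grade': [85, 90, 78, 88, 76, 95, 89],
        'Subject': ['Math', 'English', 'History', 'Math', 'English', 'History', 'Math']}
df = pd.DataFrame(data)

# First matching condition wins, so the non-Math test comes first
conditions = [df['Subject'] != 'Math', df['Grade'] <= 85]
choices = [-2, -1]
replaced = np.select(conditions, choices, default=df['Grade'])
print(replaced)  # [-1 -2 -2 88 -2 -2 89]
```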

## Resource

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
