# Pandas where() Method: Filtering with Conditions

The `where` method in Pandas allows you to filter a DataFrame or Series based on conditions, akin to SQL's WHERE clause.

Have you ever found yourself needing to replace certain values in a DataFrame based on a specific condition, or perhaps wanting to mask data that doesn’t meet certain criteria?

The `where` method handles exactly these transformation tasks.

**Table of Contents**

- Pandas where Syntax and Parameters
- Filter DataFrame using where method
- Return Type of the where Method
- Creating Boolean Masks with Conditions
- Using equal and not equal
- Combining Multiple Conditions
- Using Functions as Conditions
- In-place Modification
- Handling NaN Values
- Performance of where vs. query or boolean indexing
- Nested where Calls
- Resource

## Pandas `where` Syntax and Parameters

The `where` method syntax looks like this:

```python
DataFrame.where(cond, other=nan, inplace=False, axis=None, level=None)
```

Let’s break down these parameters:

- `cond`: the condition to test. Where it is satisfied, the original value is retained; where it is not, the value is replaced by the one specified in the `other` parameter.
- `other`: the values to use where the condition is false. By default, it is NaN.
- `inplace`: whether to modify the calling object directly or to return a new one. Default is `False`.
- `axis`: an optional parameter for alignment purposes. Can be either the index (0) or the columns (1).
- `level`: for DataFrames with a multi-level index, the level at which the operation should be applied.
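To see `cond` and `other` working together before moving on, here's a minimal sketch on a small Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# Where the condition holds, the original value is kept;
# elsewhere it becomes NaN by default...
print(s.where(s > 2))

# ...or whatever `other` specifies.
print(s.where(s > 2, other=0))
```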

## Filter DataFrame using the `where` method

Let's perform basic filtering based on a scalar condition using the `where` method:

```python
import pandas as pd

data = {'Scores': [85, 90, 78, 88, 76, 95, 89]}
df = pd.DataFrame(data)
high_scores = df.where(df['Scores'] > 80)
print(high_scores)
```

**Output:**

```
   Scores
0    85.0
1    90.0
2     NaN
3    88.0
4     NaN
5    95.0
6    89.0
```

From the output, you can see that the scores less than or equal to 80 are replaced by NaN.

In a coming section, we’ll learn how to handle these NaN values.

Only scores above 80 are retained in the `high_scores` DataFrame.

## Return Type of the `where` Method

When the `where` method is applied to a DataFrame, it returns a DataFrame; when used on a Series, it returns a Series.

Let’s explore this concept with some examples:

**Working with a DataFrame**

```python
import pandas as pd

data = {
    'Age': [25, 30, 35, 40],
    'Salary': [50000, 60000, 70000, 80000]
}
df = pd.DataFrame(data)
young_employees = df.where(df['Age'] < 35)
print(young_employees)
```

**Output:**

```
    Age   Salary
0  25.0  50000.0
1  30.0  60000.0
2   NaN      NaN
3   NaN      NaN
```

Notice how the entire row is masked (replaced by NaNs) if the condition isn’t met, even if only one column’s condition was specified.
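If you want cell-level rather than row-level masking, you can pass a condition with the same shape as the DataFrame, so each cell is evaluated independently. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})

# An elementwise condition (same shape as df) masks individual
# cells instead of entire rows:
per_cell = df.where(df < 60000)
print(per_cell)
```

Here every Age value passes the test, so only the Salary cells of 60000 and above are replaced by NaN.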

**Working with a Series**

If we apply the `where` method to a single column (Series) of the DataFrame:

```python
young_ages = df['Age'].where(df['Age'] < 35)
print(young_ages)
```

**Output:**

```
0    25.0
1    30.0
2     NaN
3     NaN
Name: Age, dtype: float64
```

Here, the output is a Series, not a DataFrame, because we applied the method to a single column.
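Worth knowing alongside `where` is its logical inverse, `mask`, which replaces values where the condition is `True` and keeps them where it is `False`:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40],
                   'Salary': [50000, 60000, 70000, 80000]})

# mask() is the inverse of where(): it hides the matching values
older_ages = df['Age'].mask(df['Age'] < 35)
print(older_ages)  # only 35.0 and 40.0 remain
```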

## Creating Boolean Masks with Conditions

The foundation of the `where` method's functionality lies in creating Boolean masks based on specified conditions.

These masks are essentially arrays or Series of `True` and `False` values, representing whether each element satisfies the given condition.

Given a DataFrame:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 28, 30, 23]
}
df = pd.DataFrame(data)
```

To create a Boolean mask for ages less than 28:

```python
age_mask = df['Age'] < 28
print(age_mask)
```

**Output:**

```
0     True
1    False
2    False
3     True
Name: Age, dtype: bool
```

This mask indicates that the first and last entries satisfy the condition.
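Because a Boolean mask is just a Series of `True`/`False` values, you can also summarize it directly; `True` counts as 1, so `sum()` gives the number of matching rows:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 28, 30, 23]})
age_mask = df['Age'] < 28

print(age_mask.sum())   # 2 rows satisfy the condition
print(age_mask.mean())  # 0.5 — the fraction of matching rows
```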

**Applying a Boolean Mask using where**

The mask can then be used directly with the `where` method:

```python
younger_ages = df['Age'].where(age_mask)
print(younger_ages)
```

**Output:**

```
0    25.0
1     NaN
2     NaN
3    23.0
Name: Age, dtype: float64
```

## Using equal and not equal

Comparison operators allow us to create conditions to filter our data as needed.

Let’s see how these operators function within the realm of Pandas.

For a sample DataFrame:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89]
}
df = pd.DataFrame(data)
```

**Using == (Equality):**

To filter only the grades that are exactly 88:

```python
exact_grade = df['Grade'].where(df['Grade'] == 88)
print(exact_grade)
```

**Output:**

```
0     NaN
1     NaN
2     NaN
3    88.0
4     NaN
5     NaN
6     NaN
Name: Grade, dtype: float64
```

**Inequality and Not Equal**

The `!=` (not equal) operator is crucial when you want to exclude specific values:

```python
# All grades that are not 88
not_eighty_eight = df['Grade'].where(df['Grade'] != 88)
print(not_eighty_eight)
```

**Output:**

```
0    85.0
1    90.0
2    78.0
3     NaN
4    76.0
5    95.0
6    89.0
Name: Grade, dtype: float64
```
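When you need to exclude (or keep) several values at once, stringing together many `==`/`!=` comparisons gets verbose; `isin` builds the same kind of Boolean mask from a list. A short sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

# Keep only grades that are NOT 88 or 90
not_those = df['Grade'].where(~df['Grade'].isin([88, 90]))
print(not_those)
```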

## Combining Multiple Conditions

By chaining multiple conditions with the `&`, `|`, and `~` operators, you can define more sophisticated data filters in Pandas.

For this, let’s continue with our grades example:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89],
    'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass', 'Pass']
}
df = pd.DataFrame(data)
```

**Using & (AND) Operator**

To filter grades greater than 80 AND with a status of ‘Pass’:

```python
high_passing_grades = df.where((df['Grade'] > 80) & (df['Status'] == 'Pass'))
print(high_passing_grades)
```

**Output:**

```
   Grade Status
0   85.0   Pass
1   90.0   Pass
2    NaN    NaN
3   88.0   Pass
4    NaN    NaN
5   95.0   Pass
6   89.0   Pass
```

**Using | (OR) Operator**

To filter grades greater than 90 OR with a status of ‘Fail’:

```python
high_or_fail = df.where((df['Grade'] > 90) | (df['Status'] == 'Fail'))
print(high_or_fail)
```

**Output:**

```
   Grade Status
0    NaN    NaN
1    NaN    NaN
2   78.0   Fail
3    NaN    NaN
4   76.0   Fail
5   95.0   Pass
6    NaN    NaN
```

**Using ~ (NOT) Operator**

To filter grades NOT greater than 90:

```python
not_high_grades = df.where(~(df['Grade'] > 90))
print(not_high_grades)
```

**Output:**

```
   Grade Status
0   85.0   Pass
1   90.0   Pass
2   78.0   Fail
3   88.0   Pass
4   76.0   Fail
5    NaN    NaN
6   89.0   Pass
```

Remember that when combining conditions, it’s crucial to wrap each condition in parentheses.
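When a combined condition grows long, one readable option is to assign each comparison to a named variable first; each piece is already a Boolean Series, so the final expression stays short. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89],
                   'Status': ['Pass', 'Pass', 'Fail', 'Pass', 'Fail', 'Pass', 'Pass']})

# Name each condition for readability
high = df['Grade'] > 80
passed = df['Status'] == 'Pass'

result = df.where(high & passed)
print(result)
```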

## Using Functions as Conditions

A powerful feature of the `where` method is the ability to use callables (such as functions) as conditions for filtering. Here's how to use a callable within the `where` method.

Let’s take the following DataFrame as an example:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89]
}
df = pd.DataFrame(data)
```

Let’s say we want to retain grades that are odd numbers:

```python
def is_odd(num):
    return num % 2 == 1

odd_grades = df['Grade'].where(is_odd)
print(odd_grades)
```

**Output:**

```
0    85.0
1     NaN
2     NaN
3     NaN
4     NaN
5    95.0
6    89.0
Name: Grade, dtype: float64
```
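For one-off conditions, an inline `lambda` works the same way as a named function; the callable receives the Series itself and should return a Boolean result of the same shape:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

# Same odd-number filter as above, written inline
odd_grades = df['Grade'].where(lambda s: s % 2 == 1)
print(odd_grades)
```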

## In-place Modification

By default, the `where` method doesn't alter the original DataFrame or Series. Instead, it returns a new object with the modifications.

By setting the `inplace` parameter to `True`, the original object is modified instead:

```python
df.where(df['Grade'] > 85, inplace=True)
```

Now, `df` directly reflects the changes, with grades above 85 retained and the others replaced by NaN. There's no new object returned; all modifications are made on `df`.

**Note:** When working with massive DataFrames, modifying in place can be more memory-efficient since it doesn't create a new object.
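One point worth checking when using `inplace=True`: the call returns `None`, so don't assign its result back to a variable. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

result = df.where(df['Grade'] > 85, inplace=True)
print(result)                    # None — nothing is returned
print(df['Grade'].isna().sum())  # the original df was modified instead
```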

## Handling NaN Values

As we saw in the previous examples, entries that don't meet the specified condition are replaced with `NaN` values.

Let's say you want entries not meeting your condition to be replaced by a static value, like -1, instead of NaN:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89]
}
df = pd.DataFrame(data)
graded_df = df['Grade'].where(df['Grade'] > 85, other=-1)
print(graded_df)
```

**Output:**

```
0    -1
1    90
2    -1
3    88
4    -1
5    95
6    89
Name: Grade, dtype: int64
```

**Using Different Values for Different Entries**

What if you wish to replace each non-conforming entry with a different value? The `other` parameter can also accept arrays or Series:

```python
replacement_values = [10, 20, 30, 40, 50, 60, 70]
graded_df = df['Grade'].where(df['Grade'] > 85, other=replacement_values)
print(graded_df)
```

**Output:**

```
0    10
1    90
2    30
3    88
4    50
5    95
6    89
Name: Grade, dtype: int64
```

### Using Functions for Dynamic Replacement

For more intricate replacement logic, you can also use functions:

```python
def custom_replacement(grade):
    return grade / 2

graded_df = df['Grade'].where(df['Grade'] > 85, other=custom_replacement)
print(graded_df)
```

**Output:**

```
0    42.5
1    90.0
2    39.0
3    88.0
4    38.0
5    95.0
6    89.0
Name: Grade, dtype: float64
```

The `other` parameter in the `where` method provides an added layer of customization.
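If you'd rather remove the NaN entries entirely than substitute values for them, you can follow `where` with `dropna`, or patch them afterwards with `fillna`. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'Grade': [85, 90, 78, 88, 76, 95, 89]})

# Drop the masked entries instead of keeping NaN placeholders
passing = df['Grade'].where(df['Grade'] > 85).dropna()
print(passing)

# Or fill them in after the fact
patched = df['Grade'].where(df['Grade'] > 85).fillna(0)
print(patched)
```

Note that `dropna` changes the length of the result, while `fillna` preserves the original shape.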

## Performance of where vs. query or boolean indexing

When filtering data, three popular choices arise: the `where` method, the `query` method, and direct boolean indexing.

Let’s measure the performance of these methods.

First, we need to create a sample DataFrame:

```python
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({
    'A': np.random.randint(1, 1000, 1000000),
    'B': np.random.randint(1, 1000, 1000000)
})
```

**Timing the where Method**

```python
def using_where(df):
    return df.where(df['A'] > 50)
```

**Timing Boolean Indexing**

```python
def using_boolean_indexing(df):
    return df[df['A'] > 50]
```

**Timing the query Method**

```python
def using_query(df):
    return df.query('A > 50')
```

Now, let's time these functions using `timeit`:

```python
import timeit

n_iter = 10
where_time = timeit.timeit('using_where(df)', globals=globals(), number=n_iter) / n_iter
bool_idx_time = timeit.timeit('using_boolean_indexing(df)', globals=globals(), number=n_iter) / n_iter
query_time = timeit.timeit('using_query(df)', globals=globals(), number=n_iter) / n_iter

print(f"Average time using 'where' method over {n_iter} iterations: {where_time:.6f} seconds")
print(f"Average time using boolean indexing over {n_iter} iterations: {bool_idx_time:.6f} seconds")
print(f"Average time using 'query' method over {n_iter} iterations: {query_time:.6f} seconds")
```

**Output:**

```
Average time using 'where' method over 10 iterations: 0.061809 seconds
Average time using boolean indexing over 10 iterations: 0.025589 seconds
Average time using 'query' method over 10 iterations: 0.029168 seconds
```

From these results, boolean indexing is the fastest of the three on this dataset, with `query` close behind and `where` noticeably slower.
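Part of the gap comes from what each approach returns: `where` produces a result with the same shape as the input (with NaNs filling the gaps), while boolean indexing and `query` return only the matching rows. A quick sketch on a tiny frame:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 60, 30, 90]})

print(df.where(df['A'] > 50).shape)  # (4, 1) — same shape, NaNs fill the gaps
print(df[df['A'] > 50].shape)        # (2, 1) — only the matching rows
```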

## Nested `where` Calls

If you want to apply multiple conditions sequentially, one way to achieve this is by chaining or nesting `where` calls.

Consider a simple dataset:

```python
import pandas as pd

data = {
    'Grade': [85, 90, 78, 88, 76, 95, 89],
    'Subject': ['Math', 'English', 'History', 'Math', 'English', 'History', 'Math']
}
df = pd.DataFrame(data)
```

Imagine you want to filter data for entries with a grade above 85 and the subject being ‘Math’.

You can chain `where` calls:

```python
filtered_df = df.where(df['Grade'] > 85).where(df['Subject'] == 'Math')
print(filtered_df)
```

**Output:**

```
   Grade Subject
0    NaN     NaN
1    NaN     NaN
2    NaN     NaN
3   88.0    Math
4    NaN     NaN
5    NaN     NaN
6   89.0    Math
```

**Nested where with the other Parameter**

You can combine this with the `other` parameter for intricate replacements:

```python
replaced_df = df['Grade'].where(df['Grade'] > 85, other=-1).where(df['Subject'] == 'Math', other=-2)
print(replaced_df)
```

**Output:**

```
0    -1
1    -2
2    -2
3    88
4    -2
5    -2
6    89
Name: Grade, dtype: int64
```

Here, grades of 85 or below are first replaced with -1, and any row whose subject isn't 'Math' is then further replaced with -2.
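When chained `where` calls with different `other` values get hard to follow, the same result can often be expressed with NumPy's `np.select`, which picks the first matching condition per element. Note the conditions are listed in the reverse order of the chained calls, because in a chain the *last* `where` wins. A sketch reproducing the output above:

```python
import numpy as np
import pandas as pd

data = {'Grade': [85, 90, 78, 88, 76, 95, 89],
        'Subject': ['Math', 'English', 'History', 'Math', 'English', 'History', 'Math']}
df = pd.DataFrame(data)

# First matching condition wins, so the non-Math test comes first
conditions = [df['Subject'] != 'Math', df['Grade'] <= 85]
choices = [-2, -1]
replaced = np.select(conditions, choices, default=df['Grade'])
print(replaced)  # [-1 -2 -2 88 -2 -2 89]
```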

## Resource

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html
