Handle NaN values in Pandas query Method

If you’re dealing with a dataset that contains NaN (Not a Number) values, using .query() will yield unexpected results.

In this tutorial, you’ll learn how to handle NaN values when using the .query() method.

 

 

Use isna() to Filter Rows with NaN Values

In Pandas, the isna() function is used to identify NaN values in a DataFrame.

You can use isna() directly within the .query() method or using variables.

First, create a sample DataFrame:

import pandas as pd
data = {
    'customer_id': [1, 2, 3, 4, 5],
    'account_balance': [200.0, 300.0, None, 500.0, None],
    'last_purchase_date': ['2021-01-01', '2021-06-15', '2021-05-20', None, '2021-10-01']
}
df = pd.DataFrame(data)
print(df)

Output:

   customer_id  account_balance last_purchase_date
0            1            200.0         2021-01-01
1            2            300.0         2021-06-15
2            3              NaN         2021-05-20
3            4            500.0               None
4            5              NaN         2021-10-01

In this dataset, the account_balance and last_purchase_date columns contain NaN values.

To filter out rows where account_balance is NaN, use the following code:

isna_var = pd.isna
filtered_df = df.query('@isna_var(account_balance)')
print(filtered_df)

Output:

   customer_id  account_balance last_purchase_date
2            3              NaN         2021-05-20
4            5              NaN         2021-10-01

You will get the same output if you use isna() directly:

filtered_df = df.query('account_balance.isna()')
print(filtered_df)

 

Using notna() to Exclude NaN Values

You can use the notna() function to exclude NaN values from your query results.

Continue with the previous DataFrame, and to filter rows where account_balance is not NaN, run the following code:

notna_var = pd.notna
filtered_df_notna = df.query('@notna_var(account_balance)')
print(filtered_df_notna)

Output:

   customer_id  account_balance last_purchase_date
0            1            200.0         2021-01-01
1            2            300.0         2021-06-15
3            4            500.0               None

The external variable notna_var allows us to incorporate pd.notna() within the .query() method.

 

Combining Multiple Conditions Involving NaN Values

You can combine multiple conditions using logical operators like & (and), | (or), and ~ (not).

Suppose you want to find all rows where account_balance is NaN or last_purchase_date is not NaN. You can do this as follows:

isna_var = pd.isna
notna_var = pd.notna
filtered_df_multiple = df.query('@isna_var(account_balance) | @notna_var(last_purchase_date)')
print(filtered_df_multiple)

Output:

   customer_id  account_balance last_purchase_date
0            1            200.0         2021-01-01
1            2            300.0         2021-06-15
2            3              NaN         2021-05-20
4            5              NaN         2021-10-01

Notice the use of the | operator to combine the two conditions. Similarly, you can use & to require both conditions to be true, and ~ to negate a condition.

 

Querying NaN Values in Multi-Index DataFrames

You might encounter DataFrames with NaN values when you’re dealing with hierarchical data or after using functions like .groupby() with multiple columns.

Let’s create a multi-index DataFrame for demonstration:

arrays = [
    ['A', 'A', 'B', 'B'],
    [1, 2, 1, 2]
]
index = pd.MultiIndex.from_arrays(arrays, names=('letters', 'numbers'))
data = {
    'account_balance': [200.0, None, 300.0, None],
    'last_purchase_date': ['2021-01-01', None, '2021-06-15', '2021-05-20']
}
multi_df = pd.DataFrame(data, index=index)
print(multi_df)

Output:

                 account_balance last_purchase_date
letters numbers                                   
A       1                  200.0         2021-01-01
        2                    NaN               None
B       1                  300.0         2021-06-15
        2                    NaN         2021-05-20

Now, to query this DataFrame and find rows where account_balance is NaN, you can use:

isna_var = pd.isna
filtered_multi_df = multi_df.query('@isna_var(account_balance)')
print(filtered_multi_df)

Output:

                 account_balance last_purchase_date
letters numbers                                   
A       2                    NaN               None
B       2                    NaN         2021-05-20

As you can see, the DataFrame has been filtered to only include rows where account_balance is NaN, regardless of the multi-index setup.

Leave a Reply

Your email address will not be published. Required fields are marked *