Pandas NOT IN filter (Opposite of isin): Finding Mismatches

The Pandas isin method allows you to match DataFrame or Series data against a set of values.

However, in some cases, you want to retrieve records that don’t match any values in a given list.

That’s where the negation of isin comes into play.

In this tutorial, you’ll walk through various techniques to achieve this functionality.

 

 

Example of NOT IN filter

You can use the bitwise NOT operator ~ in conjunction with df['column'].isin([values])

First, let’s create a sample DataFrame:

import pandas as pd
df = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4, 5],
    'Plan': ['Basic', 'Premium', 'Basic', 'Enterprise', 'Premium'],
    'MonthlyCharge': [20, 50, 20, 100, 50]
})
print(df)

Output:

   CustomerID       Plan  MonthlyCharge
0           1      Basic             20
1           2    Premium             50
2           3      Basic             20
3           4  Enterprise           100
4           5    Premium             50

To filter out the rows where the plan is neither ‘Basic’ nor ‘Premium’, you can use the following code:

filtered_df = df[~df['Plan'].isin(['Basic', 'Premium'])]
print(filtered_df)

Output:

   CustomerID       Plan  MonthlyCharge
3           4  Enterprise           100

Notice the use of the bitwise NOT operator ~. It inverts the boolean values returned by df['Plan'].isin(['Basic', 'Premium']), giving us the rows that do not match the condition.

 

Using NOT IN with Numerical Columns

Suppose you want to filter out the rows where MonthlyCharge is not 20 or 50. You can accomplish this with the following code:

filtered_numerical_df = df[~df['MonthlyCharge'].isin([20, 50])]
print(filtered_numerical_df)

Output:

   CustomerID       Plan  MonthlyCharge
3           4  Enterprise           100

The bitwise NOT operator ~ functions the same way with numerical data, inverting the boolean values and thus displaying the rows that don’t match the criteria.

 

Using NOT IN with DateTime Columns

First, let’s add a DateTime column to our DataFrame to represent when each customer joined:

df['JoinDate'] = pd.to_datetime(['2022-01-01', '2022-02-15', '2022-01-20', '2022-03-10', '2022-02-21'])
print(df)

Output:

   CustomerID       Plan  MonthlyCharge   JoinDate
0           1      Basic             20 2022-01-01
1           2    Premium             50 2022-02-15
2           3      Basic             20 2022-01-20
3           4  Enterprise           100 2022-03-10
4           5    Premium             50 2022-02-21

Now, let’s say you want to filter out customers who did not join on ‘2022-01-01’ or ‘2022-02-15’:

filtered_date_df = df[~df['JoinDate'].isin([pd.Timestamp('2022-01-01'), pd.Timestamp('2022-02-15')])]
print(filtered_date_df)

Output:

   CustomerID       Plan  MonthlyCharge   JoinDate
2           3      Basic             20 2022-01-20
3           4  Enterprise           100 2022-03-10
4           5    Premium             50 2022-02-21

Here, the bitwise NOT operator ~ filters out the rows where the JoinDate is not in the given list.

 

Applying NOT IN Filter on Multiple Columns

There may be instances where you’ll want to apply the negation of isin across multiple columns.

In such cases, you can create a boolean mask by combining multiple conditions using logical operators like & (and) or | (or).

Let’s filter out rows where Plan is not ‘Basic’ and MonthlyCharge is not 50:

mask = (~df['Plan'].isin(['Basic'])) & (~df['MonthlyCharge'].isin([50]))
filtered_multi_col_df = df[mask]
print(filtered_multi_col_df)

Output:

   CustomerID       Plan  MonthlyCharge   JoinDate
3           4  Enterprise           100 2022-03-10

We created a boolean mask using two negated isin conditions and used it to filter out the rows.

 

Combining NOT IN with Other Filters

In complex data manipulation tasks, you often need to combine the negation of isin with other kinds of filters.

You can do this by using logical operators like & (and) or | (or).

Using & and | operators for Combined Filtering

Let’s start by combining multiple conditions using the & and | operators.

Let’s filter out rows where the plan is not ‘Basic’ and MonthlyCharge is less than 50:

combined_mask = (~df['Plan'].isin(['Basic'])) & (df['MonthlyCharge'] < 100)
filtered_combined_df = df[combined_mask]
print(filtered_combined_df)

Output:

   CustomerID     Plan  MonthlyCharge
1           2  Premium             50
4           5  Premium             50

 

How NOT ISIN Interacts with NaN

Let’s introduce some NaN values into our DataFrame:

import numpy as np
df['Plan'].iloc[0] = np.nan
print(df)

Output:

   CustomerID       Plan  MonthlyCharge   JoinDate
0           1        NaN             20 2022-01-01
1           2    Premium             50 2022-02-15
2           3      Basic             20 2022-01-20
3           4  Enterprise           100 2022-03-10
4           5    Premium             50 2022-02-21

Now, let’s filter out rows where the plan is neither ‘Basic’ nor ‘Premium’:

filtered_nan_df = df[~df['Plan'].isin(['Basic', 'Premium'])]
print(filtered_nan_df)

Output:

   CustomerID       Plan  MonthlyCharge   JoinDate
0           1        NaN             20 2022-01-01
3           4  Enterprise           100 2022-03-10

The NaN value in the ‘Plan’ column is included in the output. This happens because NaN is not equal to any value, including itself, thus satisfying the NOT IN condition.

Leave a Reply

Your email address will not be published. Required fields are marked *