Pandas isin Method: Efficient Data Filtering in Python

The isin method in Pandas is used to filter DataFrames and Series, it allows you to select rows where a column (or columns) contains specific values.

In this tutorial, we will explore its syntax and parameters, basic usage for filtering rows, using dictionaries and sets for lookup, handling multiple conditions, and more.

 

 

Pandas isin() Syntax and Parameters

Here’s the basic syntax for the isin method when applied to a Series:

Series.isin(values)

And for a DataFrame:

DataFrame.isin(values)

Where: values can be a list, a Series, or a DataFrame.

Parameters:

  • values: The sequence of values you want to search for. This could be a list, a Series, or even another DataFrame.

The method returns a Boolean Series or DataFrame (of the same shape as the original), which can then be used to filter out rows/columns from the original data.

 

Use isin to Filter Rows with a List of Values

Let’s assume you have a DataFrame with a ‘Fruit’ column and you want to filter rows where the fruit is either ‘Apple’ or ‘Banana’.

import pandas as pd
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'],
    'Price': [1, 0.5, 2, 1.5, 3, 1.2, 0.6, 2.1]
})

# Filtering rows with isin method
filtered_df = df[df['Fruit'].isin(['Apple', 'Banana'])]
print(filtered_df)

Output:

    Fruit  Price
0   Apple    1.0
1  Banana    0.5
5   Apple    1.2
6  Banana    0.6

By using the isin method, you have successfully filtered the rows where the fruit is either ‘Apple’ or ‘Banana’.

 

Handling Multiple Conditions

Combining the isin method with other conditional operations using & (and), | (or), and ~ (not) operators allows you to build complex filtering conditions.

Here’s how you can use them to manage multiple conditions:

Using & (AND Operator)

Suppose you want to filter a DataFrame for rows where ‘Fruit’ is either ‘Apple’ or ‘Banana’ AND the ‘Price’ is less than 1.

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'],
    'Price': [1, 0.5, 2, 1.5, 3, 1.2, 0.6, 2.1]
})
filtered_df = df[df['Fruit'].isin(['Apple', 'Banana']) & (df['Price'] < 1)]
print(filtered_df)

Output:

    Fruit  Price
1  Banana    0.5
6  Banana    0.6

Here, you get only the rows where the fruit is either ‘Apple’ or ‘Banana’ and its price is under 1.

Using | (OR Operator)

Now, let’s filter rows where ‘Fruit’ is ‘Apple’ OR the ‘Price’ is greater than 2.

filtered_df = df[df['Fruit'].isin(['Apple']) | (df['Price'] > 2)]
print(filtered_df)

Output:

    Fruit  Price
0   Apple    1.0
4     Fig    3.0
5   Apple    1.2
7  Cherry    2.1

The resulting DataFrame includes rows where the fruit is ‘Apple’ or the price is more than 2.

Using ~ (NOT Operator)

To get rows where ‘Fruit’ is NOT ‘Apple’ or ‘Banana’:

filtered_df = df[~df['Fruit'].isin(['Apple', 'Banana'])]
print(filtered_df)

Output:

    Fruit  Price
2  Cherry    2.0
3    Date    1.5
4     Fig    3.0
7  Cherry    2.1

By using the ~ operator, you’re excluding all rows where the fruit is either ‘Apple’ or ‘Banana’.

 

Using isin for Case-Insensitive Filtering

To ensure case-insensitive lookup, you can transform both the filter list and the DataFrame column to lowercase (or uppercase) before applying the isin method:

filter_fruits = ['APPLE', 'date']
filtered_df = df[df['Fruit'].str.lower().isin([fruit.lower() for fruit in filter_fruits])]
print(filtered_df)

Output:

   Fruit  Price
0  Apple    1.0
3   Date    1.5
5  Apple    1.2

With this approach, filtered_df will capture rows with fruits ‘Apple’, ‘Date’, and the second occurrence of ‘Apple’, regardless of the case mismatch between the DataFrame and the filter list.

Note: For large DataFrames, transforming text cases across entire columns can be computationally intensive.

 

Using isin with a Dictionary for Multiple Columns

Given a dictionary, where the keys are column names and the corresponding values is a list of acceptable values for that column, the isin method will return a DataFrame of boolean values.

Here’s how you can employ this technique:

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'],
    'Color': ['Red', 'Yellow', 'Red', 'Brown', 'Purple', 'Green', 'Yellow', 'Red']
})

# Using isin with a dictionary
criteria = {
    'Fruit': ['Apple', 'Banana'],
    'Color': ['Red', 'Yellow']
}
filtered_mask = df.isin(criteria)
print(filtered_mask)

Output:

   Fruit  Color
0   True   True
1   True   True
2  False   True
3  False  False
4  False  False
5   True  False
6   True   True
7  False   True

From the boolean DataFrame above, you can see where each cell’s value meets the criteria specified in the dictionary. For example, the first row (index 0) meets both criteria: ‘Fruit’ is ‘Apple’ and ‘Color’ is ‘Red’.

Filtering Based on Multiple Column Criteria

To filter the original DataFrame where both the ‘Fruit’ and ‘Color’ columns satisfy their respective criteria, you’d combine the boolean values:

final_filtered_df = df[filtered_mask.all(axis=1)]
print(final_filtered_df)

Output:

    Fruit   Color
0   Apple     Red
1  Banana  Yellow
6  Banana  Yellow

The all(axis=1) method checks if all the conditions (columns) are True for a given row, and the resulting DataFrame only contains the rows that meet the criteria for both columns.

 

Dynamic List Creation for isin Input

You can make the filter from the user input. Given the following DataFrame:

sales_df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Papaya', 'Apple', 'Banana'],
    'Units': [100, 150, 80, 90, 50, 110, 140]
})

For more interactivity, you can ask the user which fruits they’re interested in:

# Asking user for input
user_fruits = input("Enter the names of fruits separated by commas: ").split(",")

# Stripping potential white spaces and converting to list
user_fruits_list = [fruit.strip() for fruit in user_fruits]

# Filtering based on user input
filtered_sales = sales_df[sales_df['Fruit'].isin(user_fruits_list)]
print(filtered_sales)

 

Using isin with Another DataFrame

You can use isin to filter values in one DataFrame based on values from another DataFrame.

Let’s create both DataFrames:

import pandas as pd
data = {
    'ID': [101, 102, 103, 104, 105],
    'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig']
}
df = pd.DataFrame(data)

filter_data = {
    'Product': ['Apple', 'Date']
}
filter_df = pd.DataFrame(filter_data)

To filter the main DataFrame (df) based on values from the filter DataFrame (filter_df):

filtered_by_df = df[df['Product'].isin(filter_df['Product'])]
print(filtered_by_df)

Output:

    ID Product
0  101   Apple
3  104    Date

If you’re generating filters on the fly or they’re subject to frequent change, it’s easier to update a separate filter DataFrame rather than adjusting lists or conditions manually.

 

Handling Null or Missing Values

When using isin with a list that contains NaN or None, it can match and filter the rows with missing values.

df = pd.DataFrame({
    'Fruit': ['Apple', None, 'Cherry', 'Date', 'Fig', 'Apple', None, 'Cherry']
})

# Filtering rows using isin with None
filtered_df = df[df['Fruit'].isin([None])]
print(filtered_df)

Output:

  Fruit
1  None
6  None

Combining isin with notna()

You can filter out missing values while using isin by chaining the notna() method.

fruits_list = ['Apple', None]
filtered_df = df[df['Fruit'].isin(fruits_list) & df['Fruit'].notna()]
print(filtered_df)

Output:

    Fruit
0   Apple
5   Apple

By adding the notna() method, you ensure that the result doesn’t contain any missing values, even if the list contains None or NaN.

Using isna() with isin()

You can combine isin with isna. if you want to get specific values or missing ones:

fruits_list = ['Apple', 'Cherry']
filtered_df = df[df['Fruit'].isin(fruits_list) | df['Fruit'].isna()]
print(filtered_df)

Output:

    Fruit
0   Apple
1    None
2  Cherry
5   Apple
6    None
7  Cherry

By combining isin with isna, the resulting DataFrame contains rows where the fruit is either ‘Apple’ or ‘Cherry’ or has a missing value.

 

Why isin is super fast lookup?

The isin method internally optimizes the operation by converting the input iterable to a set when checking membership.

This is because checking membership in a set is O(1) on average, whereas checking membership in a list is O(n).

To demonstrate the difference in performance between lists and sets, you should avoid using the isin method because it already uses sets.

Instead, you can use Python’s native in operator to check membership in a loop.

import pandas as pd
import time

df = pd.DataFrame({
    'Value': list(range(1, 100001))
})

# Defining a list and a set
lookup_list = list(range(50001, 150001))
lookup_set = set(lookup_list)

# Time taken using list with native in operator
start_time = time.time()
df_list = df[df['Value'].apply(lambda x: x in lookup_list)]
end_time = time.time()
print(f"Time with list: {end_time - start_time} seconds")

# Time taken using set with native in operator
start_time = time.time()
df_set = df[df['Value'].apply(lambda x: x in lookup_set)]
end_time = time.time()
print(f"Time with set: {end_time - start_time} seconds")

Output:

Time with list: 101.42362904548645 seconds
Time with set: 0.03318667411804199 seconds

The set is significantly faster than a list. That’s why isin is super fast.

 

Chaining with loc and iloc

Both loc and iloc are indexing functions in Pandas.

Using isin with loc

loc lets you access a group of rows and columns by labels. When combined with isin, you can filter rows based on certain conditions while also selecting specific columns.

Consider a sample DataFrame:

df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry'],
    'Price': [0.5, 0.3, 1.0, 0.8],
    'Quantity': [100, 150, 50, 80]
})

To filter rows where the Fruit is either ‘Apple’ or ‘Mango’, and only select the ‘Price’ column:

selected_prices = df.loc[df['Fruit'].isin(['Apple', 'Mango']), 'Price']
print(selected_prices)

Output:

0    0.5
2    1.0
Name: Price, dtype: float64

Using isin with iloc

While iloc is primarily for integer-based location indexing, you can combine boolean masks like the one produced by isin with column integer indices to get desired selections.

To achieve the same result as the previous example using iloc:

row_mask = df['Fruit'].isin(['Apple', 'Mango'])
selected_prices = df.iloc[row_mask.values, 1]
print(selected_prices)

Output:

0    0.5
2    1.0
Name: Price, dtype: float64

Remember that while using iloc, you need to convert the boolean Series (the mask) to its values using  the .values attribute.

Modifying Values Using isin with loc

You can also use isin with loc to modify specific entries in the DataFrame.

Let’s say you want to give a discount, reducing the price by 10% for ‘Apple’ and ‘Mango’:

df.loc[df['Fruit'].isin(['Apple', 'Mango']), 'Price'] *= 0.9
print(df)

Output:

    Fruit  Price  Quantity
0   Apple   0.45       100
1  Banana   0.30       150
2   Mango   0.90        50
3  Cherry   0.80        80

Notice how the prices for ‘Apple’ and ‘Mango’ have been modified.

 

Using with where Method

When combined with isin, the where method allows for expressive conditional selection.

Consider a sample DataFrame:

import pandas as pd
df = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Grape'],
    'Price': [0.5, 0.3, 1.0, 0.8, 0.6]
})

To keep the values of rows where the Fruit is either ‘Apple’ or ‘Mango’ and replace others with NaN:

filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']))
print(filtered_df)

Output:

    Fruit  Price
0   Apple    0.5
1     NaN    NaN
2   Mango    1.0
3     NaN    NaN
4     NaN    NaN

By default, where replaces non-matching rows with NaN. You can specify a different value or even a DataFrame of replacement values.

For instance, replacing the non-matching fruits with ‘Not Selected’ and their prices with 0:

replacement_df = pd.DataFrame({
    'Fruit': ['Not Selected'] * len(df),
    'Price': [0] * len(df)
})
filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']), other=replacement_df)
print(filtered_df)

Output:

         Fruit  Price
0        Apple    0.5
1  Not Selected    0.0
2        Mango    1.0
3  Not Selected    0.0
4  Not Selected    0.0

Combining where and isin with Multiple Conditions

Let’s say we want to retain fruits that are either ‘Apple’ or ‘Mango’, but only if their price is greater than 0.4:

filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']) & (df['Price'] > 0.4))
print(filtered_df)

Output:

   Fruit  Price
0  Apple    0.5
1    NaN    NaN
2  Mango    1.0
3    NaN    NaN
4    NaN    NaN

 

isin vs. Direct Comparison

For a single value or a small set of values, direct comparisons using logical operators (==, !=, &, |) is more efficient and straightforward.

As the number of values you want to check against grows, using multiple direct comparisons can become cumbersome and less readable. This is where isin comes in:

fruits_list = ['Apple', 'Mango', 'Grape', 'Peach', 'Strawberry', ...]  # a long list
filtered_df = df[df['Fruit'].isin(fruits_list)]

Let’s perform a simple speed test to know which one works faster:

import pandas as pd
import numpy as np
import timeit

# Create a sample dataframe
np.random.seed(42)
n = 10**6  # Number of rows in the DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 1000, n)
})

# Some values to be checked
values_to_check = list(range(500, 510))

# isin method
def using_isin():
    return df[df['A'].isin(values_to_check)]

# Direct comparison method using chaining comparison
def using_direct_comparison():
    return df[(500 <= df['A']) & (df['A'] <= 509)]

# Measure the performance
isin_time = timeit.timeit(using_isin, number=100)
chaining_comparison_time = timeit.timeit(using_direct_comparison, number=100)

print(f"Time using isin: {isin_time:.5f} seconds")
print(f"Time using direct comparison: {chaining_comparison_time:.5f} seconds")

Output:

Time using isin: 3.27576 seconds
Time using direct comparison: 0.60838 seconds

Direct comparison is much faster!

Note that while the direct comparison is faster in this specific case (consecutive integers), it’s not always the case.

By using non-consecutive values or more complex filtering criteria, the difference is not that much:

import pandas as pd
import numpy as np
import timeit

# Create a sample dataframe
np.random.seed(42)
n = 10**6  # Number of rows in the DataFrame
df = pd.DataFrame({
    'A': np.random.randint(0, 1000, n)
})

values_to_check = [500, 502, 505, 507, 510, 515, 520]

# isin method combined with other conditions
def using_isin():
    even_and_greater_than_700 = (df['A'] > 700) & (df['A'] % 2 == 0)
    squared_ends_in_25 = (df['A']**2) % 100 == 25
    return df[df['A'].isin(values_to_check) | even_and_greater_than_700 | squared_ends_in_25]

# Direct comparison method using == and or combined with other conditions
def using_direct_comparison():
    return df[
        (df['A'] == 500) |
        (df['A'] == 502) |
        (df['A'] == 505) |
        (df['A'] == 507) |
        (df['A'] == 510) |
        (df['A'] == 515) |
        (df['A'] == 520) |
        ((df['A'] > 700) & (df['A'] % 2 == 0)) |
        ((df['A']**2) % 100 == 25)
    ]

# Measure the performance
isin_time = timeit.timeit(using_isin, number=100)
direct_comparison_time = timeit.timeit(using_direct_comparison, number=100)

print(f"Time using isin : {isin_time:.5f} seconds")
print(f"Time using direct comparison: {direct_comparison_time:.5f} seconds")

Output:

Time using isin : 6.36738 seconds
Time using direct comparison: 6.16720 seconds

Direct comparison is slightly faster.

 

Practical Examples (Lookup Logic with isin)

Let’s say you’re analyzing a dataset of book titles. You want to categorize these titles based on a genre lookup.

Instead of writing complex logic, you can use isin for this task.

# Sample books DataFrame
books = pd.DataFrame({
    'Title': ['The Hobbit', 'War and Peace', 'The Great Gatsby', 'A Brief History of Time', 'Pride and Prejudice'],
})

# Genre lookup lists
fantasy_titles = ['The Hobbit', 'Harry Potter', 'Lord of the Rings']
classic_titles = ['War and Peace', 'The Great Gatsby', 'Pride and Prejudice']

# Using isin for lookup logic
books['Genre'] = 'Other'
books.loc[books['Title'].isin(fantasy_titles), 'Genre'] = 'Fantasy'
books.loc[books['Title'].isin(classic_titles), 'Genre'] = 'Classic'
print(books)

Output:

                    Title    Genre
0              The Hobbit  Fantasy
1           War and Peace  Classic
2        The Great Gatsby  Classic
3  A Brief History of Time    Other
4      Pride and Prejudice  Classic

Using isin in combination with loc, we’ve categorized each book based on its genre.

 

Resource

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html

Leave a Reply

Your email address will not be published. Required fields are marked *