Pandas isin Method: Efficient Data Filtering in Python
The isin
method in Pandas is used to filter DataFrames and Series, it allows you to select rows where a column (or columns) contains specific values.
In this tutorial, we will explore its syntax and parameters, basic usage for filtering rows, using dictionaries and sets for lookup, handling multiple conditions, and more.
- 1 Pandas isin() Syntax and Parameters
- 2 Use isin to Filter Rows with a List of Values
- 3 Handling Multiple Conditions
- 4 Using isin for Case-Insensitive Filtering
- 5 Using isin with a Dictionary for Multiple Columns
- 6 Dynamic List Creation for isin Input
- 7 Using isin with Another DataFrame
- 8 Handling Null or Missing Values
- 9 Why isin is super fast lookup?
- 10 Chaining with loc and iloc
- 11 Using with where Method
- 12 isin vs. Direct Comparison
- 13 Practical Examples (Lookup Logic with isin)
- 14 Resource
Pandas isin() Syntax and Parameters
Here’s the basic syntax for the isin
method when applied to a Series:
Series.isin(values)
And for a DataFrame:
DataFrame.isin(values)
Where: values
can be a list, a Series, or a DataFrame.
Parameters:
- values: The sequence of values you want to search for. This could be a list, a Series, or even another DataFrame.
The method returns a Boolean Series or DataFrame (of the same shape as the original), which can then be used to filter out rows/columns from the original data.
Use isin to Filter Rows with a List of Values
Let’s assume you have a DataFrame with a ‘Fruit’ column and you want to filter rows where the fruit is either ‘Apple’ or ‘Banana’.
import pandas as pd df = pd.DataFrame({ 'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'], 'Price': [1, 0.5, 2, 1.5, 3, 1.2, 0.6, 2.1] }) # Filtering rows with isin method filtered_df = df[df['Fruit'].isin(['Apple', 'Banana'])] print(filtered_df)
Output:
Fruit Price 0 Apple 1.0 1 Banana 0.5 5 Apple 1.2 6 Banana 0.6
By using the isin
method, you have successfully filtered the rows where the fruit is either ‘Apple’ or ‘Banana’.
Handling Multiple Conditions
Combining the isin
method with other conditional operations using &
(and), |
(or), and ~
(not) operators allows you to build complex filtering conditions.
Here’s how you can use them to manage multiple conditions:
Using & (AND Operator)
Suppose you want to filter a DataFrame for rows where ‘Fruit’ is either ‘Apple’ or ‘Banana’ AND the ‘Price’ is less than 1.
df = pd.DataFrame({ 'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'], 'Price': [1, 0.5, 2, 1.5, 3, 1.2, 0.6, 2.1] }) filtered_df = df[df['Fruit'].isin(['Apple', 'Banana']) & (df['Price'] < 1)] print(filtered_df)
Output:
Fruit Price 1 Banana 0.5 6 Banana 0.6
Here, you get only the rows where the fruit is either ‘Apple’ or ‘Banana’ and its price is under 1.
Using | (OR Operator)
Now, let’s filter rows where ‘Fruit’ is ‘Apple’ OR the ‘Price’ is greater than 2.
filtered_df = df[df['Fruit'].isin(['Apple']) | (df['Price'] > 2)] print(filtered_df)
Output:
Fruit Price 0 Apple 1.0 4 Fig 3.0 5 Apple 1.2 7 Cherry 2.1
The resulting DataFrame includes rows where the fruit is ‘Apple’ or the price is more than 2.
Using ~ (NOT Operator)
To get rows where ‘Fruit’ is NOT ‘Apple’ or ‘Banana’:
filtered_df = df[~df['Fruit'].isin(['Apple', 'Banana'])] print(filtered_df)
Output:
Fruit Price 2 Cherry 2.0 3 Date 1.5 4 Fig 3.0 7 Cherry 2.1
By using the ~
operator, you’re excluding all rows where the fruit is either ‘Apple’ or ‘Banana’.
Using isin for Case-Insensitive Filtering
To ensure case-insensitive lookup, you can transform both the filter list and the DataFrame column to lowercase (or uppercase) before applying the isin
method:
filter_fruits = ['APPLE', 'date'] filtered_df = df[df['Fruit'].str.lower().isin([fruit.lower() for fruit in filter_fruits])] print(filtered_df)
Output:
Fruit Price 0 Apple 1.0 3 Date 1.5 5 Apple 1.2
With this approach, filtered_df
will capture rows with fruits ‘Apple’, ‘Date’, and the second occurrence of ‘Apple’, regardless of the case mismatch between the DataFrame and the filter list.
Note: For large DataFrames, transforming text cases across entire columns can be computationally intensive.
Using isin with a Dictionary for Multiple Columns
Given a dictionary, where the keys are column names and the corresponding values is a list of acceptable values for that column, the isin
method will return a DataFrame of boolean values.
Here’s how you can employ this technique:
df = pd.DataFrame({ 'Fruit': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig', 'Apple', 'Banana', 'Cherry'], 'Color': ['Red', 'Yellow', 'Red', 'Brown', 'Purple', 'Green', 'Yellow', 'Red'] }) # Using isin with a dictionary criteria = { 'Fruit': ['Apple', 'Banana'], 'Color': ['Red', 'Yellow'] } filtered_mask = df.isin(criteria) print(filtered_mask)
Output:
Fruit Color 0 True True 1 True True 2 False True 3 False False 4 False False 5 True False 6 True True 7 False True
From the boolean DataFrame above, you can see where each cell’s value meets the criteria specified in the dictionary. For example, the first row (index 0) meets both criteria: ‘Fruit’ is ‘Apple’ and ‘Color’ is ‘Red’.
Filtering Based on Multiple Column Criteria
To filter the original DataFrame where both the ‘Fruit’ and ‘Color’ columns satisfy their respective criteria, you’d combine the boolean values:
final_filtered_df = df[filtered_mask.all(axis=1)] print(final_filtered_df)
Output:
Fruit Color 0 Apple Red 1 Banana Yellow 6 Banana Yellow
The all(axis=1)
method checks if all the conditions (columns) are True
for a given row, and the resulting DataFrame only contains the rows that meet the criteria for both columns.
Dynamic List Creation for isin Input
You can make the filter from the user input. Given the following DataFrame:
sales_df = pd.DataFrame({ 'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Papaya', 'Apple', 'Banana'], 'Units': [100, 150, 80, 90, 50, 110, 140] })
For more interactivity, you can ask the user which fruits they’re interested in:
# Asking user for input user_fruits = input("Enter the names of fruits separated by commas: ").split(",") # Stripping potential white spaces and converting to list user_fruits_list = [fruit.strip() for fruit in user_fruits] # Filtering based on user input filtered_sales = sales_df[sales_df['Fruit'].isin(user_fruits_list)] print(filtered_sales)
Using isin with Another DataFrame
You can use isin
to filter values in one DataFrame based on values from another DataFrame.
Let’s create both DataFrames:
import pandas as pd data = { 'ID': [101, 102, 103, 104, 105], 'Product': ['Apple', 'Banana', 'Cherry', 'Date', 'Fig'] } df = pd.DataFrame(data) filter_data = { 'Product': ['Apple', 'Date'] } filter_df = pd.DataFrame(filter_data)
To filter the main DataFrame (df
) based on values from the filter DataFrame (filter_df
):
filtered_by_df = df[df['Product'].isin(filter_df['Product'])] print(filtered_by_df)
Output:
ID Product 0 101 Apple 3 104 Date
If you’re generating filters on the fly or they’re subject to frequent change, it’s easier to update a separate filter DataFrame rather than adjusting lists or conditions manually.
Handling Null or Missing Values
When using isin
with a list that contains NaN or None, it can match and filter the rows with missing values.
df = pd.DataFrame({ 'Fruit': ['Apple', None, 'Cherry', 'Date', 'Fig', 'Apple', None, 'Cherry'] }) # Filtering rows using isin with None filtered_df = df[df['Fruit'].isin([None])] print(filtered_df)
Output:
Fruit 1 None 6 None
Combining isin with notna()
You can filter out missing values while using isin
by chaining the notna()
method.
fruits_list = ['Apple', None] filtered_df = df[df['Fruit'].isin(fruits_list) & df['Fruit'].notna()] print(filtered_df)
Output:
Fruit 0 Apple 5 Apple
By adding the notna()
method, you ensure that the result doesn’t contain any missing values, even if the list contains None or NaN.
Using isna() with isin()
You can combine isin
with isna
. if you want to get specific values or missing ones:
fruits_list = ['Apple', 'Cherry'] filtered_df = df[df['Fruit'].isin(fruits_list) | df['Fruit'].isna()] print(filtered_df)
Output:
Fruit 0 Apple 1 None 2 Cherry 5 Apple 6 None 7 Cherry
By combining isin
with isna
, the resulting DataFrame contains rows where the fruit is either ‘Apple’ or ‘Cherry’ or has a missing value.
Why isin is super fast lookup?
The isin
method internally optimizes the operation by converting the input iterable to a set when checking membership.
This is because checking membership in a set is O(1) on average, whereas checking membership in a list is O(n).
To demonstrate the difference in performance between lists and sets, you should avoid using the isin
method because it already uses sets.
Instead, you can use Python’s native in
operator to check membership in a loop.
import pandas as pd import time df = pd.DataFrame({ 'Value': list(range(1, 100001)) }) # Defining a list and a set lookup_list = list(range(50001, 150001)) lookup_set = set(lookup_list) # Time taken using list with native in operator start_time = time.time() df_list = df[df['Value'].apply(lambda x: x in lookup_list)] end_time = time.time() print(f"Time with list: {end_time - start_time} seconds") # Time taken using set with native in operator start_time = time.time() df_set = df[df['Value'].apply(lambda x: x in lookup_set)] end_time = time.time() print(f"Time with set: {end_time - start_time} seconds")
Output:
Time with list: 101.42362904548645 seconds Time with set: 0.03318667411804199 seconds
The set is significantly faster than a list. That’s why isin
is super fast.
Chaining with loc and iloc
Both loc and iloc are indexing functions in Pandas.
Using isin with loc
loc
lets you access a group of rows and columns by labels. When combined with isin
, you can filter rows based on certain conditions while also selecting specific columns.
Consider a sample DataFrame:
df = pd.DataFrame({ 'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry'], 'Price': [0.5, 0.3, 1.0, 0.8], 'Quantity': [100, 150, 50, 80] })
To filter rows where the Fruit is either ‘Apple’ or ‘Mango’, and only select the ‘Price’ column:
selected_prices = df.loc[df['Fruit'].isin(['Apple', 'Mango']), 'Price'] print(selected_prices)
Output:
0 0.5 2 1.0 Name: Price, dtype: float64
Using isin with iloc
While iloc
is primarily for integer-based location indexing, you can combine boolean masks like the one produced by isin
with column integer indices to get desired selections.
To achieve the same result as the previous example using iloc
:
row_mask = df['Fruit'].isin(['Apple', 'Mango']) selected_prices = df.iloc[row_mask.values, 1] print(selected_prices)
Output:
0 0.5 2 1.0 Name: Price, dtype: float64
Remember that while using iloc
, you need to convert the boolean Series (the mask) to its values using the .values
attribute.
Modifying Values Using isin with loc
You can also use isin
with loc
to modify specific entries in the DataFrame.
Let’s say you want to give a discount, reducing the price by 10% for ‘Apple’ and ‘Mango’:
df.loc[df['Fruit'].isin(['Apple', 'Mango']), 'Price'] *= 0.9 print(df)
Output:
Fruit Price Quantity 0 Apple 0.45 100 1 Banana 0.30 150 2 Mango 0.90 50 3 Cherry 0.80 80
Notice how the prices for ‘Apple’ and ‘Mango’ have been modified.
Using with where Method
When combined with isin
, the where
method allows for expressive conditional selection.
Consider a sample DataFrame:
import pandas as pd df = pd.DataFrame({ 'Fruit': ['Apple', 'Banana', 'Mango', 'Cherry', 'Grape'], 'Price': [0.5, 0.3, 1.0, 0.8, 0.6] })
To keep the values of rows where the Fruit is either ‘Apple’ or ‘Mango’ and replace others with NaN:
filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango'])) print(filtered_df)
Output:
Fruit Price 0 Apple 0.5 1 NaN NaN 2 Mango 1.0 3 NaN NaN 4 NaN NaN
By default, where
replaces non-matching rows with NaN. You can specify a different value or even a DataFrame of replacement values.
For instance, replacing the non-matching fruits with ‘Not Selected’ and their prices with 0:
replacement_df = pd.DataFrame({ 'Fruit': ['Not Selected'] * len(df), 'Price': [0] * len(df) }) filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']), other=replacement_df) print(filtered_df)
Output:
Fruit Price 0 Apple 0.5 1 Not Selected 0.0 2 Mango 1.0 3 Not Selected 0.0 4 Not Selected 0.0
Combining where and isin with Multiple Conditions
Let’s say we want to retain fruits that are either ‘Apple’ or ‘Mango’, but only if their price is greater than 0.4:
filtered_df = df.where(df['Fruit'].isin(['Apple', 'Mango']) & (df['Price'] > 0.4)) print(filtered_df)
Output:
Fruit Price 0 Apple 0.5 1 NaN NaN 2 Mango 1.0 3 NaN NaN 4 NaN NaN
isin vs. Direct Comparison
For a single value or a small set of values, direct comparisons using logical operators (==
, !=
, &
, |
) is more efficient and straightforward.
As the number of values you want to check against grows, using multiple direct comparisons can become cumbersome and less readable. This is where isin
comes in:
fruits_list = ['Apple', 'Mango', 'Grape', 'Peach', 'Strawberry', ...] # a long list filtered_df = df[df['Fruit'].isin(fruits_list)]
Let’s perform a simple speed test to know which one works faster:
import pandas as pd import numpy as np import timeit # Create a sample dataframe np.random.seed(42) n = 10**6 # Number of rows in the DataFrame df = pd.DataFrame({ 'A': np.random.randint(0, 1000, n) }) # Some values to be checked values_to_check = list(range(500, 510)) # isin method def using_isin(): return df[df['A'].isin(values_to_check)] # Direct comparison method using chaining comparison def using_direct_comparison(): return df[(500 <= df['A']) & (df['A'] <= 509)] # Measure the performance isin_time = timeit.timeit(using_isin, number=100) chaining_comparison_time = timeit.timeit(using_direct_comparison, number=100) print(f"Time using isin: {isin_time:.5f} seconds") print(f"Time using direct comparison: {chaining_comparison_time:.5f} seconds")
Output:
Time using isin: 3.27576 seconds Time using direct comparison: 0.60838 seconds
Direct comparison is much faster!
Note that while the direct comparison is faster in this specific case (consecutive integers), it’s not always the case.
By using non-consecutive values or more complex filtering criteria, the difference is not that much:
import pandas as pd import numpy as np import timeit # Create a sample dataframe np.random.seed(42) n = 10**6 # Number of rows in the DataFrame df = pd.DataFrame({ 'A': np.random.randint(0, 1000, n) }) values_to_check = [500, 502, 505, 507, 510, 515, 520] # isin method combined with other conditions def using_isin(): even_and_greater_than_700 = (df['A'] > 700) & (df['A'] % 2 == 0) squared_ends_in_25 = (df['A']**2) % 100 == 25 return df[df['A'].isin(values_to_check) | even_and_greater_than_700 | squared_ends_in_25] # Direct comparison method using == and or combined with other conditions def using_direct_comparison(): return df[ (df['A'] == 500) | (df['A'] == 502) | (df['A'] == 505) | (df['A'] == 507) | (df['A'] == 510) | (df['A'] == 515) | (df['A'] == 520) | ((df['A'] > 700) & (df['A'] % 2 == 0)) | ((df['A']**2) % 100 == 25) ] # Measure the performance isin_time = timeit.timeit(using_isin, number=100) direct_comparison_time = timeit.timeit(using_direct_comparison, number=100) print(f"Time using isin : {isin_time:.5f} seconds") print(f"Time using direct comparison: {direct_comparison_time:.5f} seconds")
Output:
Time using isin : 6.36738 seconds Time using direct comparison: 6.16720 seconds
Direct comparison is slightly faster.
Practical Examples (Lookup Logic with isin)
Let’s say you’re analyzing a dataset of book titles. You want to categorize these titles based on a genre lookup.
Instead of writing complex logic, you can use isin
for this task.
# Sample books DataFrame books = pd.DataFrame({ 'Title': ['The Hobbit', 'War and Peace', 'The Great Gatsby', 'A Brief History of Time', 'Pride and Prejudice'], }) # Genre lookup lists fantasy_titles = ['The Hobbit', 'Harry Potter', 'Lord of the Rings'] classic_titles = ['War and Peace', 'The Great Gatsby', 'Pride and Prejudice'] # Using isin for lookup logic books['Genre'] = 'Other' books.loc[books['Title'].isin(fantasy_titles), 'Genre'] = 'Fantasy' books.loc[books['Title'].isin(classic_titles), 'Genre'] = 'Classic' print(books)
Output:
Title Genre 0 The Hobbit Fantasy 1 War and Peace Classic 2 The Great Gatsby Classic 3 A Brief History of Time Other 4 Pride and Prejudice Classic
Using isin
in combination with loc
, we’ve categorized each book based on its genre.
Resource
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.