20+ Examples of filtering Pandas DataFrame

Data filtering has become an integral phase of data analysis wherein we have to study the data to exclude, sort, rearrange, extract specific data, or apportion data according to certain criteria.

In other words, data filtering is the process of extracting a smaller part of the larger dataset to perform specific operations on it. Pandas provides a wide range of tools, methods, and techniques to perform filtering on data.

We often perform data filtering to extract or operate on a small set of data for a particular time, calculate certain operations for a group of data as per interest, exclude bad, missing, or junk data for data cleansing, or validate statistical models.

This tutorial will give you a comprehensive idea of various data filtering techniques and methods one can use on DataFrames in Pandas.

 

 

Examples of Data filtering

In data analysis, data filtering – also known as ‘Subsetting Data,’ is an initial phase for predictive modeling or any reporting data project.

Before digging deep into the practical aspects of applying data filtering through various techniques, let us first explore some practical examples of data filtering.

  1. We can filter out data of all active customers who opened their accounts on the first day of April 2022.
  2. Extract specific details of the customer as per region, geolocation, address, etc.
  3. Fetch details of employees who have spent more than 4 years in the company.
  4. Analyze complaint data or identify customers who have complained more than five times a week using filtering.
  5. Filter out details of metro cities where per-capita income is greater than 40,000.

 

Using filter() method

It is an easy-to-use method for filtering data from a DataFrame. It returns the subset rows or columns of DataFrame as per the labels mentioned in the specified items parameter.

We can also use filter() for regex (regular expression) in Python. Its syntax is:

DataFrame.filter(self, items = [label1, label2, ….], like = None, regex = None, axis = None)

Here is a code snippet showing how to use it.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jacky', "A104", 'Security Head', 320000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stefen', "A107", 'Lead Developer', 310000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df.filter(items = ['Name', 'Designation']))

Output

This output shows how to use filter() method in Dataframe

 

Filtering using head() and tail()

The very basic approach to filtering the top-most or bottom-most data from a DataFrame is leveraging the head() and tail() functions.

Let’s understand what the head() and tail() do and how we can use them by passing and without passing parameters.

Using head()

The head(no) is a well-known DataFrame function of the Pandas that allows us to extract or filter out DataFrame values by returning the first set of rows specified in the parameter.

In case, the parameter is left blank, it will return the first five rows from the dataframe by default (top 5 rows).

To return a greater or lesser number of rows from the DataFrame, the number of rows needs to be specified. The code shown below is an example of how to use head().

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jack', "A104", 'Security Head', 400000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stephen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
# calling head() without parameter
print(df.head())
print()	
# calling head() with parameter
print(df.head(2))

Output

This output shows how to use head() method in Dataframe

Using tail()

The tail(no) is another well-known DataFrame function of the Pandas that allows us to extract or filter out DataFrame values by returning the last set of rows specified in the parameter.

In case, the parameter is left blank, it will return the last five rows from the dataframe by default (top 5 rows).

To return a greater or lesser number of rows from the last from within the DataFrame, the number of rows needs to be specified. The code shown below is an example of how to use tail().

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jack', "A104", 'Security Head', 400000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stephen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
# calling head() without parameter
print(df.tail())
print()
# calling head() with parameter
print(df.tail(2))

Output

This output shows how to use tail() method in Dataframe

 

Slicing and dicing DataFrame using .loc and .iloc

Slicing and dicing a large dataset into a smaller dataset gives us a clarification of what to use & what not to use.

It is a filtering mechanism to extract the precise value that can help in identifying a concept or getting granular insight from massive data.

In Pandas, slicing is possible using the .loc[ ] and .iloc[ ] methods.

Using .loc[]:

The loc property, used through the .loc[] helps in accessing and filtering the group of rows & columns by slicing them or returning a Boolean array.

By passing the “starting value: ending value” within the square brackets, we can specify the number of rows or columns it will slice to filter out specific data.
Syntax –

loc[row-label, column-label]
iloc[row-position, column-position]

Here is the code snippet that shows how to implement .loc[]

 import pandas as pd
a = pd.DataFrame ({'Brand': ['Royal Enfield', 'Honda', 'Orbea', 'SCOTT', 'Hero', 'BMC', 'Suzuki', 'KTM', 'Giant', 'Kona'],
					'Year': [2016, 2018, 2014, 2015, 2018, 2018, 2019, 2020, 2022, 2021],
					'Kms Driven': [22000, 34000, 23000, 20000, 25000, 24000, 23000, 26000, 29000, 30000],
					'Price': ['Rs. 1.4K', 'Rs. 1.8K', 'Rs. 2.1K', 'Rs. 1.5K', 'Rs. 0.9K', 'Rs. 1.4K','Rs. 2.2K', 'Rs. 2.1K', 'Rs. 1.45K', 'Rs. 1.65K'],
					'Mileage': [21, 22, 21, 23, 24, 21, 24, 23, 24, 27]})
# Here, we print the DataFrame
print (a.loc[3: 8])

Output

This output shows how to use slicing method in Dataframe

Using .iloc[] :

The iloc[] is an indexed-based data selecting method in Python Pandas. It helps in fetching only the specific row and column value intersecting as specified within the [] of the iloc.

In other words, the users need to pass two integers (row and column intersection – that starts from 0) as an index value within the [] to specify which row or column value they want to select specifically.
Syntax:

loc[row-number, column-number]

Here is the code snippet that shows how to implement .iloc[]

import pandas as pd
x = pd.DataFrame ({'Brand': ['Royal Enfield', 'Honda', 'Orbea', 'SCOTT', 'Hero', 'BMC', 'Suzuki', 'KTM', 'Giant', 'Kona'],
					'Year': [2016, 2018, 2014, 2015, 2018, 2018, 2019, 2020, 2022, 2021],
					'Kms Driven': [22000, 34000, 23000, 20000, 25000, 24000, 23000, 26000, 29000, 30000],
					'Price': ['Rs. 1.4K', 'Rs. 1.8K', 'Rs. 2.1K', 'Rs. 1.5K', 'Rs. 0.9K', 'Rs. 1.4K','Rs. 2.2K', 'Rs. 2.1K', 'Rs. 1.45K', 'Rs. 1.65K'],
					'Mileage': [21, 22, 21, 23, 24, 21, 24, 23, 24, 27]})
# Here, we print the DataFrame
print(x.iloc[6, 2])

Output

This output shows how to use slicing method in Dataframe

 

Filtering Pandas DataFrame using Operators

There are different operators we can use to filter specific data. We have to create those data filtering techniques through expressions with operators.

Various operators allow us to filter data from a DataFrame. Here are 2 code snippets that are showing how to implement relational operators.

Method 1: Using Relational operators

We can also use relational operators like >, >=, <, <=, etc. to perform the filtering of specific rows from a DataFrame.

Here is the code snippet that shows how to implement the relational operator.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jack', "A104", 'Security Head', 400000],
     ['Sue', "A105", 'CFO', 320000],	
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stephen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df[df.Monthly_Salary > 320000])

Output

This output shows how to use filter using operators in Dataframe

The relational operator also works with strings. If the string.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jack', "A104", 'Security Head', 400000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stephen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df[df.Name > 'Jack'])

Output

This output shows how to use filter using operators in Dataframe

Method 2: Using logical operators

We can also use logical operators like & and | to perform the filtering of specific rows from a DataFrame. Here is the code snippet that shows how to implement the logical operator.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jack', "A104", 'Security Head', 400000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stephen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df[(df.Monthly_Salary > 320000) | (df.Name == 'Iris')])

Output

This output shows how to use filter using logical operators in Dataframe

Method 3: Using the Not logic (~)

The not logic will negate an expression or filtering mechanism so that all other conditions to extract the rows except that negated condition get fetched from the DataFrame.

Here is the code snippet that shows how to implement not logic.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jack', "A104", 'Security Head', 400000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stefen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df[~df.Name.str.startswith('S')])

Output

This output shows how to use filter using NOT logic operator in Dataframe

 

Filtering using isin() method

The isin() method is another significant way of filtering data with multiple conditions applied. In other words, it can extract those data from DataFrame rows that have the specific condition mentioned within the isin() method.

Here is the code snippet that shows how to implement isin() method.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jack', "A104", 'Security Head', 320000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stefen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df)
print()
f = df[df["Monthly_Salary"].isin([320000])]
print(f)

Output

This output shows how to use filter using isin() method in Dataframe

 

Filtering using str accessor

Pandas is known for its efficient textual data handling. The str accessor provides adjustable methods for filtering rows based on strings.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jacky', "A104", 'Security Head', 320000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stefen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df[df.Name.str.endswith('y')])

Output

This output shows how to use filter using str accessor in Dataframe

 

Filter DataFrame using query() method

The query method provides us with clear and multi-level flexibility for extracting filtered data by composing the conditions for extracting data from DataFrame. Mostly, we pass the conditions as a string.

Here is a code snippet showing how to implement it.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jacky', "A104", 'Security Head', 320000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stefen', "A107", 'Lead Developer', 310000]]
df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df.query('Emp_ID == "A102" or Monthly_Salary > 400000'))

Output

This output shows how to use filter using query() method in Dataframe

 

Using nlargest or nsmallest for filtering

Often we come across situations where filtering data from a range is not what we need. There are situations like finding those rows from the DataFrame that have the largest & smallest values.

In that case, we prefer to go with specific methods like nlargest() and nsmallest().

These methods will fetch the ‘n’ largest or smallest records from the DataFrame based on the parameters passed.

The first parameter determines the number of rows to extract, while the second parameter determines based on which column’s value the data will get extracted.

  • nlargest(): Fetches the largest value-based rows as per the parameters given. The syntax is:
    nlargest(number_of_rows, column_name)

    Here is a code snippet showing the use of both.

    import pandas as pd
    d = [['Ray', "A101", 'CTO', 530000],
         ['Gaurav', "A102", 'CMO', 410000],
         ['Dee', "A103", 'Marketing Head', 370000],
         ['Jacky', "A104", 'Security Head', 320000],
         ['Sue', "A105", 'CFO', 320000],
         ['Iris', "A106", 'IT Manager', 210000],
         ['Stefen', "A107", 'Lead Developer', 310000]]
    df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                      index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
    print(df.nlargest(3, 'Monthly_Salary'))

    Output

    This output shows how to use filter using nlargest() method in Dataframe

  • nsmallest(): Fetches the smallest value-based rows as per the parameters given. The syntax is:
    nsmallest(number_of_rows, column_name)
    import pandas as pd
    d = [['Ray', "A101", 'CTO', 530000],
         ['Gaurav', "A102", 'CMO', 410000],
         ['Dee', "A103", 'Marketing Head', 370000],
         ['Jacky', "A104", 'Security Head', 320000],
         ['Sue', "A105", 'CFO', 320000],
         ['Iris', "A106", 'IT Manager', 210000],
         ['Stefen', "A107", 'Lead Developer', 310000]]
    df = df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                      index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
    print(df.nsmallest(3, 'Monthly_Salary'))

    Output

    This output shows how to use filter using nsmallest() method in Dataframe

 

Filtering data using the between()

The between method is another essential method that filters out a specific set of rows and columns from a DataFrame based on the data specified as a given range as its parameter. Here is a code snippet showing how to use it.

import pandas as pd
d = [['Ray', "A101", 'CTO', 530000],
     ['Gaurav', "A102", 'CMO', 410000],
     ['Dee', "A103", 'Marketing Head', 370000],
     ['Jacky', "A104", 'Security Head', 320000],
     ['Sue', "A105", 'CFO', 320000],
     ['Iris', "A106", 'IT Manager', 210000],
     ['Stefen', "A107", 'Lead Developer', 310000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
print(df[df["Monthly_Salary"].between(300000, 400000)])

Output

This output shows how to use filter using betwen() method in Dataframe

 

Filter non-missing data from DataFrame

We can filter out those rows that have missing data within a DataFrame. We can do it using the dropna() method.

It is also feasible to drop a row or multiple rows if it has 2 NaN values within it by specifying the tresh value. df.dropna(thresh = 2): This operation will drop all rows where there are at least two NaN values.

Now, we can print out the filtered DataFrame after dropping all NaN values. Here is a code snippet that shows how to use it.

import pandas as pd
d = [['Ray', 101, 'CTO', 530000],
     ['Gaurav', 102, 'CMO', 410000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['Jacky', None, 'Security Head'],
     ['Sue', 105, 'CFO', 320000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer']]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
z = df.dropna()
print(z)

Output

This output shows how to use filter non-missing data in Dataframe

 

Filter missing values using the notnull() method

The notnull() method helps in detecting non-missing values for objects like Series and DataFrame. It will return a true/false value and from that, we can filter out those rows that have missing values.

Here is an example of how to use it.

import pandas as pd
d = [['Ray', 101, 'CTO', 530000],
     ['Gaurav', 102, 'CMO', 410000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['Jacky', 104, 'Security Head'],
     ['Sue', 105, 'CFO', 320000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer']]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
z = df[df.Monthly_Salary.notnull()]
print(z)

Output

This output shows how to use filter notnull() method for missing data in Dataframe

 

Filtering DataFrame value using .at[]

We can use the .at[] to pin-point specific values from the DataFrame so that we can filter out what is there in that particular position or intersecting point. Here is a code showing how we can use it through a loop.

import pandas as pd
d = [['Ray', 101, 'Chief Technical Officer', 530000],
     ['Gaurav', 102, 'Chief Marketing Officer', 450000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['Jacky', 104, 'Security Head', 390000],
     ['Sue', 105, 'Chief Finance Officer', 410000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer', 250000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = [1, 2, 3, 4, 5, 6, 7])
for row in range(1,8):
     z = df.at[row,'Designation']
     print(z)
print()
for row in range(1,8):
     z = df.at[row,'Monthly_Salary']
     print(z)

Output

This output shows how to use filter with .at[] method in Dataframe

 

Filter strings and substrings

We can also filter specific rows of a DataFrame based on the strings and substrings. For this, we can use the contains() method.

The contains() is similar to that of the LIKE in SQL. Here is a code explaining how to filter rows from a DataFrame using substrings.

import pandas as pd
d = [['Ray', 101, 'Chief Technical Officer', 530000],
     ['Gaurav', 102, 'Chief Marketing Officer', 410000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['Jacky', 104, 'Security Head'],
     ['Sue', 105, 'Chief Finance Officer', 320000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer']]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
z = df[df['Designation'].str.contains("Officer")]
print(z)

Output

This output shows how to use filter strings and substrings method in Dataframe

 

Filter DataFrame by rows position and column names

We can specify the index locations followed by the column names to filter out the exact set of rows from the DataFrame.

In this filtering technique, we will need the .loc[] and .index[] to perform the operation. Here is an example showing how to use it.

import pandas as pd
d = [['Ray', 101, 'Chief Technical Officer', 530000],
     ['Gaurav', 102, 'Chief Marketing Officer', 410000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['Jacky', 104, 'Security Head'],
     ['Sue', 105, 'Chief Finance Officer', 320000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer']]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
z= df.loc[df.index[0:4], ["Name", "Emp_ID"]]
print(z)

Output

This output shows how to use Filter DataFrame by rows position and column names in Dataframe

 

Use Lambda method for Filtering

Lambda is a shortcut approach to define various user-defined functions. With the help of Lambda, we can perform various operations through a single line of code. Here is an example showing how to use it.

import pandas as pd
d = [['Ray', 101, 'Chief Technical Officer', 530000],
     ['Gaurav', 102, 'Chief Marketing Officer', 450000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['Jacky', 104, 'Security Head'],
     ['Sue', 105, 'Chief Finance Officer', 410000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer']]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
z= df[df.apply(lambda x: x["Designation"] == 'Chief Marketing Officer' and x["Monthly_Salary"] >= 400000, axis = 1)]
print(z)
z= df[df.apply(lambda x: x["Designation"] == 'Chief Finance Officer' and x["Monthly_Salary"] >= 400000, axis = 1)]
print(z)

Output

This output shows how to use lambda method for filtering DataFrame

 

Filtering through the list comprehension technique

List comprehension is a syntactic construct that helps to extract data from an array-type object into a list. It uses the set-builder notation of mathematics to generate a new data object from an existing one.

It is another essential way to filter out a certain set of data.

We can also perform other basic operations on those filtered data within that list comprehension. Here is an example showing how to use it.

import pandas as pd
data = {'name': ['Ray', 'Gaurav', 'Sue', 'Iris', 'Dee'],
        'year': [2018, 2019, 2022, 2020, 2021],
        'joining': [4, 2, 3, 1, 4]}
df = pd.DataFrame(data, index = ['New Delhi', 'Noida', 'Mumbai', 'Bangaluru', 'Chennai'])
print(df)
print()
z = df['next_year'] = [row+1 for row in df['year']]
print("They have/will complete one year in: ", z, " respectively")

Output

This output shows how to use comprehension technique filtering DataFrame

 

Filter DataFrame using regex (Regular expressions)

The regular expression is a special programming feature that helps us sequence the characters to form a search pattern based on strings/characters.

Regular expression helps us determine whether a string contains the specified search pattern. We can also use it to filter specific rows from a DataFrame. Here is a code snippet showing how to implement it.

import pandas as pd
d = [['Ray', 101, 'Chief Technical Officer', 530000],
     ['Gaurav', 102, 'Chief Marketing Officer', 450000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['George', 104, 'Security Head', 390000],
     ['Sue', 105, 'Chief Finance Officer', 410000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer', 250000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = [1, 2, 3, 4, 5, 6, 7])
regex = 'G.*'
print ("Applying ", regex, " on DataFrame to extract names starting with G: \n", df[df.Name.str.match(regex)])
print()
regex = 'S.*'
print ("Applying ", regex, " on DataFrame to extract names starting with S: \n", df[df.Name.str.match(regex)])

Output

This output shows how to use regex method for filtering DataFrame

 

Visualizing Filtered data

In data science, apart from data filtering, it is also essential to visualize the data so that professionals can get a clear picture of how the data is communicating.

For data visualization, we can use the Matplotlib library. Here is a code snippet showing how to use it.

import pandas as pd
import matplotlib.pyplot as plt
d = [['Ray', 101, 'Chief Technical Officer', 530000],
     ['Gaurav', 102, 'Chief Marketing Officer', 450000],
     ['Dee', 103, 'Marketing Head', 370000],
     ['Jacky', 104, 'Security Head', 390000],
     ['Sue', 105, 'Chief Finance Officer', 410000],
     ['Iris', 106, 'IT Manager', 210000],
     ['Stefen', 107, 'Lead Developer']]
df = pd.DataFrame(d, columns = ['Name', 'Emp_ID', 'Designation', 'Monthly_Salary'],
                  index = ['A', 'B', 'C', 'D', 'E', 'F', 'G'])
df= df[df.apply(lambda x: x["Designation"] == 'Chief Marketing Officer' or x["Monthly_Salary"] >= 300000, axis = 1)]
print(df)
plt.bar(df.Name, df.Monthly_Salary, width = 0.4)
plt.show()

Output

This output shows how visualize Filtered data in DataFrame
This output shows how visualize Filtered data in DataFrame

 

Conclusion

We hope this tutorial has given a crisp idea of the various ways through which we can perform filtering data in DataFrame. It is always essential to filter data from a large dataset so that, we can precisely analyze & deal with the right set of data.

Here we have highlighted some methods that can help filter out the data based on certain conditions. We have also used various operators, techniques, and expressions to perform data filtering.

Lastly, we have seen how to visualize the filtered data of a DataFrame using a visualization library.

 

Further Reading

For further reading about filtering DataFrames, check this:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.filter.html

 

Leave a Reply

Your email address will not be published. Required fields are marked *