Selecting Data by Multiple Conditions using Pandas loc

Pandas offers loc property, a label-based data selection method that allows you to select data based on multiple conditions.

This tutorial aims to guide you on using the Pandas loc function under multiple conditions.

 

 

Understanding Logical Operators for Multiple Conditions

Logical operators in Python, such as ‘and’, ‘or’, and ‘not’, allow combining simple conditions to form complex conditional statements.

Logical Operators

  • ‘and’: Returns True if both conditions are true
  • ‘or’: Returns True if at least one condition is true
  • ‘not’: Returns False if the condition is true (reverses the condition)

Here’s a quick run-through:

print(True and False)
print(True or False)
print(not True)

Output:

False
True
False

In the code above, the ‘and’ operator returns False because both conditions (True and False) are not True.

The ‘or’ operator returns True because at least one of the conditions (True or False) is True.

Lastly, the ‘not’ operator returns False because the condition is True and ‘not’ reverses the condition.

 

Combining Conditions with AND operator (&)

Let’s now dive into combining conditions with the AND operator shown as & in Pandas.

We will start with a simple DataFrame example:

import pandas as pd
data = {'Name': ['Tom', 'Nick', 'John', 'Tom'], 'Age': [20, 21, 19, 18], 'Grade': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)
print(df)

Output:

  Name  Age Grade
0  Tom   20     A
1 Nick   21     B
2 John   19     A
3  Tom   18     C

In this DataFrame, we have ‘Name’, ‘Age’, and ‘Grade’ as columns.

Now, let’s say your task is to fetch the rows where ‘Age’>19 and ‘Grade’=’A’. Here’s how you can do it:

print(df.loc[(df['Age'] > 19) & (df['Grade'] == 'A')])

Output:

  Name  Age Grade
0  Tom   20     A

Using the loc function and the ‘and’ operator (&), the code fetches the rows satisfying both conditions: ‘Age’ > 19 and ‘Grade’ equals ‘A’.

 

Combining Conditions with OR operator (|)

Referring again to our previous DataFrame, this time let’s fetch rows where ‘Age’ > 19 or ‘Grade’=’A’.

print(df.loc[(df['Age'] > 19) | (df['Grade'] == 'A')])

Output:

   Name  Age Grade
0   Tom   20     A
1  Nick   21     B
2  John   19     A

In the above code, we’ve used the loc function with the OR operator (|). This returns the rows fulfilling at least one of the conditions: ‘Age’>19 or ‘Grade’ equals ‘A’.

 

Using Multiple Conditions with NOT operator (~)

The NOT operator is represented as ~ in Pandas. The NOT operator is useful if you want to get the reverse of a condition.

Sticking to our DataFrame from before, let’s get the rows where ‘Grade’ is NOT ‘A’.

print(df.loc[~(df['Grade'] == 'A')])

Output:

   Name  Age Grade
1  Nick   21     B
3   Tom   18     C

This code uses the loc function with the NOT operator (~) to return the rows where ‘Grade’ is not ‘A’. As you can see, row 0 and row 2 are excluded in the output as they have ‘Grade’ equals ‘A’.

 

Combining More Than Two Conditions

Let’s use our previous DataFrame and fetch the rows where ‘Age’ is greater than 19, ‘Grade’ is ‘A’, and ‘Name’ is ‘Tom’.

print(df.loc[(df['Age'] > 19) & (df['Grade'] == 'A') & (df['Name'] == 'Tom')])

Output:

  Name  Age Grade
0  Tom   20     A

In the above code, we’ve extended the conditions in our loc function to use three conditions instead of two.

 

Multiple Conditions with a Specific Column

Let’s see how we can select a specific column with multiple conditions using Pandas loc.

We take data from previous examples:

print(df.loc[(df['Grade'] == 'A') & (df['Age'] > 19), 'Name'])

Output:

0    Tom
Name: Name, dtype: object

In the code snippet, we’ve used loc to select all rows where ‘Grade’ equals ‘A’ and ‘Age’ is more than 19.

Furthermore, by appending 'Name' at the end of the loc function, it selects only the ‘Name’ column from those rows that fulfill the conditions.

 

Select Multiple Columns with Multiple Conditions

Let’s see how you can select multiple columns with multiple conditions with loc.

print(df.loc[(df['Age'] > 19), ['Name', 'Grade']])

Output:

   Name Grade
0  Tom     A
1 Nick    B

In the above code, we’ve used ‘loc’ to select all rows where ‘Age’ is more than 19.

Additionally, by adding a list, ['Name', 'Grade'], it selects only the ‘Name’ and ‘Grade’ columns from the rows that fulfill the conditions.

 

Handling Null Values Using loc and Multiple Conditions

To give you an illustration, let’s modify our DataFrame to include null values:

import numpy as np
data = {'Name': ['Tom', 'Nick', 'John', 'Tom'], 'Age': [20, 21, 19, np.nan], 'Grade': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)
print(df)

Output:

  Name   Age Grade
0  Tom  20.0     A
1 Nick  21.0     B
2 John  19.0     A
3  Tom   NaN     C

Here, you see the ‘Age’ of the last entry (‘Tom’) is NaN (Not a Number), indicating a null or missing value.

Now, let’s fetch the rows where ‘Age’ is not null and ‘Grade’ equals ‘A’.

print(df.loc[(df['Grade'] == 'A') & (df['Age'].notna())])

Output:

   Name   Age Grade
0   Tom  20.0     A
2  John  19.0     A

This code uses the notna() function in combination with loc and logical AND to select only those rows where ‘Grade’ is ‘A’ and ‘Age’ is not null.

 

Using loc with Multiple Conditions for Date/Time Data

Let’s start with a simple DataFrame with date/time as one of the columns.

data = {'Date': pd.date_range(start='1/1/2020', end='1/10/2020'), 'Value': range(10)}
df = pd.DataFrame(data)

print(df)

Output:

        Date  Value
0 2020-01-01      0
1 2020-01-02      1
2 2020-01-03      2
3 2020-01-04      3
4 2020-01-05      4
5 2020-01-06      5
6 2020-01-07      6
7 2020-01-08      7
8 2020-01-09      8
9 2020-01-10      9

To select all rows where ‘Date’ is later than January 3, 2020 and ‘Value’ is greater than 5, you can use the following code:

print(df.loc[(df['Date'] > '2020-01-03') & (df['Value'] > 5)])

Output:

        Date  Value
6 2020-01-07      6
7 2020-01-08      7
8 2020-01-09      8
9 2020-01-10      9

The loc function, in combination with the logical AND operator, filters the DataFrame for rows where ‘Date’ is after ‘2020-01-03’ and ‘Value’ is more than 5.

 

Using loc with Multiple Conditions for Numerical Data

Let’s consider a simple DataFrame where we have numerical data.

data = {'Column1': [15, 20, 35, 40, 55], 'Column2': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)

print(df)

Output:

   Column1  Column2
0       15      100
1       20      200
2       35      300
3       40      400
4       55      500

To select rows where ‘Column1’ is greater than 20 and ‘Column2’ is less than 500, you can use the following code:

print(df.loc[(df['Column1'] > 20) & (df['Column2'] < 500)])

Output:

   Column1  Column2
2       35      300
3       40      400

The loc function, with the logical AND operator (&), filters the DataFrame for rows where ‘Column1’ is greater than 20 and ‘Column2’ is less than 500.

 

Using loc with Multiple Conditions for Categorical Data

Consider this DataFrame:

data = {'Name': ['Tom', 'Nick', 'John', 'Tom'], 'Color': ['Blue', 'Green', 'Red', 'Blue'], 'Number': [1, 2, 3, 4]}
df = pd.DataFrame(data)
print(df)

Output:

   Name  Color  Number
0   Tom   Blue       1
1  Nick  Green       2
2  John    Red       3
3   Tom   Blue       4

Let’s display the rows where ‘Color’ is ‘Blue’ and ‘Number’ is greater than 1.

print(df.loc[(df['Color'] == 'Blue') & (df['Number'] > 1)])

Output:

  Name Color  Number
3  Tom  Blue       4

In the code above, loc combined with the AND operator filters DataFrame for rows where ‘Color’ is ‘Blue’ and ‘Number’ is more than 1.

 

Updating DataFrame Based on Multiple Conditions

Let’s go back to our DataFrame:

data = {'Name': ['Tom', 'Nick', 'John', 'Tom'], 'Age': [20, 21, 19, 18], 'Grade': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)

print(df)

Output:

  Name  Age Grade
0  Tom   20     A
1 Nick   21     B
2 John   19     A
3  Tom   18     C

Let’s say you want to change the ‘Grade’ from ‘B’ to ‘A’ for those with ‘Age’ > 19. Here’s how you can do that:

df.loc[(df['Age'] > 19) & (df['Grade'] == 'B'), 'Grade'] = 'A'
print(df)

Output:

  Name  Age Grade
0  Tom   20     A
1 Nick   21     A
2 John   19     A
3  Tom   18     C

In the above code, loc has selected the rows where ‘Age’ > 19 and ‘Grade’ is ‘B’. It then updates the ‘Grade’ to ‘A’.

 

Real-world Example

Back in my days at the telecom company in Egypt, analyzing large datasets was a regular part of my job.

We gathered millions of data points from our numerous customers, extending from their call history, data usage, top-up patterns, and complaint logs.

Once I had a task to identify the subscribers who used more than 5GB data, placed more than 1000 minutes of calls, and registered at least three complaints in the last six months.

By using Pandas loc with multiple conditions, the task mentioned above became a matter of just a few lines of code!

target_subscribers = df.loc[(df['Data Usage'] > 5) & (df['Call Minutes'] > 1000) & (df['Complaints'] >= 3)]

This was extremely beneficial when the network team needed to explore different thresholds to identify potential service improvement areas.

To conclude, using Pandas loc with multiple conditions proves invaluable in many data analysis operations, allowing you to filter, select, or change data in versatile and efficient ways.

Leave a Reply

Your email address will not be published. Required fields are marked *