Filtering Multi-Indexed DataFrames Using Pandas isin

One feature that proves extremely useful is the isin() method, especially when dealing with MultiIndex DataFrames.

This tutorial will walk you through various examples to show how you can use the isin() method with MultiIndex DataFrames.

 

 

Applying isin on MultiIndex DataFrame: Row-wise

In this section, you’ll learn how to apply the isin() method on a MultiIndex DataFrame, focusing on row-wise filtering.

The goal is to filter rows based on multiple conditions.

First, let’s create a sample MultiIndex DataFrame:

import pandas as pd
data = {
    'AccountID': [101, 101, 102, 103, 103, 104],
    'Timestamp': ['2021-01-01', '2021-01-02', '2021-01-01', '2021-01-01', '2021-01-02', '2021-01-01'],
    'Data_Usage': [1.2, 3.1, 2.4, 2.9, 3.5, 1.8],
    'Call_Duration': [5, 15, 7, 9, 12, 8]
}
df = pd.DataFrame(data)

# Set AccountID and Timestamp as MultiIndex
df.set_index(['AccountID', 'Timestamp'], inplace=True)
print(df)

Output:

                    Data_Usage  Call_Duration
AccountID Timestamp                          
101       2021-01-01        1.2             5
          2021-01-02        3.1            15
102       2021-01-01        2.4             7
103       2021-01-01        2.9             9
          2021-01-02        3.5            12
104       2021-01-01        1.8             8

Now, let’s say you want to filter this DataFrame to include only the records for account IDs 101 and 103 and timestamps ‘2021-01-01’ and ‘2021-01-02’.

You can do this using the isin() method.

filtered_df = df[df.index.isin([(101, '2021-01-01'), (103, '2021-01-02')])]
print(filtered_df)

Output:

                    Data_Usage  Call_Duration
AccountID Timestamp                          
101       2021-01-01        1.2             5
103       2021-01-02        3.5            12

By applying the isin() method on the index, you’ve successfully filtered the DataFrame to only include records for the specified account IDs and timestamps.

 

Applying isin on MultiIndex DataFrame: Column-wise

In this part, you’ll delve into column-wise filtering using the isin() method on MultiIndex DataFrames.

The idea here is to focus on filtering specific columns rather than rows. First, let’s extend the previous DataFrame to include another level of columns.

data = {
    ('Data_Usage', '4G'): [1.2, 3.1, 2.4, 2.9, 3.5, 1.8],
    ('Data_Usage', '5G'): [0.5, 0.2, 0.7, 0.4, 0.8, 0.3],
    ('Call_Duration', 'Local'): [5, 15, 7, 9, 12, 8],
    ('Call_Duration', 'International'): [0, 2, 1, 3, 4, 1]
}

index = pd.MultiIndex.from_tuples(
    [(101, '2021-01-01'), (101, '2021-01-02'), (102, '2021-01-01'),
     (103, '2021-01-01'), (103, '2021-01-02'), (104, '2021-01-01')],
    names=['AccountID', 'Timestamp']
)
df = pd.DataFrame(data, index=index)
print(df)

Output:

                    Data_Usage      Call_Duration             
                            4G   5G         Local International
AccountID Timestamp                                            
101       2021-01-01       1.2  0.5            5             0
          2021-01-02       3.1  0.2           15             2
102       2021-01-01       2.4  0.7            7             1
103       2021-01-01       2.9  0.4            9             3
          2021-01-02       3.5  0.8           12             4
104       2021-01-01       1.8  0.3            8             1

This MultiIndex DataFrame has two levels for columns: one for the type of data (Data_Usage and Call_Duration) and another for the technology used (4G, 5G) or call type (Local, International).

Using isin to filter specific columns

Suppose you’re interested in only the columns related to Data_Usage for 4G and Call_Duration for Local calls. Here’s how to do it:

filtered_df = df.loc[:, df.columns.isin([('Data_Usage', '4G'), ('Call_Duration', 'Local')])]
print(filtered_df)

Output:

                    Data_Usage Call_Duration
                            4G         Local
AccountID Timestamp                         
101       2021-01-01       1.2            5
          2021-01-02       3.1           15
102       2021-01-01       2.4            7
103       2021-01-01       2.9            9
          2021-01-02       3.5           12
104       2021-01-01       1.8            8

 

Applying isin on MultiIndex Series

Applying the isin() method on a MultiIndex Series is similar to applying it on a MultiIndex DataFrame.

First, we’ll extract a MultiIndex Series from our sample DataFrame:

data_usage_4g_series = df[('Data_Usage', '4G')]
print(data_usage_4g_series)

Output:

AccountID  Timestamp 
101        2021-01-01    1.2
           2021-01-02    3.1
102        2021-01-01    2.4
103        2021-01-01    2.9
           2021-01-02    3.5
104        2021-01-01    1.8
Name: (Data_Usage, 4G), dtype: float64

If you want to filter this Series to include only the data usage for account IDs 101 and 103, you can use the isin() method like this:

filtered_series = data_usage_4g_series[data_usage_4g_series.index.isin([(101, '2021-01-01'), (103, '2021-01-02')])]
print(filtered_series)

Output:

AccountID  Timestamp 
101        2021-01-01    1.2
103        2021-01-02    3.5
Name: (Data_Usage, 4G), dtype: float64

After applying the isin() method, the filtered Series now includes only the records for account IDs 101 and 103 for the specified timestamps.

This shows that the isin() method can filter MultiIndex Series, just like it is for MultiIndex DataFrames.

Leave a Reply

Your email address will not be published. Required fields are marked *