Using loc with a MultiIndex DataFrame in Pandas

Did you know you can refine your selections when dealing with multi-level indexes using loc?

This tutorial will reveal the power of using loc with a MultiIndex DataFrame in Pandas.

 

 

Basic Usage of loc with MultiIndex

Let’s first import the necessary libraries and create some sample data.

import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(0)

# Create a MultiIndex DataFrame
index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)])
df = pd.DataFrame(np.random.rand(25, 2), index=index)
df.columns = ['A', 'B']
print(df)

Output:

            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

Let’s see how loc function works when dealing with MultiIndexed DataFrame:

# Accessing a single row with loc
print(df.loc[(1, 3)])

# Accessing multiple rows
print(df.loc[[(1, 3), (2, 2)]])

Output:

A    0.020218
B    0.832620
Name: (1, 3), dtype: float64

            A         B
1 3  0.020218  0.832620
2 2  0.118274  0.639921

In the first example, you’re indexing into the DataFrame using a tuple: (1, 3).

This signifies you want the row with an index of 1 in the first level and an index of 3 in the second level. The output is a Series containing the specified row’s data.

In the second example, we provide a list of tuples to the loc attribute, each denoting a different row in our DataFrame.

 

Select all Rows from Multiple Levels

Let’s see how to select all rows from multiple levels:

# Select all rows in level 1 with index 2 and level 2 with index 1
print(df.loc[(2, 1),])

Output:

A    0.461479
B    0.780529
Name: (2, 1), dtype: float64

When we provide (2, 1) as the key to loc, it returns the corresponding row data. The , (comma) represents all elements at other levels.

So with (2, 1),, we’re selecting all rows with first level index 2 and second level index 1.

Let’s go a step further by making use of the slicing operation in Pandas.

# Select all rows where level 1 index is between 2 and 4
print(df.loc[(slice(2, 4), slice(None)), ])

Output:

            A         B
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

Here, we’re using slice objects to define the range of our selection. This displays data where the first level index is between 2 and 4, inclusive.

Further, slice(None) tells Pandas to select all rows in the second level.

 

Select Data Based on Values

What if you want to select data based on the values?

Let’s look at how to do that:

# Select rows where A > 0.5
df_A = df.loc[df['A'] > 0.5]
print(df_A)

Output:

            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  4  0.778157  0.870012
2 0  0.978618  0.799159
  4  0.521848  0.414662
3 3  0.612096  0.616934
  4  0.943748  0.681820
4 1  0.697631  0.060225
  2  0.666767  0.670638

The output DataFrame, df_A, includes all rows where the value in column A is greater than 0.5.

We can also add multiple conditions:

# Loc to select rows where A > 0.5 and B < 0.3
df_AB = df.loc[(df['A'] > 0.5) & (df['B'] < 0.3)]
print(df_AB)

Output:

            A         B
4 1  0.697631  0.060225

The selection becomes more precise when we add another condition to it. Now, df_AB includes rows where column A > 0.5 and column B < 0.3.

Multiple conditions can be combined using ‘&’ (and) or ‘|’ (or) operators.

 

Select Data from Sorted MultiIndex DataFrame

You may encounter unpredictable results or errors if the DataFrame is not sorted. Here’s how you can sort a MultiIndex DataFrame:

# Sort the DataFrame
df_sorted = df.sort_index()
print(df_sorted)

The DataFrame df_sorted is sorted according to the index.

When the DataFrame is sorted, you can make label-based slicing operations using loc:

# Select all rows where the first level index is between 2 and 4
df_sorted_range = df_sorted.loc[(slice(2, 4), slice(None)), ]
print(df_sorted_range)

Output:

            A         B
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

In this case, df_sorted_range consists of rows where the first level index is between 2 and 4, inclusive, similar to a previous example.

 

Apply Boolean Conditions on Index Levels

Based on the above sorted DataFrame, let’s see how to apply a boolean condition on index levels:

# Select all rows with first-level index > 1 and second-level index < 3
df_bool_index = df_sorted.loc[(df_sorted.index.get_level_values(0) > 1) & 
                              (df_sorted.index.get_level_values(1) < 3)]
print(df_bool_index)

Output:

            A         B
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638

In this example, the method get_level_values(0) and get_level_values(1) are used to access the values of the first and second index level, respectively.

The resulting df_bool_index contains rows with a first-level index greater than 1 and a second-level index less than 3.

 

Assign Values to specific indices using MultiIndex

You can use loc to set all DataFrame values based on a condition, or precisely select specific indices:

# Assign a value for a specific index
df_sorted.loc[(1, 3), 'A'] = 0.999
print(df_sorted.loc[(1, 3)])

Output:

A    0.99900
B    0.83262
Name: (1, 3), dtype: float64

We used loc to set the value of column ‘A’ at index (1, 3) to 0.999.

You can also assign values for a range of index levels like this:

# Assign a value for a range of index level
df_sorted.loc[(slice(2, 4), slice(None)), 'A'] = 0.123
print(df_sorted.loc[(slice(2, 4), slice(None)), ])

Output:

         A         B
2 0  0.123  0.799159
  1  0.123  0.780529
  2  0.123  0.639921
  3  0.123  0.944669
  4  0.123  0.414662
3 0  0.123  0.774234
  1  0.123  0.568434
  2  0.123  0.617635
  3  0.123  0.616934
  4  0.123  0.681820
4 0  0.123  0.437032
  1  0.123  0.060225
  2  0.123  0.670638
  3  0.123  0.128926
  4  0.123  0.363711

In this example, we’ve set all values of column ‘A’ where the first level index is between 2 and 4 to 0.123.

Each A-value in the output corresponding to the specified index range now reflects our newly assigned value.

 

Select Rows and Columns from a MultiIndex DataFrame

You can use loc to select data from both rows and columns simultaneously from a MultiIndex DataFrame.

# Select specific rows and columns
df_row_col = df_sorted.loc[(slice(None), slice(1, 3)), 'B']
print(df_row_col)

Output:

0  1    0.544883
   2    0.645894
   3    0.891773
1  1    0.925597
   2    0.087129
   3    0.832620
2  1    0.780529
   2    0.639921
   3    0.944669
3  1    0.568434
   2    0.617635
   3    0.616934
4  1    0.060225
   2    0.670638
   3    0.128926
Name: B, dtype: float64

For the row selection, slice(None) selects all rows in the first index level, and slice(1, 3) selects rows where the second level index is between 1 and 3, inclusive.

We also selected column ‘B’ which results in a Series.

You can select multiple columns like this:

# Select all rows for multiple columns
df_multi_col = df_sorted.loc[(slice(None), slice(None)), ['A', 'B']]
print(df_multi_col)

Output:

            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3  0.143353  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508  0.437032
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

In this example, we selected all rows (slice(None)) from both index levels, and for column selection, we provided a list of column labels, ['A', 'B'].

 

Using Multiple Conditions for Complex Data Selection

You can use multiple conditions to select data from a multi-index DataFrame:

# Select rows where A > 0.5 in level 1 index 2 and B < 0.3 in level 1 index 3
df_complex = df_sorted.loc[((df_sorted.index.get_level_values(0) == 2) & (df_sorted['A'] > 0.5)) |
                           ((df_sorted.index.get_level_values(0) == 3) & (df_sorted['B'] < 0.3))]
print(df_complex)

Output:

            A         B
2 0  0.978618  0.799159
  4  0.521848  0.414662

Here, we combined multiple conditions using the ‘pipe’ or ‘|’ character, which translates to ‘or’.

Located within the parentheses, the ‘ampersand’ or ‘&’ denotes ‘and’.

So, we are looking for rows where the first level index is 2 and the value in column A is greater than 0.5, or, rows where the first level index is 3 and the value in column B is less than 0.3.

 

Handling Missing Values in MultiIndex DataFrame

Let’s first create a DataFrame with missing values.

df_missing = df_sorted.copy()
df_missing.loc[(2, 3), 'A'] = np.nan
df_missing.loc[(4, 0), 'B'] = np.nan
print(df_missing)

Output:

            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  3       NaN  0.944669
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508       NaN
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

In the above DataFrame df_missing, we intentionally set a missing value in column ‘A’ at index (2, 3) and column ‘B’ at index (4, 0).

You can use notna() to select non-missing or missing data:

# Select rows where 'A' is not missing
df_A_not_missing = df_missing.loc[df_missing['A'].notna()]
print(df_A_not_missing)

Output:

            A         B
0 0  0.548814  0.715189
  1  0.602763  0.544883
  2  0.423655  0.645894
  3  0.437587  0.891773
  4  0.963663  0.383442
1 0  0.791725  0.528895
  1  0.568045  0.925597
  2  0.071036  0.087129
  3  0.020218  0.832620
  4  0.778157  0.870012
2 0  0.978618  0.799159
  1  0.461479  0.780529
  2  0.118274  0.639921
  4  0.521848  0.414662
3 0  0.264556  0.774234
  1  0.456150  0.568434
  2  0.018790  0.617635
  3  0.612096  0.616934
  4  0.943748  0.681820
4 0  0.359508       NaN
  1  0.697631  0.060225
  2  0.666767  0.670638
  3  0.210383  0.128926
  4  0.315428  0.363711

The DataFrame df_A_not_missing consists of all rows from the original DataFrame where A is not missing.

Leave a Reply

Your email address will not be published. Required fields are marked *