Using loc with a MultiIndex DataFrame in Pandas
Did you know you can refine your selections when dealing with multi-level indexes using loc?
This tutorial will reveal the power of using loc with a MultiIndex DataFrame in Pandas.
- 1 Basic Usage of loc with MultiIndex
- 2 Select all Rows from Multiple Levels
- 3 Select Data Based on Values
- 4 Select Data from Sorted MultiIndex DataFrame
- 5 Apply Boolean Conditions on Index Levels
- 6 Assign Values to specific indices using MultiIndex
- 7 Select Rows and Columns from a MultiIndex DataFrame
- 8 Using Multiple Conditions for Complex Data Selection
- 9 Handling Missing Values in MultiIndex DataFrame
Basic Usage of loc with MultiIndex
Let’s first import the necessary libraries and create some sample data.
import pandas as pd import numpy as np # Set a random seed for reproducibility np.random.seed(0) # Create a MultiIndex DataFrame index = pd.MultiIndex.from_tuples([(i, j) for i in range(5) for j in range(5)]) df = pd.DataFrame(np.random.rand(25, 2), index=index) df.columns = ['A', 'B'] print(df)
Output:
A B 0 0 0.548814 0.715189 1 0.602763 0.544883 2 0.423655 0.645894 3 0.437587 0.891773 4 0.963663 0.383442 1 0 0.791725 0.528895 1 0.568045 0.925597 2 0.071036 0.087129 3 0.020218 0.832620 4 0.778157 0.870012 2 0 0.978618 0.799159 1 0.461479 0.780529 2 0.118274 0.639921 3 0.143353 0.944669 4 0.521848 0.414662 3 0 0.264556 0.774234 1 0.456150 0.568434 2 0.018790 0.617635 3 0.612096 0.616934 4 0.943748 0.681820 4 0 0.359508 0.437032 1 0.697631 0.060225 2 0.666767 0.670638 3 0.210383 0.128926 4 0.315428 0.363711
Let’s see how loc
function works when dealing with MultiIndexed DataFrame:
# Accessing a single row with loc print(df.loc[(1, 3)]) # Accessing multiple rows print(df.loc[[(1, 3), (2, 2)]])
Output:
A 0.020218 B 0.832620 Name: (1, 3), dtype: float64 A B 1 3 0.020218 0.832620 2 2 0.118274 0.639921
In the first example, you’re indexing into the DataFrame using a tuple: (1, 3)
.
This signifies you want the row with an index of 1 in the first level and an index of 3 in the second level. The output is a Series containing the specified row’s data.
In the second example, we provide a list of tuples to the loc attribute, each denoting a different row in our DataFrame.
Select all Rows from Multiple Levels
Let’s see how to select all rows from multiple levels:
# Select all rows in level 1 with index 2 and level 2 with index 1 print(df.loc[(2, 1),])
Output:
A 0.461479 B 0.780529 Name: (2, 1), dtype: float64
When we provide (2, 1)
as the key to loc, it returns the corresponding row data. The ,
(comma) represents all elements at other levels.
So with (2, 1),
, we’re selecting all rows with first level index 2 and second level index 1.
Let’s go a step further by making use of the slicing operation in Pandas.
# Select all rows where level 1 index is between 2 and 4 print(df.loc[(slice(2, 4), slice(None)), ])
Output:
A B 2 0 0.978618 0.799159 1 0.461479 0.780529 2 0.118274 0.639921 3 0.143353 0.944669 4 0.521848 0.414662 3 0 0.264556 0.774234 1 0.456150 0.568434 2 0.018790 0.617635 3 0.612096 0.616934 4 0.943748 0.681820 4 0 0.359508 0.437032 1 0.697631 0.060225 2 0.666767 0.670638 3 0.210383 0.128926 4 0.315428 0.363711
Here, we’re using slice objects to define the range of our selection. This displays data where the first level index is between 2 and 4, inclusive.
Further, slice(None)
tells Pandas to select all rows in the second level.
Select Data Based on Values
What if you want to select data based on the values?
Let’s look at how to do that:
# Select rows where A > 0.5 df_A = df.loc[df['A'] > 0.5] print(df_A)
Output:
A B 0 0 0.548814 0.715189 1 0.602763 0.544883 4 0.963663 0.383442 1 0 0.791725 0.528895 1 0.568045 0.925597 4 0.778157 0.870012 2 0 0.978618 0.799159 4 0.521848 0.414662 3 3 0.612096 0.616934 4 0.943748 0.681820 4 1 0.697631 0.060225 2 0.666767 0.670638
The output DataFrame, df_A
, includes all rows where the value in column A is greater than 0.5.
We can also add multiple conditions:
# Loc to select rows where A > 0.5 and B < 0.3 df_AB = df.loc[(df['A'] > 0.5) & (df['B'] < 0.3)] print(df_AB)
Output:
A B 4 1 0.697631 0.060225
The selection becomes more precise when we add another condition to it. Now, df_AB
includes rows where column A > 0.5 and column B < 0.3.
Multiple conditions can be combined using ‘&’ (and) or ‘|’ (or) operators.
Select Data from Sorted MultiIndex DataFrame
You may encounter unpredictable results or errors if the DataFrame is not sorted. Here’s how you can sort a MultiIndex DataFrame:
# Sort the DataFrame df_sorted = df.sort_index() print(df_sorted)
The DataFrame df_sorted
is sorted according to the index.
When the DataFrame is sorted, you can make label-based slicing operations using loc:
# Select all rows where the first level index is between 2 and 4 df_sorted_range = df_sorted.loc[(slice(2, 4), slice(None)), ] print(df_sorted_range)
Output:
A B 2 0 0.978618 0.799159 1 0.461479 0.780529 2 0.118274 0.639921 3 0.143353 0.944669 4 0.521848 0.414662 3 0 0.264556 0.774234 1 0.456150 0.568434 2 0.018790 0.617635 3 0.612096 0.616934 4 0.943748 0.681820 4 0 0.359508 0.437032 1 0.697631 0.060225 2 0.666767 0.670638 3 0.210383 0.128926 4 0.315428 0.363711
In this case, df_sorted_range
consists of rows where the first level index is between 2 and 4, inclusive, similar to a previous example.
Apply Boolean Conditions on Index Levels
Based on the above sorted DataFrame, let’s see how to apply a boolean condition on index levels:
# Select all rows with first-level index > 1 and second-level index < 3 df_bool_index = df_sorted.loc[(df_sorted.index.get_level_values(0) > 1) & (df_sorted.index.get_level_values(1) < 3)] print(df_bool_index)
Output:
A B 2 0 0.978618 0.799159 1 0.461479 0.780529 2 0.118274 0.639921 3 0 0.264556 0.774234 1 0.456150 0.568434 2 0.018790 0.617635 4 0 0.359508 0.437032 1 0.697631 0.060225 2 0.666767 0.670638
In this example, the method get_level_values(0)
and get_level_values(1)
are used to access the values of the first and second index level, respectively.
The resulting df_bool_index
contains rows with a first-level index greater than 1 and a second-level index less than 3.
Assign Values to specific indices using MultiIndex
You can use loc to set all DataFrame values based on a condition, or precisely select specific indices:
# Assign a value for a specific index df_sorted.loc[(1, 3), 'A'] = 0.999 print(df_sorted.loc[(1, 3)])
Output:
A 0.99900 B 0.83262 Name: (1, 3), dtype: float64
We used loc to set the value of column ‘A’ at index (1, 3)
to 0.999.
You can also assign values for a range of index levels like this:
# Assign a value for a range of index level df_sorted.loc[(slice(2, 4), slice(None)), 'A'] = 0.123 print(df_sorted.loc[(slice(2, 4), slice(None)), ])
Output:
A B 2 0 0.123 0.799159 1 0.123 0.780529 2 0.123 0.639921 3 0.123 0.944669 4 0.123 0.414662 3 0 0.123 0.774234 1 0.123 0.568434 2 0.123 0.617635 3 0.123 0.616934 4 0.123 0.681820 4 0 0.123 0.437032 1 0.123 0.060225 2 0.123 0.670638 3 0.123 0.128926 4 0.123 0.363711
In this example, we’ve set all values of column ‘A’ where the first level index is between 2 and 4 to 0.123.
Each A-value in the output corresponding to the specified index range now reflects our newly assigned value.
Select Rows and Columns from a MultiIndex DataFrame
You can use loc
to select data from both rows and columns simultaneously from a MultiIndex DataFrame.
# Select specific rows and columns df_row_col = df_sorted.loc[(slice(None), slice(1, 3)), 'B'] print(df_row_col)
Output:
0 1 0.544883 2 0.645894 3 0.891773 1 1 0.925597 2 0.087129 3 0.832620 2 1 0.780529 2 0.639921 3 0.944669 3 1 0.568434 2 0.617635 3 0.616934 4 1 0.060225 2 0.670638 3 0.128926 Name: B, dtype: float64
For the row selection, slice(None)
selects all rows in the first index level, and slice(1, 3)
selects rows where the second level index is between 1 and 3, inclusive.
We also selected column ‘B’ which results in a Series.
You can select multiple columns like this:
# Select all rows for multiple columns df_multi_col = df_sorted.loc[(slice(None), slice(None)), ['A', 'B']] print(df_multi_col)
Output:
A B 0 0 0.548814 0.715189 1 0.602763 0.544883 2 0.423655 0.645894 3 0.437587 0.891773 4 0.963663 0.383442 1 0 0.791725 0.528895 1 0.568045 0.925597 2 0.071036 0.087129 3 0.020218 0.832620 4 0.778157 0.870012 2 0 0.978618 0.799159 1 0.461479 0.780529 2 0.118274 0.639921 3 0.143353 0.944669 4 0.521848 0.414662 3 0 0.264556 0.774234 1 0.456150 0.568434 2 0.018790 0.617635 3 0.612096 0.616934 4 0.943748 0.681820 4 0 0.359508 0.437032 1 0.697631 0.060225 2 0.666767 0.670638 3 0.210383 0.128926 4 0.315428 0.363711
In this example, we selected all rows (slice(None)
) from both index levels, and for column selection, we provided a list of column labels, ['A', 'B']
.
Using Multiple Conditions for Complex Data Selection
You can use multiple conditions to select data from a multi-index DataFrame:
# Select rows where A > 0.5 in level 1 index 2 and B < 0.3 in level 1 index 3 df_complex = df_sorted.loc[((df_sorted.index.get_level_values(0) == 2) & (df_sorted['A'] > 0.5)) | ((df_sorted.index.get_level_values(0) == 3) & (df_sorted['B'] < 0.3))] print(df_complex)
Output:
A B 2 0 0.978618 0.799159 4 0.521848 0.414662
Here, we combined multiple conditions using the ‘pipe’ or ‘|’ character, which translates to ‘or’.
Located within the parentheses, the ‘ampersand’ or ‘&’ denotes ‘and’.
So, we are looking for rows where the first level index is 2 and the value in column A is greater than 0.5, or, rows where the first level index is 3 and the value in column B is less than 0.3.
Handling Missing Values in MultiIndex DataFrame
Let’s first create a DataFrame with missing values.
df_missing = df_sorted.copy() df_missing.loc[(2, 3), 'A'] = np.nan df_missing.loc[(4, 0), 'B'] = np.nan print(df_missing)
Output:
A B 0 0 0.548814 0.715189 1 0.602763 0.544883 2 0.423655 0.645894 3 0.437587 0.891773 4 0.963663 0.383442 1 0 0.791725 0.528895 1 0.568045 0.925597 2 0.071036 0.087129 3 0.020218 0.832620 4 0.778157 0.870012 2 0 0.978618 0.799159 1 0.461479 0.780529 2 0.118274 0.639921 3 NaN 0.944669 4 0.521848 0.414662 3 0 0.264556 0.774234 1 0.456150 0.568434 2 0.018790 0.617635 3 0.612096 0.616934 4 0.943748 0.681820 4 0 0.359508 NaN 1 0.697631 0.060225 2 0.666767 0.670638 3 0.210383 0.128926 4 0.315428 0.363711
In the above DataFrame df_missing
, we intentionally set a missing value in column ‘A’ at index (2, 3)
and column ‘B’ at index (4, 0)
.
You can use notna()
to select non-missing or missing data:
# Select rows where 'A' is not missing df_A_not_missing = df_missing.loc[df_missing['A'].notna()] print(df_A_not_missing)
Output:
A B 0 0 0.548814 0.715189 1 0.602763 0.544883 2 0.423655 0.645894 3 0.437587 0.891773 4 0.963663 0.383442 1 0 0.791725 0.528895 1 0.568045 0.925597 2 0.071036 0.087129 3 0.020218 0.832620 4 0.778157 0.870012 2 0 0.978618 0.799159 1 0.461479 0.780529 2 0.118274 0.639921 4 0.521848 0.414662 3 0 0.264556 0.774234 1 0.456150 0.568434 2 0.018790 0.617635 3 0.612096 0.616934 4 0.943748 0.681820 4 0 0.359508 NaN 1 0.697631 0.060225 2 0.666767 0.670638 3 0.210383 0.128926 4 0.315428 0.363711
The DataFrame df_A_not_missing
consists of all rows from the original DataFrame where A is not missing.
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.