Using Pandas DataFrame loc Property for Label Based Access

The loc property in Pandas allows for label-based indexing, which means that you use labels (like column names and row indices) to access particular rows and columns of a DataFrame.

Unlike iloc, which uses integer-based indexing, loc is primarily label-based, making it capable of working with data that has meaningful labels.

 

 

Pandas DataFrame loc Syntax

The loc property of Pandas DataFrames provides an interface to access a group of rows and columns by labels or a boolean array.

Here’s the basic syntax:

df.loc[rows, columns]

Rows: This can be a single label, a list of labels, a slice object with labels ('start':'end'), or a boolean array.

Columns: This can be a single label, a list of labels, a slice object with labels ('start':'end'), or a boolean array.

 

Select a Single Row by Its Label

Here’s a simple DataFrame to demonstrate:

import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris

For the purpose of demonstration, let’s first set the ‘Name’ column as the index for our DataFrame.

This will allow us to use the names as labels.

df.set_index('Name', inplace=True)
print(df)

Output:

         Age      City
Name                  
Alice     25  New York
Bob       30    London
Charlie   35     Paris

Now, let’s select the row for ‘Bob’ using the loc property:

# Selecting the row with label 'Bob'
selected_row = df.loc['Bob']
print(selected_row)

Output:

Age         30
City    London
Name: Bob, dtype: object

Here, we’ve accessed the row with the label 'Bob' which corresponds to the data for Bob in our DataFrame.

The result is a Pandas Series with the data for that specific row.

 

Select Multiple Rows Using a List of Labels

The loc property allows you to select multiple rows simultaneously based on their labels by accepting a list of labels.

Based on the previously modified DataFrame (where ‘Name’ is the index), let’s select rows for both ‘Alice’ and ‘Charlie’.

selected_rows = df.loc[['Alice', 'Charlie']]
print(selected_rows)

Output:

         Age      City
Name                  
Alice     25  New York
Charlie   35     Paris

It’s important to be aware that the order in which you specify the labels in the list determines the order of the rows in the resulting DataFrame.

So, if you’d reversed the list to ['Charlie', 'Alice'], Charlie’s data would have appeared before Alice’s in the output.

 

Select a Range of Rows Using Label Slicing

When working with a Pandas DataFrame, you can utilize label slicing with the loc property to retrieve a consecutive range of rows based on their labels.

Unlike Python slicing, label slicing in Pandas is inclusive of the stop label.

Using our name-indexed DataFrame, let’s extract a range of rows from ‘Alice’ through ‘Charlie’.

# Selecting a range of rows from 'Alice' to 'Charlie'
sliced_rows = df.loc['Alice':'Charlie']
print(sliced_rows)

Output:

         Age      City
Name                  
Alice     25  New York
Bob       30    London
Charlie   35     Paris

In this example, the slice 'Alice':'Charlie' has selected all rows starting from ‘Alice’ and ending with ‘Charlie’, inclusive of both endpoints.

 

Select a Single Column Using Its Label Along With All Rows

loc also offers column selection capabilities.

You can extract specific data sections from your DataFrame by specifying the rows and columns you want to select.

When you want to select all rows but only a specific column, you can use a colon (:) for the rows segment.

Let’s retrieve the ‘Age’ column for all rows from our DataFrame:

# Selecting the 'Age' column for all rows
age_column = df.loc[:, 'Age']
print(age_column)

Output:

Name
Alice      25
Bob        30
Charlie    35
Name: Age, dtype: int64

The result is a Pandas Series, where the indices are preserved from the original DataFrame and the data corresponds to the values from the ‘Age’ column.

 

Select Multiple Columns Using a List of Their Labels

You can extract the data from specific columns across all rows by specifying a list of column labels with the loc property.

Using our current DataFrame as an example, let’s select the columns ‘Age’ and ‘City’ for all rows:

selected_columns = df.loc[:, ['Age', 'City']]
print(selected_columns)

Output:

         Age      City
Name                  
Alice     25  New York
Bob       30    London
Charlie   35     Paris

Here, df.loc[:, ['Age', 'City']] fetches all rows (indicated by the colon :) and only the columns ‘Age’ and ‘City’.

 

Select a Range of Columns Using Label Slicing

Just as you can slice rows based on their labels, you can slice columns in a similar way with the loc property.

For our demonstration, let’s first rearrange our columns to have an order, then select a range.

# Rearrange columns for demonstration
df = df[['City', 'Age']]
print(df)

Output:

              City  Age
Name                   
Alice     New York   25
Bob         London   30
Charlie      Paris   35

Now, we will slice to select columns from ‘City’ to ‘Age’:

# Selecting a range of columns from 'City' to 'Age'
sliced_columns = df.loc[:, 'City':'Age']
print(sliced_columns)

Output:

              City  Age
Name                   
Alice     New York   25
Bob         London   30
Charlie      Paris   35

The slice 'City':'Age' selects all columns starting from ‘City’ and ending with ‘Age’, inclusive of both endpoints.

 

Select Specific Rows and Specific Columns Using Labels

Using the loc property, you can combine row and column selection to extract specific cells or sections of your DataFrame based on both row and column labels.

Let’s use our existing DataFrame to select data for ‘Alice’ and ‘Charlie’ from the ‘City’ column:

specific_data = df.loc[['Alice', 'Charlie'], 'City']
print(specific_data)

Output:

Name
Alice     New York
Charlie      Paris
Name: City, dtype: object

In this example, df.loc[['Alice', 'Charlie'], 'City'] returns a Pandas Series with the ‘City’ values for ‘Alice’ and ‘Charlie’.

But, what if we want to select multiple columns for these specific rows? It’s just as straightforward:

# Selecting 'City' and 'Age' data for 'Alice' and 'Charlie'
specific_data_multi = df.loc[['Alice', 'Charlie'], ['City', 'Age']]
print(specific_data_multi)

Output:

              City  Age
Name                   
Alice     New York   25
Charlie      Paris   35

 

Select a Range of Rows and Specific Columns

Using our existing DataFrame, let’s select the range of rows from ‘Alice’ to ‘Bob’ for the columns ‘City’ and ‘Age’:

range_specific_columns = df.loc['Alice':'Bob', ['City', 'Age']]
print(range_specific_columns)

Output:

         City  Age
Name              
Alice  New York   25
Bob      London   30

The result is a new DataFrame segment that has the sliced rows and the designated columns.

 

Select Specific Rows and a Range of Columns

Inversely to our previous section, you can also select specific rows while slicing a range of columns using their labels.

Continuing with our current DataFrame, let’s retrieve data for the rows ‘Alice’ and ‘Charlie’ across the range of columns from ‘City’ to ‘Age’:

rows_range_columns = df.loc[['Alice', 'Charlie'], 'City':'Age']
print(rows_range_columns)

Output:

              City  Age
Name                   
Alice     New York   25
Charlie      Paris   35

In this example, df.loc[['Alice', 'Charlie'], 'City':'Age'] selects the specified rows ‘Alice’ and ‘Charlie’, and the column range starting from ‘City’ and ending with ‘Age’.

 

Select Rows Based on a Condition or Set of Conditions

One of the most powerful features of the loc property is its ability to select rows based on a specific condition or a combination of conditions.

This is valuable when you want to filter your DataFrame to include only rows that meet certain criteria.

Single Condition

For instance, let’s select all rows where the ‘Age’ is greater than 30:

age_above_30 = df.loc[df['Age'] > 30]
print(age_above_30)

Output:

         City  Age
Name              
Charlie  Paris   35

In this case, the condition df['Age'] > 30 returns a Boolean Series that indicates whether each row meets the condition.

The loc property then uses this Series to filter out rows that don’t meet the criteria.

Multiple Conditions

You can combine multiple conditions using the & (and), | (or), and ~ (not) operators.

For example, to select all rows where the ‘Age’ is greater than 25 and the ‘City’ is not ‘London’:

filtered_data = df.loc[(df['Age'] > 25) & ~(df['City'] == 'London')]
print(filtered_data)

Output:

          City  Age
Name               
Charlie  Paris   35

Remember to use parentheses around each condition to ensure proper evaluation order.

Read more about Selecting Data by Multiple Conditions using Pandas loc.

 

Select Rows and Columns Based on Conditions

Pandas provides flexibility not just in selecting rows based on conditions but also in concurrently filtering rows and selecting specific columns.

Rows Based on Condition, Specific Columns

For instance, let’s fetch the ‘City’ column for rows where ‘Age’ is above 30:

city_age_above_30 = df.loc[df['Age'] > 30, 'City']
print(city_age_above_30)

Output:

Name
Charlie    Paris
Name: City, dtype: object

Rows and Columns Based on Multiple Conditions

Suppose you want to extract the ‘City’ column for individuals aged above 25 but exclude those living in ‘London’. Here’s how:

# Selecting the 'City' column based on multiple row conditions
filtered_city = df.loc[(df['Age'] > 25) & ~(df['City'] == 'London'), 'City']
print(filtered_city)

Output:

Name
Charlie    Paris
Name: City, dtype: object

 

Set a Single Value in a Specific Cell

You can use the loc property to change the value of a specific cell in a DataFrame.

Let’s say you want to update the ‘Age’ of ‘Alice’ to 28 in our existing DataFrame. Here’s how you can achieve this:

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Updating the 'Age' of 'Alice' to 28
df.loc['Alice', 'Age'] = 28

# Displaying the updated DataFrame
print("\nUpdated DataFrame:")
print(df)

Output:

Original DataFrame:
              City  Age
Name                   
Alice     New York   25
Bob         London   30
Charlie      Paris   35

Updated DataFrame:
              City  Age
Name                   
Alice     New York   28
Bob         London   30
Charlie      Paris   35

As can be seen in the output, only the specified cell is modified, leaving the rest of the DataFrame unchanged.

 

Set Values for an Entire Column or Row

The loc property allows you to update the values for an entire column or row.

Set Values for an Entire Column

If you want to assign a new value to every cell in a specific column, you can do so directly using the loc property. Let’s assign a value of 40 to the entire ‘Age’ column:

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Setting the entire 'Age' column to 40
df.loc[:, 'Age'] = 40

# Displaying the updated DataFrame
print("\nUpdated DataFrame:")
print(df)

Output:

Original DataFrame:
              City  Age
Name                   
Alice     New York   28
Bob         London   30
Charlie      Paris   35

Updated DataFrame:
              City  Age
Name                   
Alice     New York   40
Bob         London   40
Charlie      Paris   40

Set Values for an Entire Row

Similarly, to update values for an entire row, you can specify the row label and use the assignment operation.

Let’s say we want to update the data for ‘Alice’ to “Los Angeles” for ‘City’ and 40 for ‘Age’:

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Setting the entire row for 'Alice'
df.loc['Alice', :] = ["Los Angeles", 40]

# Displaying the updated DataFrame
print("\nUpdated DataFrame:")
print(df)

Output:

Original DataFrame:
              City  Age
Name                   
Alice     New York   28
Bob         London   30
Charlie      Paris   35

Updated DataFrame:
                 Age    City
Name                        
Alice    Los Angeles      40
Bob               30  London
Charlie           35   Paris

By specifying the row label ‘Alice’ and using : for all columns, we direct the assignment to target the entire row for ‘Alice’.

 

Set Values Based on a Condition or Set of Conditions

Applying conditional logic to modify DataFrame values is a core feature of Pandas. This is useful for tasks such as data cleaning, transformation, and imputation.

Set Values for a Single Column Based on a Condition

Let’s say we want to categorize ages in our DataFrame. We’ll set a new column ‘Category’ as ‘Young’ for ages below 35 and ‘Old’ otherwise:

# Initialize a new column 'Category' with default value
df['Category'] = 'Old'

# Update 'Category' based on the 'Age' condition
df.loc[df['Age'] < 35, 'Category'] = 'Young'

print(df)

Output:

         Age      City Category
Name                           
Alice     25  New York    Young
Bob       30    London    Young
Charlie   35     Paris      Old

Set Values Based on Multiple Conditions

Suppose you want to set the ‘City’ for individuals aged 35 and older who are categorized as ‘Old’ to ‘Unknown’:

df['Category'] = 'Old'
df.loc[(df['Age'] >= 35) & (df['Category'] == 'Old'), 'City'] = 'Unknown'
print(df)

Output:

         Age      City Category
Name                           
Alice     25  New York      Old
Bob       30    London      Old
Charlie   35   Unknown      Old

Set Values Across Multiple Columns Based on a Condition

Maybe you want to reset both ‘City’ and ‘Category’ for individuals whose age is 35:

df['Category'] = 'Old'
df.loc[df['Age'] == 35, ['City', 'Category']] = ['Not Specified', 'Middle-aged']
print(df)

Output:

         Age           City     Category
Name                                    
Alice     25       New York          Old
Bob       30         London          Old
Charlie   35  Not Specified  Middle-aged

 

Using Lambda Functions for Row or Column Selections

When lambda functions are combined with Pandas, they offer a powerful method to make dynamic row or column selections in DataFrames based on custom logic.

Using Lambda with apply() for Row Selection

Suppose you want to select rows where the length of the city name is more than 6 characters:

filtered_rows = df.loc[df['City'].apply(lambda x: len(x) > 6)]
print(filtered_rows)

Output:

       Age      City
Name                
Alice   25  New York

Using Lambda with apply() for Column Selection

To select columns where at least one row has a value length greater than 8:

# Sample DataFrame for this example
df = pd.DataFrame({
    'City': ['New York', 'London', 'Paris'],
    'Description': ['Big Apple', 'The City', 'City of Lights']
})

filtered_columns = df.loc[:, df.applymap(lambda x: len(str(x)) > 8).any()]
print(filtered_columns)

Using Lambda with applymap() for Element-wise Operations

You can also use lambda functions with applymap() for element-wise operations. For instance, to transform all strings in the DataFrame to their upper-case versions:

upper_case_df = df.applymap(lambda x: str(x).upper())
print(upper_case_df)

Output:

       City     Description
0  NEW YORK       BIG APPLE
1    LONDON        THE CITY
2     PARIS  CITY OF LIGHTS

 

Behavior of loc When a Label is Missing

The loc property in Pandas is primarily label-based, which means you use it to refer to rows or columns based on their labels (or index for rows).

However, when you attempt to access a label that does not exist in the DataFrame or Series using loc, it raises a KeyError.

Let’s understand the behavior and handling strategies through examples:

Access a Non-existent Row Label

Suppose we try to access a row with the label ‘Daniel’ in our sample DataFrame, and ‘Daniel’ is not one of the row labels:

try:
    print(df.loc['Daniel'])
except KeyError as e:
    print(f"KeyError: {e}")

Output:

KeyError: 'the label [Daniel] is not in the [index]'

Access a Non-existent Column Label

Similarly, trying to access a column that doesn’t exist will also result in a KeyError. For instance, trying to access a column ‘Salary’:

try:
    print(df.loc[:, 'Salary'])
except KeyError as e:
    print(f"KeyError: {e}")

Output:

KeyError: 'the label [Salary] is not in the [columns]'

Handling Missing Labels

Check Before Accessing: One strategy is to check if the label exists before attempting to access it.

if 'Daniel' in df.index:
    print(df.loc['Daniel'])
else:
    print("'Daniel' not found in the DataFrame.")

Using get() Method: Another approach is to use the get() method which doesn’t raise an error for missing labels but instead returns a default value.

result = df.get('Daniel')
if result is None:
    print("'Daniel' not found in the DataFrame.")
else:
    print(result)

 

Resource

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html

Leave a Reply

Your email address will not be published. Required fields are marked *