Add Rows to Pandas MultiIndex DataFrame

In this tutorial, you will learn various methods to add rows to a Pandas MultiIndex DataFrame.

We will use real-world sample data to demonstrate the process step-by-step.

 

 

Using loc[]

First, create a sample MultiIndex DataFrame:

import pandas as pd
arrays = [
    ['2020-Q1', '2020-Q1', '2020-Q2', '2020-Q2'],
    ['AT&T', 'Verizon', 'AT&T', 'Verizon']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Quarter', 'Company'))
data = {'Revenue': [10, 12, 11, 14], 'Expenses': [5, 4, 6, 7]}
df = pd.DataFrame(data, index=index)
print(df)

Output:

                 Revenue  Expenses
Quarter Company                   
2020-Q1 AT&T          10         5
        Verizon       12         4
2020-Q2 AT&T          11         6
        Verizon       14         7

Now, you can add rows using the loc[] method. Here’s how:

# Add a row for AT&T in '2020-Q3'
new_row = pd.Series({'Revenue': 9, 'Expenses': 5}, name=('2020-Q3', 'AT&T'))
df.loc[("2020-Q3", "AT&T"), :] = new_row

# Add another row for Verizon in '2020-Q3'
new_row2 = pd.Series({'Revenue': 13, 'Expenses': 6}, name=('2020-Q3', 'Verizon'))
df.loc[("2020-Q3", "Verizon"), :] = new_row2

# Use .reindex to add the new index values as empty rows
df = df.reindex(df.index.union([("2020-Q3", "AT&T"), ("2020-Q3", "Verizon")]))
print(df)

Output:

                 Revenue  Expenses
Quarter Company                   
2020-Q1 AT&T        10.0       5.0
        Verizon     12.0       4.0
2020-Q2 AT&T        11.0       6.0
        Verizon     14.0       7.0
2020-Q3 AT&T         9.0       5.0
        Verizon     13.0       6.0

First, we insert the index first using .reindex, then assign the values.

We use .loc to select those rows by index and assign the data.

 

Using concat()

The concat() method is useful when you have to concatenate along an axis—either rows or columns.

Add Single Row

Here’s how you can use concat() to add a single row for Verizon in ‘2021-Q1’:

# Create a new row as a DataFrame
new_row = pd.DataFrame({'Revenue': [18], 'Expenses': [9]}, index=pd.MultiIndex.from_tuples([('2021-Q1', 'Verizon')], names=('Quarter', 'Company')))

# Concatenate the new row
df = pd.concat([df, new_row])
print(df)

Output:

                 Revenue  Expenses
Quarter Company                   
2020-Q1 AT&T        10.0       5.0
        Verizon     12.0       4.0
2020-Q2 AT&T        11.0       6.0
        Verizon     14.0       7.0
2020-Q3 AT&T         9.0       5.0
        Verizon     13.0       6.0
2021-Q1 Verizon     18.0       9.0

Add Multiple Rows

To concatenate multiple rows, you’ll create a DataFrame containing these rows and then use concat():

# Create new rows as a DataFrame
new_rows = pd.DataFrame({'Revenue': [19, 20], 'Expenses': [10, 11]}, index=pd.MultiIndex.from_tuples([('2021-Q1', 'T-Mobile'), ('2021-Q2', 'AT&T')], names=('Quarter', 'Company')))

# Concatenate the new rows
df = pd.concat([df, new_rows])
print(df)

Output:

                  Revenue  Expenses
Quarter Company                    
2020-Q1 AT&T         10.0       5.0
        Verizon      12.0       4.0
2020-Q2 AT&T         11.0       6.0
        Verizon      14.0       7.0
2020-Q3 AT&T          9.0       5.0
        Verizon      13.0       6.0
2021-Q1 Verizon      18.0       9.0
        T-Mobile     19.0      10.0
2021-Q2 AT&T         20.0      11.0

 

Adding Rows with Different Levels

Let’s first create a MultiIndex DataFrame that has an additional level called ‘Region’.

# Create sample MultiIndex DataFrame using telecom data with an additional level
arrays = [
    ['2020-Q1', '2020-Q1', '2020-Q2', '2020-Q2'],
    ['AT&T', 'Verizon', 'AT&T', 'Verizon'],
    ['East', 'West', 'East', 'West']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Quarter', 'Company', 'Region'))
data = {'Revenue': [10, 12, 11, 14], 'Expenses': [5, 4, 6, 7]}
df_multi_level = pd.DataFrame(data, index=index)
print(df_multi_level)

Output:

                          Revenue  Expenses
Quarter Company Region                     
2020-Q1  AT&T    East         10         5
        Verizon  West         12         4
2020-Q2  AT&T    East         11         6
        Verizon  West         14         7

If you want to add a row that doesn’t have all the index levels, you’ll need to manually create a MultiIndex with the missing level set to None or some placeholder.

# Create custom index with a missing level
custom_index = pd.MultiIndex.from_tuples([('2020-Q3', 'AT&T', None)], names=('Quarter', 'Company', 'Region'))

# Create a DataFrame with the custom index
custom_row = pd.DataFrame({'Revenue': [9], 'Expenses': [5]}, index=custom_index)

# Append the custom row to the original DataFrame
df_multi_level = df_multi_level.append(custom_row)
print(df_multi_level)

Output:

                          Revenue  Expenses
Quarter Company Region                     
2020-Q1  AT&T    East         10         5
        Verizon  West         12         4
2020-Q2  AT&T    East         11         6
        Verizon  West         14         7
2020-Q3  AT&T    NaN           9         5

In this example, we added a row for AT&T in ‘2020-Q3’ with a missing ‘Region’.

 

Benchmarking Row Addition Methods

To set up the benchmark, let’s first create a large MultiIndex DataFrame with 100,000 rows.

import pandas as pd
import numpy as np

# Set row length
n = 100_000

# Create arrays for MultiIndex
quarters = ['2020-Q1', '2020-Q2', '2020-Q3', '2020-Q4'] * 25_000
companies = ['AT&T', 'Verizon', 'T-Mobile'] * 33_333 + ['AT&T']

# Create MultiIndex
index = pd.MultiIndex.from_arrays([quarters, companies], names=['Quarter', 'Company'])

# Create DataFrame
df_large = pd.DataFrame({
    'Revenue': np.random.randint(1, 20, n),
    'Expenses': np.random.randint(1, 10, n)
}, index=index)
print(df_large.head())

Output:

                  Revenue  Expenses
Quarter Company                    
2020-Q1 AT&T           19         9
2020-Q2 Verizon        15         4
2020-Q3 T-Mobile       16         9
2020-Q4 AT&T           16         1
2020-Q1 Verizon         2         7

Benchmark loc[]

To add a row using loc[], let’s time the operation.

start_time = time.time()
df_large_sorted = df_large.sort_index()  # Sort the DataFrame by the MultiIndex
for _ in range(n):
    row = pd.Series({
        'Revenue': np.random.randint(1, 20),
        'Expenses': np.random.randint(1, 10)
    }, name=('2021-Q1', 'Google'))
    df_large_sorted.loc[row.name] = row
loc_time = time.time() - start_time
df_loc = df_large_sorted.sort_index()
print(f"Time taken to add rows using loc: {loc_time} seconds")

Benchmark concat()

Lastly, let’s benchmark the concat() method.

start_time = time.time()
new_rows = pd.DataFrame({
    'Revenue': np.random.randint(1, 20, n),
    'Expenses': np.random.randint(1, 10, n)
}, index=index)
df_concat = pd.concat([df_large, new_rows])
concat_time = time.time() - start_time
print(f"Time taken to add rows using concat: {concat_time} seconds")

Output:

Time taken to add rows using concat: 0.003000497817993164 seconds
Time taken to add rows using loc: 1.3989920616149902 seconds

The concat method is faster than using loc to add rows to a multiIndex DataFrame, especially when dealing with a large number of rows.

The reason is that concat operates on entire DataFrames at once, whereas loc adds rows one by one in a loop.

Leave a Reply

Your email address will not be published. Required fields are marked *