14 Ways to Create Pandas DataFrame in Python

A pandas DataFrame is a 2-dimensional labeled data structure that can accommodate data of different types such as integers, strings, and floats.

Throughout this tutorial, we’ll uncover several different ways to create a pandas DataFrame, employing data structures like lists, dictionaries, Series, NumPy arrays, and even other DataFrames.

 

 

Creating Pandas DataFrame from Lists

Creating a pandas dataframe from a list is a basic and straightforward method. You can do it in several ways.

Simple Lists

Consider the following example where we use a single list to create a DataFrame:

import pandas as pd

# create a simple list
data = ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy']

# create dataframe from list
df = pd.DataFrame(data, columns=['Name'])
print(df)

Output:

      Name
0     Adam
1      Tom
2     Lisa
3      Dan
4      Eve
5    Frank
6    Grace
7    Heidi
8     Ivan
9     Judy

In the code snippet above, we create a pandas dataframe from the list called data which has 10 string elements.

By passing this list to the DataFrame constructor, pd.DataFrame(), we’re telling pandas to create a DataFrame with one column. We’re also providing the column name ‘Name’ through the columns parameter.

 

Nested Lists or List of Lists

Next, let’s examine how we can create a DataFrame using a list of lists:

# create a list of lists
data = [['Adam', 25], ['Tom', 30], ['Lisa', 35], ['Dan', 40], ['Eve', 45], ['Frank', 50], ['Grace', 55], ['Heidi', 60], ['Ivan', 65], ['Judy', 70]]

# create dataframe from list of lists
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Output:

      Name  Age
0     Adam   25
1      Tom   30
2     Lisa   35
3      Dan   40
4      Eve   45
5    Frank   50
6    Grace   55
7    Heidi   60
8     Ivan   65
9     Judy   70

In this code snippet, we’re creating a DataFrame from a list of lists, where each sublist can be seen as a row in the resulting DataFrame.

The parameter columns is utilized to explicitly specify the column names ‘Name’ and ‘Age’.

 

List of Dictionaries

A more flexible way to create a pandas dataframe from a list is to use a list of dictionaries:

# create a list of dictionaries
data = [
    {'Name': 'Adam', 'Age': 25}, 
    {'Name': 'Tom', 'Age': 30}, 
    {'Name': 'Lisa', 'Age': 35},
    {'Name': 'Dan', 'Age': 40},
    {'Name': 'Eve', 'Age': 45},
    {'Name': 'Frank', 'Age': 50},
    {'Name': 'Grace', 'Age': 55},
    {'Name': 'Heidi', 'Age': 60},
    {'Name': 'Ivan', 'Age': 65},
    {'Name': 'Judy', 'Age': 70},
]

# create dataframe from list of dictionaries
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age
0     Adam   25
1      Tom   30
2     Lisa   35
3      Dan   40
4      Eve   45
5    Frank   50
6    Grace   55
7    Heidi   60
8     Ivan   65
9     Judy   70

In this case, we’ve created a DataFrame from a list of dictionaries. Each dictionary within the list represents a row in the DataFrame.

The dictionary keys become the column names. If keys are missing in the dictionaries, pandas will fill that space with NaN values.
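To see the NaN filling in action, here is a minimal sketch in which one dictionary lacks the ‘Age’ key. The records and the name df_missing are made up for illustration:

```python
import pandas as pd

# illustrative records; the second one has no 'Age' key
records = [
    {'Name': 'Adam', 'Age': 25},
    {'Name': 'Tom'},
    {'Name': 'Lisa', 'Age': 35},
]

df_missing = pd.DataFrame(records)
print(df_missing)

# the gap is filled with NaN, which turns the 'Age' column into floats
print(df_missing['Age'].isna().tolist())
```

Because NaN is a floating-point value, a column with a missing integer is stored as float64 rather than int64.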

 

Creating DataFrames from Dictionaries

You can use dictionaries of different structures: dictionary of lists, dictionary of Series, or dictionary of dictionaries.

Dictionary of Lists

A common way to create a DataFrame is to use a dictionary of lists:

data = {
    'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'], 
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age
0     Adam   25
1      Tom   30
2     Lisa   35
3      Dan   40
4      Eve   45
5    Frank   50
6    Grace   55
7    Heidi   60
8     Ivan   65
9     Judy   70

In the above code, we’re passing a dictionary to the DataFrame constructor. Each key-value pair corresponds to a column; the key becomes the column name, and the values form the data in the column.

 

Dictionary of Series

You can also create a DataFrame from a dictionary of Series. Each Series in the dictionary forms a column in the DataFrame:

data = {
    'Name': pd.Series(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy']), 
    'Age': pd.Series([25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age
0     Adam   25
1      Tom   30
2     Lisa   35
3      Dan   40
4      Eve   45
5    Frank   50
6    Grace   55
7    Heidi   60
8     Ivan   65
9     Judy   70

In this case, each series in the dictionary represents a column. The keys of the dictionary are used as column labels. The index is the union of all the Series indexes.
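The index union only becomes visible when the Series do not share the same labels. The following sketch uses two short Series with deliberately different indexes (the labels are chosen just for this illustration):

```python
import pandas as pd

names = pd.Series(['Adam', 'Tom', 'Lisa'], index=[0, 1, 2])
ages = pd.Series([30, 35, 40], index=[1, 2, 3])

# pandas aligns the Series on their labels and keeps every label it sees
df_union = pd.DataFrame({'Name': names, 'Age': ages})
print(df_union)
```

Rows 0 and 3 each appear in only one Series, so the other column holds a missing value there.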

 

Dictionary of Dictionaries (Nested Dictionaries)

Another way to create a DataFrame from a dictionary is to pass a dictionary of dictionaries. In this case, each inner dictionary represents a column:

data = {'Adam': {'Age': 25}, 'Tom': {'Age': 30}, 'Lisa': {'Age': 35}, 'Dan': {'Age': 40}, 'Eve': {'Age': 45}, 'Frank': {'Age': 50}, 'Grace': {'Age': 55}, 'Heidi': {'Age': 60}, 'Ivan': {'Age': 65}, 'Judy': {'Age': 70}}
df = pd.DataFrame(data)
print(df)

Output:

     Adam  Tom  Lisa  Dan  Eve  Frank  Grace  Heidi  Ivan  Judy
Age    25   30    35   40   45     50     55     60    65    70

The keys of the outer dictionary are used as column labels.

The keys of the inner dictionaries are used as row labels. The row index is the union of all inner keys; where an inner dictionary lacks a key, pandas fills that cell with NaN.

However, if you want the same output as the previous examples, you can transform your data using Transpose like this:

data = {'Adam': {'Age': 25}, 'Tom': {'Age': 30}, 'Lisa': {'Age': 35}, 'Dan': {'Age': 40}, 'Eve': {'Age': 45}, 'Frank': {'Age': 50}, 'Grace': {'Age': 55}, 'Heidi': {'Age': 60}, 'Ivan': {'Age': 65}, 'Judy': {'Age': 70}}
df = pd.DataFrame(data).T.reset_index()
df.columns = ['Name', 'Age']
print(df)

Output:

    Name  Age
0   Adam   25
1    Tom   30
2   Lisa   35
3    Dan   40
4    Eve   45
5  Frank   50
6  Grace   55
7  Heidi   60
8   Ivan   65
9   Judy   70

The T transposes the DataFrame (switches the axes) and reset_index makes the index into a column.
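As an alternative to transposing, pandas also offers DataFrame.from_dict() with orient='index', which treats the outer keys as row labels from the start. This is a sketch on a shortened version of the data; the name df_rows is only illustrative:

```python
import pandas as pd

data = {'Adam': {'Age': 25}, 'Tom': {'Age': 30}, 'Lisa': {'Age': 35}}

# orient='index' makes the outer keys row labels instead of column labels
df_rows = pd.DataFrame.from_dict(data, orient='index')

# name the index 'Name' and move it into a regular column
df_rows = df_rows.rename_axis('Name').reset_index()
print(df_rows)
```

One practical difference is that the ‘Age’ column keeps its integer dtype here, since no transpose of mixed data is involved.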

 

Single Series

Creating a DataFrame from a single Series is straightforward:

data = pd.Series(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'], name='Name')
df = pd.DataFrame(data)
print(df)

Output:

      Name
0     Adam
1      Tom
2     Lisa
3      Dan
4      Eve
5    Frank
6    Grace
7    Heidi
8     Ivan
9     Judy

In the above code, we’ve created a DataFrame from a Series, where the Series forms a column in the DataFrame. The name of the Series is used as the column name.

 

Multiple Series

You can also build a DataFrame from multiple Series:

series_1 = pd.Series(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'], name='Name')
series_2 = pd.Series([25, 30, 35, 40, 45, 50, 55, 60, 65, 70], name='Age')
df = pd.DataFrame([series_1, series_2]).transpose()
print(df)

Output:

      Name Age
0     Adam  25
1      Tom  30
2     Lisa  35
3      Dan  40
4      Eve  45
5    Frank  50
6    Grace  55
7    Heidi  60
8     Ivan  65
9     Judy  70

In this example, we create a DataFrame from two Series by first making a list of Series and then transposing the DataFrame.

The name of each Series becomes a column name in the DataFrame.
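One side effect of going through a list and a transpose is that the intermediate rows mix strings and integers, so the resulting columns typically end up with the generic object dtype. If you want each column to keep its own dtype, pd.concat() with axis=1 is a common alternative. A minimal sketch on shortened data:

```python
import pandas as pd

series_1 = pd.Series(['Adam', 'Tom', 'Lisa'], name='Name')
series_2 = pd.Series([25, 30, 35], name='Age')

# place the Series side by side; each keeps its name and its dtype
df_concat = pd.concat([series_1, series_2], axis=1)
print(df_concat)
print(df_concat.dtypes)
```

Here ‘Age’ stays an integer column, which matters as soon as you want to do arithmetic on it.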

 

One-Dimensional NumPy Array

Here’s how to create a DataFrame from a one-dimensional NumPy array:

import numpy as np
import pandas as pd
data = np.array(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'])
df = pd.DataFrame(data, columns=['Name'])
print(df)

Output:

      Name
0     Adam
1      Tom
2     Lisa
3      Dan
4      Eve
5    Frank
6    Grace
7    Heidi
8     Ivan
9     Judy

In the above snippet, we pass a one-dimensional NumPy array to the DataFrame constructor.

This creates a DataFrame with a single column. We provide the column name ‘Name’ using the columns parameter.

 

Two-Dimensional NumPy Array

Let’s see how we can create a DataFrame from a two-dimensional NumPy array:

data = np.array([['Adam', 25], ['Tom', 30], ['Lisa', 35], ['Dan', 40], ['Eve', 45], ['Frank', 50], ['Grace', 55], ['Heidi', 60], ['Ivan', 65], ['Judy', 70]])
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Output:

      Name Age
0     Adam  25
1      Tom  30
2     Lisa  35
3      Dan  40
4      Eve  45
5    Frank  50
6    Grace  55
7    Heidi  60
8     Ivan  65
9     Judy  70

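In this example, we pass a two-dimensional NumPy array to the DataFrame constructor. This creates a DataFrame where each inner array is treated as a row. Because a NumPy array holds a single data type, mixing names and numbers makes NumPy store every value as a string, so the ‘Age’ column contains text such as '25' rather than integers.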

We specify the column names ‘Name’ and ‘Age’ through the columns parameter.
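A NumPy array has one dtype for all of its elements, so an array that mixes names and numbers stores the numbers as strings. The sketch below, on shortened data, checks the array dtype and converts the ‘Age’ column back to integers with astype():

```python
import numpy as np
import pandas as pd

data = np.array([['Adam', 25], ['Tom', 30], ['Lisa', 35]])
print(data.dtype)  # a Unicode string dtype, not an integer one

df_np = pd.DataFrame(data, columns=['Name', 'Age'])

# convert the text values back into integers
df_np['Age'] = df_np['Age'].astype(int)
print(df_np.dtypes)
```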

 

Multidimensional NumPy Arrays

Creating DataFrames from multidimensional arrays requires special considerations as DataFrames are two-dimensional structures.

You can handle a three-dimensional array, for instance, by creating a multi-indexed DataFrame.
Here’s an example:

import numpy as np
import pandas as pd
# create a three-dimensional numpy array
data = np.random.randint(1, 10, (2, 10, 3))  # 2 sets, 10 rows, 3 columns

# create multi-index
index = pd.MultiIndex.from_product([range(s) for s in data.shape], names=['Set', 'Row', 'Column'])

# create dataframe from 3d numpy array
df = pd.DataFrame({'Value': data.flatten()}, index=index)
print(df)

Output:

                 Value
Set Row Column       
0   0   0           3
        1           5
        2           6
    1   0           4
        1           2
...                ...
1   8   2           7
    9   0           3
        1           9
        2           4

In this example, we create a three-dimensional NumPy array with random values.

Since a DataFrame is a two-dimensional structure, we have to flatten the 3D array using the flatten() method and then create a MultiIndex DataFrame that records the ‘Set’, ‘Row’, and ‘Column’ position of each value in the original 3D array.

You need to decide how to represent the data since a pandas DataFrame is inherently a 2D structure.

Note that this method can handle more than three dimensions as well.

This is only one of several ways to convert a NumPy array to a pandas DataFrame.
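To check that the MultiIndex lines up with the positions in the original array, the sketch below uses a small deterministic array instead of random values. It looks up one cell both ways and then uses unstack() to spread the ‘Column’ level back out into columns. The names df_3d and wide are illustrative:

```python
import numpy as np
import pandas as pd

# small deterministic 3D array: 2 sets, 4 rows, 3 columns
data = np.arange(24).reshape(2, 4, 3)

index = pd.MultiIndex.from_product(
    [range(s) for s in data.shape], names=['Set', 'Row', 'Column']
)
df_3d = pd.DataFrame({'Value': data.flatten()}, index=index)

# the same cell, reached through the MultiIndex and through the array
print(df_3d.loc[(1, 2, 0), 'Value'], data[1, 2, 0])

# move the 'Column' level into columns: one row per (Set, Row) pair
wide = df_3d['Value'].unstack('Column')
print(wide.loc[0])  # the 2D table for the first set
```

This works because from_product() and flatten() both iterate with the last dimension changing fastest.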

 

Creating DataFrames from other DataFrames using copy()

We can create a new DataFrame that is a copy of an existing DataFrame using the copy() method.

This is useful if you want to create a new DataFrame to manipulate without affecting the original data.
Here’s how to do it:

data = {
    'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'], 
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
}
df1 = pd.DataFrame(data)

# create a new dataframe that is a copy of df1
df2 = df1.copy()
print(df2)

Output:

      Name  Age
0     Adam   25
1      Tom   30
2     Lisa   35
3      Dan   40
4      Eve   45
5    Frank   50
6    Grace   55
7    Heidi   60
8     Ivan   65
9     Judy   70

In this example, we create an original DataFrame df1 with some data. We then use the copy() method to create df2, which is an independent copy of df1. Any changes made to df2 will not affect df1, and vice versa.

 

Creating DataFrames using Subset Selection

You can create a new DataFrame from a subset of an existing DataFrame. This can be done by selecting certain rows, columns, or both.
Here’s how to do it:

# create an original dataframe
data = {
    'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'], 
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F']
}
df = pd.DataFrame(data)

# create a new dataframe from a subset of the original dataframe
subset_df = df[df['Gender'] == 'F']
print(subset_df)

Output:

    Name  Age Gender
0   Adam   25      F
4    Eve   45      F
6  Grace   55      F
7  Heidi   60      F
9   Judy   70      F

In this example, we start with a DataFrame df containing ‘Name’, ‘Age’ and ‘Gender’ columns. We then create a new DataFrame subset_df that contains only the rows where ‘Gender’ is ‘F’.

This is done using Boolean indexing, where df['Gender'] == 'F' returns a Boolean Series that is True for all rows where ‘Gender’ is ‘F’. This Series is used to select the subset of rows from df.
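Boolean indexing can be combined with a column selection through loc. If you plan to modify the subset afterwards, adding copy() makes it explicit that the result is independent of the original. A minimal sketch on a smaller, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Adam', 'Tom', 'Lisa', 'Dan'],
    'Age': [25, 30, 35, 40],
    'City': ['Cairo', 'Giza', 'Luxor', 'Aswan'],  # illustrative column
})

# rows where Age is above 28, keeping only two of the columns
subset = df.loc[df['Age'] > 28, ['Name', 'Age']].copy()

# changing the subset leaves the original DataFrame untouched
subset['Age'] = subset['Age'] + 1
print(subset)
print(df)
```

Note that the subset keeps the original row labels (1, 2, 3); call reset_index(drop=True) if you want them renumbered from 0.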

 

Creating Dataframe from Text (Using regex)

When the text has no consistent delimiter, we can still build a DataFrame by extracting the needed fields with regular expressions:

import re

# sample text data
text = '''
Adam is 25 years old
Tom is 30 years old
Lisa is 35 years old
Dan is 40 years old
Eve is 45 years old
Frank is 50 years old
Grace is 55 years old
Heidi is 60 years old
Ivan is 65 years old
Judy is 70 years old
'''

# use regex to extract names and ages
names = re.findall(r'(\w+) is \d+ years old', text)
ages = re.findall(r'\w+ is (\d+) years old', text)

# create dataframe from the extracted data
df = pd.DataFrame({
    'Name': names,
    'Age': ages
})
print(df)

Output:

      Name Age
0     Adam  25
1      Tom  30
2     Lisa  35
3      Dan  40
4      Eve  45
5    Frank  50
6    Grace  55
7    Heidi  60
8     Ivan  65
9     Judy  70

In this example, we start with a string that contains the text data. We use the re.findall() function with regular expressions to extract the names and ages from the text.

The extracted names and ages are used to create the DataFrame.

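Note that re.findall() returns lists of strings, which are used as values in the dictionary passed to the DataFrame constructor. The ‘Age’ column therefore holds text until you convert it, for example with astype(int).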
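The two passes over the text can be folded into one. When a pattern has two capture groups, re.findall() returns a list of tuples, which pandas reads as rows. The sketch below does this on a shortened text and converts the ages to integers:

```python
import re
import pandas as pd

text = '''
Adam is 25 years old
Tom is 30 years old
Lisa is 35 years old
'''

# two groups in one pattern: findall returns (name, age) tuples
pairs = re.findall(r'(\w+) is (\d+) years old', text)

df_text = pd.DataFrame(pairs, columns=['Name', 'Age'])
df_text['Age'] = df_text['Age'].astype(int)  # regex matches are strings
print(df_text)
```

Extracting both fields from the same match also guarantees that each name stays paired with its own age.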

 

Common mistakes to avoid

Creating a DataFrame, a fundamental operation in pandas, can sometimes be tricky for various reasons. There are a few common pitfalls that you should be aware of.

Being conscious of these mistakes will enhance your DataFrame creation efficiency and reduce time spent on debugging.

 

Mismatched Data and Column Lengths

One common mistake is trying to build a DataFrame from columns of different lengths.

When you pass a dictionary of lists, every list must have the same length; otherwise, pandas raises an error.

import pandas as pd
try:
    df = pd.DataFrame({
        'Name': ['Adam', 'Tom', 'Lisa'],  # 3 items
        'Age': [25, 30]  # 2 items
    })
except ValueError as e:
    print(e)

This raises a ValueError. Recent pandas versions word the message as “All arrays must be of the same length”, while older versions say “arrays must all be same length”. So always ensure that your columns are of equal length when creating your DataFrame.
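If the shorter column is expected and you would rather pad it than fail, one option is to wrap each list in a Series so that pandas aligns on the index. This is a sketch, not a general recommendation, since padding can also hide a real data problem:

```python
import pandas as pd

raw = {
    'Name': ['Adam', 'Tom', 'Lisa'],  # 3 items
    'Age': [25, 30],                  # 2 items
}

# report the lengths first so the mismatch is visible
lengths = {key: len(values) for key, values in raw.items()}
print(lengths)

# Series are aligned on their index, so the short column is padded with NaN
df_padded = pd.DataFrame({key: pd.Series(values) for key, values in raw.items()})
print(df_padded)
```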

Not Specifying Column Names for List of Lists

When creating a DataFrame from a list of lists without specifying the columns, pandas will automatically assign integer column names starting from 0. This might not be what you intended.

data = [['Adam', 25], ['Tom', 30], ['Lisa', 35]]
df = pd.DataFrame(data)
print(df)

This will create a DataFrame with columns named 0 and 1. To avoid this, always specify column names when creating your DataFrame.

Inconsistent Data Types

DataFrames can hold different data types but when you have inconsistent data types within a single column, it can lead to unexpected results.

For example, if you have a column that is mostly integers, but there’s a single row with a string, pandas will upcast the entire column to object dtype.

import pandas as pd
data = {
    'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve'],
    'Age': [25, 'Thirty', 35, 40, 45]
}
df = pd.DataFrame(data)
print(df)
print("\nData types:")
print(df.dtypes)

Output:

      Name     Age
0     Adam      25
1      Tom  Thirty
2     Lisa      35
3      Dan      40
4      Eve      45

Data types:
Name    object
Age     object
dtype: object

In this example, most of the ‘Age’ values are integers, but because one value (‘Thirty’) is a string, pandas automatically upcasts the entire ‘Age’ column to object.

It’s better to keep a uniform data type for each column for efficiency in storage and operations.
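One way to restore a numeric column is pd.to_numeric() with errors='coerce', which replaces values it cannot parse with NaN. The sketch below applies it to a shortened version of the data; whether dropping ‘Thirty’ to NaN is acceptable depends on your use case:

```python
import pandas as pd

df_ages = pd.DataFrame({
    'Name': ['Adam', 'Tom', 'Lisa'],
    'Age': [25, 'Thirty', 35],
})

# values that cannot be parsed as numbers become NaN
df_ages['Age'] = pd.to_numeric(df_ages['Age'], errors='coerce')
print(df_ages)
print(df_ages.dtypes)
```

The column becomes float64 rather than int64 because NaN is a floating-point value.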

 

Deep Copy vs Shallow Copy

When creating a DataFrame from another DataFrame, a common mistake is to assume that assignment makes a copy. A plain assignment such as df2 = df1 does not create a new DataFrame at all; both names refer to the same object.

That means any change you make through df2 also shows up in df1. Selections and slices add another wrinkle: depending on the operation and the pandas version, they may share data with the original rather than copy it.

To avoid this, use the copy() function when creating a DataFrame from another DataFrame if you don’t want the new DataFrame to be linked to the original.
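The difference between binding a second name and making a real copy can be checked directly. In this sketch, alias and independent are illustrative names:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Adam', 'Tom'], 'Age': [25, 30]})

alias = df1               # no new DataFrame: a second name for the same object
independent = df1.copy()  # a separate DataFrame (deep copy by default)

alias.loc[0, 'Age'] = 99       # also visible through df1
independent.loc[1, 'Age'] = 0  # does not touch df1

print(alias is df1)
print(df1['Age'].tolist())
print(independent['Age'].tolist())
```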

Understanding these common mistakes can help you work more effectively with pandas DataFrames. Remember, practice makes perfect, so the more you work with these structures, the more efficient you’ll become.

 

It Costs An Arm And A Leg

In my years of being a Python developer, I’ve been part of many complex projects, but there’s one in particular that stands out when we discuss the DataFrame creation in Pandas.

The project was for a major telecommunications company here in Egypt, and it involved analyzing massive datasets of user behaviour, network performance, and demographics to better tailor their services and marketing efforts.

We were working with terabytes of data, much of it streaming in real-time from network sensors and user behavior logs.

Each day brought a massive amount of data, its very volume making it a challenge to process and analyze. We used Python’s Pandas library due to its superior ability to handle large datasets and perform complex data manipulations.

One night, as I was wrestling with a particularly tricky bit of analysis, my scripts started throwing ValueErrors. “Arrays must all be the same length”.

I had seen this before, but it was unexpected in this situation. All my data was supposed to come in complete records from our data pipeline.

I started debugging, first checking the latest chunk of data to come in. It turned out that one of our data sources had an issue – a malfunctioning network sensor was sending incomplete data.

For most records, there were seven data points, but the faulty sensor was sending only six.

This was causing the ‘arrays must all be the same length’ error when I tried to create a DataFrame with mismatched column lengths.

Once I identified the issue, I took a two-pronged approach to avoid such problems in the future.

Firstly, I started implementing data validation checks before attempting to create DataFrames.

These checks ensured that each record had the correct number of data points before trying to load it into a DataFrame. This allowed me to catch any issues with the data right at the source.

Secondly, I added more robust error handling to my DataFrame creation scripts.

Now, if there was an issue with the data that caused a ValueError, the script would log the problematic data and continue with the rest of the dataset.

This way, a single problematic record wouldn’t halt the analysis of millions of good records.
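The two safeguards can be sketched roughly as follows. This is not the code from that project; the function build_frame, the column names, and the sample records are all hypothetical, and the records have three fields instead of seven to keep the example short:

```python
import logging
import pandas as pd

logger = logging.getLogger(__name__)


def build_frame(records, columns):
    """Keep records with the expected number of fields and log the rest."""
    good, bad = [], []
    for record in records:
        if len(record) == len(columns):
            good.append(record)
        else:
            bad.append(record)
            logger.warning("Skipping malformed record: %r", record)
    return pd.DataFrame(good, columns=columns), bad


# hypothetical sensor records; the second one is missing a field
records = [
    ('sensor-1', 0.5, 'ok'),
    ('sensor-2', 0.7),
    ('sensor-3', 0.9, 'ok'),
]

frame, rejected = build_frame(records, ['sensor', 'reading', 'status'])
print(frame)
print(rejected)
```

Returning the rejected records alongside the DataFrame keeps the bad data available for inspection instead of silently discarding it.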

This incident was a reminder that when dealing with real-world data, it’s crucial to be prepared for irregularities and faults. While Pandas makes data manipulation easier, you always have to anticipate and plan for the unexpected.

