14 Ways to Create Pandas DataFrame in Python
A pandas DataFrame is a 2-dimensional labeled data structure that can accommodate data of different types such as integers, strings, and floats.
Throughout this tutorial, we’ll uncover several different ways to create a pandas DataFrame, employing data structures like lists, dictionaries, Series, NumPy arrays, and even other DataFrames.
- 1 Creating Pandas DataFrame from Lists
- 2 Creating DataFrames from Dictionaries
- 3 Single Series
- 4 Multiple Series
- 5 One-Dimensional NumPy Array
- 6 Two-Dimensional NumPy Array
- 7 Multidimensional NumPy Arrays
- 8 Creating DataFrames from other DataFrames using copy()
- 9 Creating DataFrames using Subset Selection
- 10 Creating DataFrame from Text (Using regex)
- 11 Common mistakes to avoid
- 12 It Costs An Arm And Leg
Creating Pandas DataFrame from Lists
Creating a pandas DataFrame from a list is the most basic method, and there are several ways to do it.
Simple Lists
Consider the following example where we use a single list to create a DataFrame:
    import pandas as pd

    # create a simple list
    data = ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy']

    # create dataframe from list
    df = pd.DataFrame(data, columns=['Name'])
    print(df)

Output:

        Name
    0   Adam
    1    Tom
    2   Lisa
    3    Dan
    4    Eve
    5  Frank
    6  Grace
    7  Heidi
    8   Ivan
    9   Judy

In the code snippet above, we create a pandas DataFrame from the list called data, which has 10 string elements.
By passing this list to the DataFrame constructor, pd.DataFrame(), we’re telling pandas to create a DataFrame with one column. We’re also providing the column name ‘Name’ through the columns parameter.
Nested Lists or List of Lists
Next, let’s examine how we can create a DataFrame using a list of lists:
    # create a list of lists
    data = [['Adam', 25], ['Tom', 30], ['Lisa', 35], ['Dan', 40], ['Eve', 45],
            ['Frank', 50], ['Grace', 55], ['Heidi', 60], ['Ivan', 65], ['Judy', 70]]

    # create dataframe from list of lists
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70

In this code snippet, we’re creating a DataFrame from a list of lists, where each sublist can be seen as a row in the resulting DataFrame.
The parameter columns is utilized to explicitly specify the column names ‘Name’ and ‘Age’.
List of Dictionaries
A more flexible way to create a pandas dataframe from a list is to use a list of dictionaries:
    # create a list of dictionaries
    data = [
        {'Name': 'Adam', 'Age': 25},
        {'Name': 'Tom', 'Age': 30},
        {'Name': 'Lisa', 'Age': 35},
        {'Name': 'Dan', 'Age': 40},
        {'Name': 'Eve', 'Age': 45},
        {'Name': 'Frank', 'Age': 50},
        {'Name': 'Grace', 'Age': 55},
        {'Name': 'Heidi', 'Age': 60},
        {'Name': 'Ivan', 'Age': 65},
        {'Name': 'Judy', 'Age': 70},
    ]

    # create dataframe from list of dictionaries
    df = pd.DataFrame(data)
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70

In this case, we’ve created a DataFrame from a list of dictionaries. Each dictionary within the list represents a row in the DataFrame.
The dictionary keys become the column names. If keys are missing in the dictionaries, pandas will fill that space with NaN values.
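To make the NaN behavior concrete, here is a minimal sketch with made-up records in which one dictionary lacks a key and another carries an extra one. The 'City' key and its value are illustrative assumptions, not part of the original data.

```python
import pandas as pd

# Hypothetical records: the second has no 'Age', the third adds 'City'.
records = [
    {'Name': 'Adam', 'Age': 25},
    {'Name': 'Tom'},
    {'Name': 'Lisa', 'Age': 35, 'City': 'Cairo'},
]

df = pd.DataFrame(records)
print(df)

# A column that contains NaN can no longer be an integer column,
# so 'Age' is stored as float64 here.
print(df.dtypes)
```

The columns are the union of all keys seen across the records, and every gap is filled with NaN.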
Creating DataFrames from Dictionaries
You can use dictionaries of different structures: dictionary of lists, dictionary of Series, or dictionary of dictionaries.
Dictionary of Lists
A common way to create a DataFrame is to use a dictionary of lists:
    data = {
        'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'],
        'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
    }
    df = pd.DataFrame(data)
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70
In the above code, we’re passing a dictionary to the DataFrame constructor. Each key-value pair corresponds to a column; the key becomes the column name, and the values form the data in the column.
Dictionary of Series
You can also create a DataFrame from a dictionary of Series. Each Series in the dictionary forms a column in the DataFrame:
    data = {
        'Name': pd.Series(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy']),
        'Age': pd.Series([25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
    }
    df = pd.DataFrame(data)
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70
In this case, each series in the dictionary represents a column. The keys of the dictionary are used as column labels. The index is the union of all the Series indexes.
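The index union only becomes visible when the Series do not share the same index. The following sketch uses hypothetical labels 'a', 'b', and 'c' to show it.

```python
import pandas as pd

# Hypothetical labels: 'ages' has no entry for 'c'.
names = pd.Series(['Adam', 'Tom', 'Lisa'], index=['a', 'b', 'c'])
ages = pd.Series([25, 30], index=['a', 'b'])

df = pd.DataFrame({'Name': names, 'Age': ages})
print(df)
```

The resulting index holds all three labels, and the missing 'Age' for label 'c' is filled with NaN. Values are aligned by label, not by position.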
Dictionary of Dictionaries (Nested Dictionaries)
Another example of creating a dataframe from a dictionary is by passing a dictionary of dictionaries. In this case, each dictionary represents a column:
    data = {'Adam': {'Age': 25}, 'Tom': {'Age': 30}, 'Lisa': {'Age': 35}, 'Dan': {'Age': 40},
            'Eve': {'Age': 45}, 'Frank': {'Age': 50}, 'Grace': {'Age': 55}, 'Heidi': {'Age': 60},
            'Ivan': {'Age': 65}, 'Judy': {'Age': 70}}
    df = pd.DataFrame(data)
    print(df)

Output:

         Adam  Tom  Lisa  Dan  Eve  Frank  Grace  Heidi  Ivan  Judy
    Age    25   30    35   40   45     50     55     60    65    70

The keys of the outer dictionary are used as column labels.
The keys of the inner dictionaries are used as row labels: pandas takes the union of all inner keys as the index and fills in NaN wherever an inner dictionary lacks one of those keys.
However, if you want the same output as the previous examples, you can transform your data using Transpose like this:
    data = {'Adam': {'Age': 25}, 'Tom': {'Age': 30}, 'Lisa': {'Age': 35}, 'Dan': {'Age': 40},
            'Eve': {'Age': 45}, 'Frank': {'Age': 50}, 'Grace': {'Age': 55}, 'Heidi': {'Age': 60},
            'Ivan': {'Age': 65}, 'Judy': {'Age': 70}}
    df = pd.DataFrame(data).T.reset_index()
    df.columns = ['Name', 'Age']
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70

The T attribute transposes the DataFrame (switches the axes) and reset_index makes the index into a column.
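As an alternative to transposing, pandas also provides DataFrame.from_dict with orient='index', which treats the outer keys as rows from the start. This is a sketch with a shortened version of the data, offered as one possible variant.

```python
import pandas as pd

data = {'Adam': {'Age': 25}, 'Tom': {'Age': 30}, 'Lisa': {'Age': 35}}

# orient='index' makes each outer key a row label instead of a column label.
df = pd.DataFrame.from_dict(data, orient='index')

# Name the index, then move it into a regular column.
df = df.rename_axis('Name').reset_index()
print(df)
```

Naming the index before calling reset_index avoids the separate step of overwriting df.columns.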
Single Series
Creating a DataFrame from a single Series is straightforward:
    data = pd.Series(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'], name='Name')
    df = pd.DataFrame(data)
    print(df)

Output:

        Name
    0   Adam
    1    Tom
    2   Lisa
    3    Dan
    4    Eve
    5  Frank
    6  Grace
    7  Heidi
    8   Ivan
    9   Judy
In the above code, we’ve created a DataFrame from a Series, where the Series forms a column in the DataFrame. The name of the Series is used as the column name.
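The same conversion can be written with the Series.to_frame() method, which also accepts a replacement column name. In this short sketch, 'FirstName' is an arbitrary name chosen for illustration.

```python
import pandas as pd

names = pd.Series(['Adam', 'Tom', 'Lisa'], name='Name')

# Uses the Series name as the column name.
df = names.to_frame()

# Overrides the column name; 'FirstName' is a hypothetical choice.
renamed = names.to_frame(name='FirstName')

print(df.columns.tolist())
print(renamed.columns.tolist())
```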
Multiple Series
Another example to create a DataFrame is to create it from multiple Series:
    series_1 = pd.Series(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'], name='Name')
    series_2 = pd.Series([25, 30, 35, 40, 45, 50, 55, 60, 65, 70], name='Age')
    df = pd.DataFrame([series_1, series_2]).transpose()
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70
In this example, we create a DataFrame from two Series by first making a list of Series and then transposing the DataFrame.
The name of each Series becomes a column name in the DataFrame.
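One caveat with the list-and-transpose approach is that it tends to leave every column with the generic object dtype, because each intermediate row mixes strings and integers. A common alternative is pd.concat with axis=1, sketched below on a shortened dataset, which places the Series side by side and keeps each one's dtype.

```python
import pandas as pd

series_1 = pd.Series(['Adam', 'Tom', 'Lisa'], name='Name')
series_2 = pd.Series([25, 30, 35], name='Age')

# axis=1 joins the Series as columns, aligned on their index.
df = pd.concat([series_1, series_2], axis=1)

print(df)
print(df.dtypes)  # 'Age' stays an integer column
```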
One-Dimensional NumPy Array
Here’s how to create a DataFrame from a one-dimensional NumPy array:
    import numpy as np
    import pandas as pd

    data = np.array(['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'])
    df = pd.DataFrame(data, columns=['Name'])
    print(df)

Output:

        Name
    0   Adam
    1    Tom
    2   Lisa
    3    Dan
    4    Eve
    5  Frank
    6  Grace
    7  Heidi
    8   Ivan
    9   Judy

In the above snippet, we pass a one-dimensional NumPy array to the DataFrame constructor.
This creates a DataFrame with a single column. We provide the column name ‘Name’ using the columns parameter.
Two-Dimensional NumPy Array
Let’s see how we can create a DataFrame from a two-dimensional NumPy array:
    data = np.array([['Adam', 25], ['Tom', 30], ['Lisa', 35], ['Dan', 40], ['Eve', 45],
                     ['Frank', 50], ['Grace', 55], ['Heidi', 60], ['Ivan', 65], ['Judy', 70]])
    df = pd.DataFrame(data, columns=['Name', 'Age'])
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70

In this example, we pass a two-dimensional NumPy array to the DataFrame constructor. This creates a DataFrame where each inner array is treated as a row.
We specify the column names ‘Name’ and ‘Age’ through the columns parameter.
Keep in mind that a NumPy array holds a single data type, so this mixed input is stored entirely as strings. The ‘Age’ column therefore contains text rather than integers, even though the printed output looks the same.
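When a NumPy array mixes names and numbers, it is worth checking the resulting dtypes, since NumPy arrays are homogeneous. The sketch below, on a shortened dataset, inspects the array's dtype and converts the 'Age' column back to integers with astype.

```python
import numpy as np
import pandas as pd

data = np.array([['Adam', 25], ['Tom', 30], ['Lisa', 35]])

# NumPy picks one dtype for the whole array: a Unicode string type here.
print(data.dtype)

df = pd.DataFrame(data, columns=['Name', 'Age'])

# The ages arrived as text, so convert them explicitly.
df['Age'] = df['Age'].astype(int)
print(df.dtypes)
```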
Multidimensional NumPy Arrays
Creating DataFrames from multidimensional arrays requires special considerations as DataFrames are two-dimensional structures.
You can handle a three-dimensional array, for instance, by creating a multi-indexed DataFrame.
Here’s an example:

    import numpy as np
    import pandas as pd

    # create a three-dimensional numpy array
    data = np.random.randint(1, 10, (2, 10, 3))  # 2 sets, 10 rows, 3 columns

    # create multi-index
    index = pd.MultiIndex.from_product([range(s) for s in data.shape], names=['Set', 'Row', 'Column'])

    # create dataframe from 3d numpy array
    df = pd.DataFrame({'Value': data.flatten()}, index=index)
    print(df)

Output:

                    Value
    Set Row Column
    0   0   0           3
            1           5
            2           6
        1   0           4
            1           2
    ...               ...
    1   8   2           7
        9   0           3
            1           9
            2           4

In this example, we create a three-dimensional NumPy array with random values, so your numbers will differ.
Since a DataFrame is a two-dimensional structure, we flatten the 3D array using the flatten() method and then create a MultiIndex DataFrame that records the ‘Set’, ‘Row’, and ‘Column’ position of each value in the original 3D array.
You need to decide how to represent the data since a pandas DataFrame is inherently a 2D structure.
Note that this method can handle more than three dimensions as well, provided you supply one index name per dimension.
This is only one of several ways to convert a NumPy array to a pandas DataFrame.
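As one example of a different representation, the last axis of the array can be kept as real columns while only the first two axes go into the MultiIndex. This is a sketch under assumptions: the column names 'A', 'B', 'C' are hypothetical, and np.arange replaces the random data so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Deterministic 3D array: 2 sets, 4 rows, 3 columns.
data = np.arange(2 * 4 * 3).reshape(2, 4, 3)
n_sets, n_rows, n_cols = data.shape

# Index only the first two axes; the third axis becomes the columns.
index = pd.MultiIndex.from_product(
    [range(n_sets), range(n_rows)], names=['Set', 'Row']
)

df = pd.DataFrame(
    data.reshape(n_sets * n_rows, n_cols),
    index=index,
    columns=['A', 'B', 'C'],  # hypothetical column names
)

print(df)
print(df.loc[1])  # all rows that belong to set 1
```

This layout is wider and shorter than the fully flattened version, which can be easier to read when the last axis has a natural meaning.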
Creating DataFrames from other DataFrames using copy()
We can create a new DataFrame that is a copy of an existing DataFrame using the copy() method.
This is useful if you want to create a new DataFrame to manipulate without affecting the original data.
Here’s how to do it:

    data = {
        'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'],
        'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]
    }
    df1 = pd.DataFrame(data)

    # create a new dataframe that is a copy of df1
    df2 = df1.copy()
    print(df2)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70

In this example, we create an original DataFrame df1 with some data. We then use the copy() method to create df2, which is an independent copy of df1. Any changes made to df2 will not affect df1, and vice versa.
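The independence of the copy can be checked directly. This small sketch on a shortened dataset changes one value in the copy and confirms that the original is untouched.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Adam', 'Tom'], 'Age': [25, 30]})
df2 = df1.copy()  # copy() makes a deep copy by default (deep=True)

print(df2.equals(df1))  # same contents
print(df2 is df1)       # but a different object

# Change the copy only.
df2.loc[0, 'Age'] = 99

print(df1.loc[0, 'Age'])  # 25
print(df2.loc[0, 'Age'])  # 99
```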
Creating DataFrames using Subset Selection
You can create a new DataFrame from a subset of an existing DataFrame. This can be done by selecting certain rows, columns, or both.
Here’s how to do it:
    # create an original dataframe
    data = {
        'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve', 'Frank', 'Grace', 'Heidi', 'Ivan', 'Judy'],
        'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
        'Gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F', 'F', 'M', 'F']
    }
    df = pd.DataFrame(data)

    # create a new dataframe from a subset of the original dataframe
    subset_df = df[df['Gender'] == 'F']
    print(subset_df)

Output:

        Name  Age Gender
    0   Adam   25      F
    4    Eve   45      F
    6  Grace   55      F
    7  Heidi   60      F
    9   Judy   70      F

In this example, we start with a DataFrame df containing ‘Name’, ‘Age’ and ‘Gender’ columns. We then create a new DataFrame subset_df that contains only the rows where ‘Gender’ is ‘F’.
This is done using Boolean indexing, where df['Gender'] == 'F' returns a Boolean Series that is True for all rows where ‘Gender’ is ‘F’. This Series is used to select the subset of rows from df.
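Rows and columns can be selected together with .loc, and adding .copy() makes it explicit that the subset is meant to be an independent DataFrame. This is a sketch on a shortened dataset, and the age threshold of 28 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Adam', 'Tom', 'Lisa', 'Dan'],
    'Age': [25, 30, 35, 40],
    'Gender': ['F', 'M', 'M', 'M'],
})

# Keep rows with Age above 28 and only the 'Name' and 'Age' columns.
subset_df = df.loc[df['Age'] > 28, ['Name', 'Age']].copy()

# The original row labels (1, 2, 3) are preserved unless you reset them.
subset_df = subset_df.reset_index(drop=True)
print(subset_df)
```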
Creating DataFrame from Text (Using regex)
We can also create a DataFrame from free-form text, whatever its delimiter, by using regular expressions to parse and extract the needed data:

    import re

    # sample text data
    text = '''
    Adam is 25 years old
    Tom is 30 years old
    Lisa is 35 years old
    Dan is 40 years old
    Eve is 45 years old
    Frank is 50 years old
    Grace is 55 years old
    Heidi is 60 years old
    Ivan is 65 years old
    Judy is 70 years old
    '''

    # use regex to extract names and ages
    names = re.findall(r'(\w+) is \d+ years old', text)
    ages = re.findall(r'\w+ is (\d+) years old', text)

    # create dataframe from the extracted data
    df = pd.DataFrame({
        'Name': names,
        'Age': ages
    })
    print(df)

Output:

        Name  Age
    0   Adam   25
    1    Tom   30
    2   Lisa   35
    3    Dan   40
    4    Eve   45
    5  Frank   50
    6  Grace   55
    7  Heidi   60
    8   Ivan   65
    9   Judy   70

In this example, we start with a string that contains the text data. We use the re.findall() function with regular expressions to extract the names and ages from the text.
The extracted names and ages are used to create the DataFrame.
Note that re.findall() returns lists of strings, which are used as values in the dictionary passed to the DataFrame constructor. The ‘Age’ column therefore holds text until you convert it to a numeric type.
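Both fields can also be captured in a single pass. When a pattern has two groups, re.findall() returns a list of tuples, which maps naturally onto DataFrame rows. The sketch below uses a shortened text and converts the ages to integers afterwards.

```python
import re
import pandas as pd

text = '''
Adam is 25 years old
Tom is 30 years old
Lisa is 35 years old
'''

# Two capture groups: findall returns one (name, age) tuple per match.
rows = re.findall(r'(\w+) is (\d+) years old', text)

df = pd.DataFrame(rows, columns=['Name', 'Age'])

# The captured ages are strings, so convert them explicitly.
df['Age'] = df['Age'].astype(int)
print(df)
```

Extracting both fields with one pattern also guarantees that each name stays paired with its own age, which two separate findall calls cannot promise if a line only partially matches.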
Common mistakes to avoid
Creating a DataFrame, a fundamental operation in pandas, can sometimes be tricky for various reasons. There are a few common pitfalls that you should be aware of.
Being conscious of these mistakes will enhance your DataFrame creation efficiency and reduce time spent on debugging.
Mismatched Data and Column Lengths
One common mistake is trying to create a DataFrame from columns of different lengths.
When you pass a dictionary of lists, every list must have the same length; otherwise, pandas raises an error.

    import pandas as pd

    try:
        df = pd.DataFrame({
            'Name': ['Adam', 'Tom', 'Lisa'],  # 3 items
            'Age': [25, 30]                   # 2 items
        })
    except ValueError as e:
        print(e)

This raises a ValueError. Depending on the pandas version, the message reads “arrays must all be same length” or “All arrays must be of the same length”. So always ensure that your data arrays are of equal length when creating your DataFrame.
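A quick length check before calling the constructor makes this error easier to diagnose. The sketch below also shows one possible fallback: wrapping each list in a Series so that pandas pads the shorter column with NaN. Whether padding is acceptable depends on your data, so treat the fallback as an assumption.

```python
import pandas as pd

columns = {
    'Name': ['Adam', 'Tom', 'Lisa'],  # 3 items
    'Age': [25, 30],                  # 2 items
}

# Compare the length of every column before building the DataFrame.
lengths = {key: len(values) for key, values in columns.items()}

if len(set(lengths.values())) > 1:
    print(f'Unequal column lengths: {lengths}')
    # Fallback: Series are aligned by index, so short columns get NaN.
    df = pd.DataFrame({key: pd.Series(values) for key, values in columns.items()})
else:
    df = pd.DataFrame(columns)

print(df)
```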
Not Specifying Column Names for List of Lists
When creating a DataFrame from a list of lists without specifying the columns, pandas will automatically assign integer column names starting from 0. This might not be what you intended.
    data = [['Adam', 25], ['Tom', 30], ['Lisa', 35]]
    df = pd.DataFrame(data)
    print(df)
This will create a DataFrame with columns named 0 and 1. To avoid this, always specify column names when creating your DataFrame.
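If a DataFrame has already been created with the default integer labels, the names can still be assigned afterwards. A brief sketch:

```python
import pandas as pd

data = [['Adam', 25], ['Tom', 30], ['Lisa', 35]]

df = pd.DataFrame(data)
print(df.columns.tolist())  # default integer labels

# Assign meaningful names after the fact.
df.columns = ['Name', 'Age']
print(df.columns.tolist())
```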
Inconsistent Data Types
DataFrames can hold different data types but when you have inconsistent data types within a single column, it can lead to unexpected results.
For example, if you have a column that is mostly integers, but there’s a single row with a string, pandas will upcast the entire column to object dtype.
    import pandas as pd

    data = {
        'Name': ['Adam', 'Tom', 'Lisa', 'Dan', 'Eve'],
        'Age': [25, 'Thirty', 35, 40, 45]
    }
    df = pd.DataFrame(data)
    print(df)
    print("\nData types:")
    print(df.dtypes)

Output:

       Name     Age
    0  Adam      25
    1   Tom  Thirty
    2  Lisa      35
    3   Dan      40
    4   Eve      45

    Data types:
    Name    object
    Age     object
    dtype: object

In this example, most of the ‘Age’ values are integers, but because one value (‘Thirty’) is a string, pandas automatically upcasts the entire ‘Age’ column to object.
It’s better to keep a uniform data type for each column for efficiency in storage and operations.
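One way to restore a numeric column is pd.to_numeric with errors='coerce', which turns values that cannot be parsed into NaN so that they can be found and handled. This is a sketch on a shortened dataset, and how to treat the rejected rows is left as a decision for your own data.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Adam', 'Tom', 'Lisa'],
    'Age': [25, 'Thirty', 35],
})

# Values that are not numbers become NaN instead of raising an error.
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
print(df.dtypes)  # 'Age' is now float64

# Inspect the rows that failed to convert.
bad_rows = df[df['Age'].isna()]
print(bad_rows)
```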
Deep Copy vs Shallow Copy
When creating a DataFrame from another DataFrame, a common mistake is to assume that you always get an independent copy.
Plain assignment (df2 = df1) copies nothing: both names refer to the same object, so a change made through one is visible through the other. Selecting part of a DataFrame may return a view or a copy depending on the operation and the pandas version, which is why modifying the result can trigger a SettingWithCopyWarning in many pandas versions or, in some cases, alter the original.
To avoid this, use the copy() method when creating a DataFrame from another DataFrame if you don’t want the new DataFrame to be linked to the original.
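The difference between a second name for the same DataFrame and a real copy can be seen in a few lines. This sketch uses a shortened dataset, and the variable names alias and independent are chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Adam', 'Tom'], 'Age': [25, 30]})

alias = df1               # no copy: a second name for the same object
independent = df1.copy()  # a separate DataFrame

# Modify the data through the alias.
alias.loc[0, 'Age'] = 99

print(alias is df1)               # True
print(df1.loc[0, 'Age'])          # 99, the original changed too
print(independent.loc[0, 'Age'])  # 25, the copy is unaffected
```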
Understanding these common mistakes can help you work more effectively with pandas DataFrames. Remember, practice makes perfect, so the more you work with these structures, the more efficient you’ll become.
It Costs An Arm And Leg
In my years of being a Python developer, I’ve been part of many complex projects, but there’s one in particular that stands out when it comes to DataFrame creation in pandas.
The project was for a major telecommunications company here in Egypt, and it involved analyzing massive datasets of user behavior, network performance, and demographics to better tailor their services and marketing efforts.
We were working with terabytes of data, much of it streaming in real-time from network sensors and user behavior logs.
Each day brought a massive amount of data, its very volume making it a challenge to process and analyze. We used Python’s Pandas library due to its superior ability to handle large datasets and perform complex data manipulations.
One night, as I was wrestling with a particularly tricky bit of analysis, my scripts started throwing ValueErrors. “Arrays must all be the same length”.
I had seen this before, but it was unexpected in this situation. All my data was supposed to come in complete records from our data pipeline.
I started debugging, first checking the latest chunk of data to come in. It turned out that one of our data sources had an issue – a malfunctioning network sensor was sending incomplete data.
For most records, there were seven data points, but the faulty sensor was sending only six.
This was causing the ‘arrays must all be the same length’ error when I tried to create a DataFrame with mismatched column lengths.
Once I identified the issue, I took a two-pronged approach to avoid such problems in the future.
Firstly, I started implementing data validation checks before attempting to create DataFrames.
These checks ensured that each record had the correct number of data points before trying to load it into a DataFrame. This allowed me to catch any issues with the data right at the source.
Secondly, I added more robust error handling to my DataFrame creation scripts.
Now, if there was an issue with the data that caused a ValueError, the script would log the problematic data and continue with the rest of the dataset.
This way, a single problematic record wouldn’t halt the analysis of millions of good records.
This incident was a reminder that when dealing with real-world data, it’s crucial to be prepared for irregularities and faults. While Pandas makes data manipulation easier, you always have to anticipate and plan for the unexpected.
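The two safeguards described in this story, validating records before loading them and logging bad records instead of stopping, could be sketched roughly as follows. This is not the code from the project: the function name build_frame, the field names, and the sample records are all hypothetical.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)


def build_frame(records, columns):
    """Build a DataFrame from complete records and return the rejected ones.

    Hypothetical helper: a record is accepted only if it has exactly
    one value per expected column.
    """
    good, bad = [], []
    for record in records:
        if len(record) == len(columns):
            good.append(record)
        else:
            bad.append(record)
            logger.warning('Skipping malformed record: %r', record)
    return pd.DataFrame(good, columns=columns), bad


# Made-up sensor records; the second one is missing a field.
records = [('s1', 10.5, 'ok'), ('s2', 11.0), ('s3', 9.8, 'ok')]

df, rejected = build_frame(records, ['sensor', 'reading', 'status'])
print(df)
print(rejected)
```

Returning the rejected records alongside the DataFrame keeps the good data flowing while leaving a trail that can be used to trace the faulty source.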
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.