Convert Pandas DataFrame to NumPy array using to_numpy()
The DataFrame.to_numpy() method, provided by the Pandas library, offers a straightforward way to transform a DataFrame into a NumPy array.
It returns an ndarray (NumPy’s basic data structure), which can easily be manipulated using various NumPy library functions.
This is especially useful when you want to perform operations that are easier or faster to implement in NumPy compared to Pandas.
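As a quick illustration of that workflow, the sketch below converts a small DataFrame and applies two NumPy functions to the result. The column names and values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for this illustration
df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "width": [4.0, 5.0, 6.0]})

arr = df.to_numpy()

# The result is a plain ndarray, so any NumPy function applies to it
col_means = np.mean(arr, axis=0)          # mean of each column
row_norms = np.linalg.norm(arr, axis=1)   # Euclidean length of each row

print(type(arr).__name__, arr.shape)
print(col_means)
print(row_norms)
```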
Why Convert from Pandas DataFrame to NumPy?
There are several reasons why you might want to convert a Pandas DataFrame to a NumPy array.
When it comes to numerical or mathematical operations, the NumPy library is often more efficient because its arrays hold elements of a single data type, which allows for fast, vectorized computation.
Additionally, many machine learning libraries, such as Scikit-learn, are built around NumPy arrays and expect array-like input.
Later in this tutorial, you’ll see some practical uses and real-world examples of converting DataFrame into a NumPy array.
Syntax and parameters
The DataFrame.to_numpy() method is straightforward. Its syntax is as follows:
DataFrame.to_numpy(dtype=None, copy=False, na_value=<no_default>)
Where:
dtype is an optional parameter specifying the desired data type for the array. If not provided, Pandas picks a common dtype that can hold the values of all columns.
copy is a boolean flag which, when set to True, ensures that the returned array is a copy of the DataFrame’s data. The default value is False, which does not guarantee that the result is a view; Pandas may still need to copy the data.
na_value defines the value used for missing values in the array. If not specified, the default depends on dtype and on the dtypes of the DataFrame’s columns.
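The three parameters can be tried out side by side. The following is a minimal sketch on a made-up DataFrame, not an exhaustive tour of the options.

```python
import numpy as np
import pandas as pd

# Made-up data: one integer column and one float column with a gap
df = pd.DataFrame({"A": [1, 2, 3], "B": [4.0, np.nan, 6.0]})

# dtype: request a specific element type for the whole array
as_float32 = df.to_numpy(dtype="float32")

# copy=True: the array is guaranteed to be independent of the DataFrame
independent = df.to_numpy(copy=True)
independent[0, 0] = 99.0  # changes the array only, not df

# na_value: substitute a value for missing entries during conversion
filled = df.to_numpy(na_value=0.0)

print(as_float32.dtype)
print(df.loc[0, "A"])
print(filled)
```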
Convert Pandas DataFrame to NumPy array
To convert a Pandas DataFrame into a NumPy array, follow these steps.
Step 1: Import the required libraries
The first step involves importing Pandas and NumPy.
import pandas as pd
import numpy as np
Step 2: Create a DataFrame
Next, create a DataFrame:
df = pd.DataFrame({
    'A': [1.5, 2.3, 3.1],
    'B': [4.2, 5.8, 6.7]
})
You can create your DataFrame by loading the data using any of the following ways:
Read CSV using Python Pandas read_csv
Read JSON files using Python Pandas read_json
Read SQL Query/Table into DataFrame using Pandas read_sql
Read HTML tables using Pandas read_html function
Read Parquet files using Pandas read_parquet
Step 3: Convert DataFrame to NumPy Array
Now, convert the DataFrame to a NumPy array using the DataFrame.to_numpy() method. You can optionally specify the data type. In this case, let’s convert the data to ‘float64’.
array = df.to_numpy(dtype='float64')
Step 4: Print the resulting array or process it the way you want.
Lastly, print the resulting NumPy array to confirm the conversion.
print(array)
Output:
[[1.5 4.2]
 [2.3 5.8]
 [3.1 6.7]]
We have now successfully converted a Pandas DataFrame into a NumPy array of type ‘float64’.
Each row from the DataFrame corresponds to a row in the ndarray, preserving the original structure.
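To see how the structure carries over, here is a small sketch that compares shapes and converts a single column. It reuses the values from the steps above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.5, 2.3, 3.1], "B": [4.2, 5.8, 6.7]})

array = df.to_numpy()

# Rows and columns keep their positions: array[i, j] matches df.iloc[i, j]
print(array.shape == df.shape)
print(array[1, 0] == df.iloc[1, 0])

# A single column (a Series) converts to a 1-D array ...
col_1d = df["A"].to_numpy()
# ... while a one-column DataFrame stays 2-D
col_2d = df[["A"]].to_numpy()
print(col_1d.shape, col_2d.shape)
```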
Data Type Handling
The DataFrame.to_numpy() method provides a dtype argument that lets you specify the desired output data type.
If it is not specified, the method picks a common data type that can hold the values of every column.
However, this can sometimes lead to a dtype of ‘object’ if the DataFrame contains mixed data types, which might not be desirable, especially when you aim to perform mathematical operations on the resulting array.
Here is an example of specifying the dtype during conversion:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})
array = df.to_numpy(dtype='str')
print(array)
Output:
[['1' 'a']
 ['2' 'b']
 ['3' 'c']]
Here, even though column ‘A’ contains integers, we’ve successfully converted the entire DataFrame into a NumPy array of dtype ‘str’.
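The inference behaviour described in this section can be checked with a short sketch. The two DataFrames are made up, and select_dtypes is shown as one possible way to keep the result numeric, not the only one.

```python
import pandas as pd

ints_and_floats = pd.DataFrame({"A": [1, 2, 3], "B": [0.5, 1.5, 2.5]})
ints_and_text = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})

# Integers and floats share a numeric type, so the result is float64
print(ints_and_floats.to_numpy().dtype)

# Numbers and strings have no common numeric type, so the result is object
print(ints_and_text.to_numpy().dtype)

# One way to stay numeric: convert only the numeric columns
numeric_only = ints_and_text.select_dtypes(include="number").to_numpy()
print(numeric_only.dtype, numeric_only.shape)
```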
Preserving Metadata
One important thing to remember when converting a Pandas DataFrame to a NumPy array is that metadata such as column names and indices are not preserved in the resulting NumPy array, as it’s a lower-level data structure.
However, you can separately store the column names in a variable before the conversion, like this:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# store column names
column_names = df.columns.tolist()

array = df.to_numpy()
print("Column Names:", column_names)
print("Array:\n", array)
Output:
Column Names: ['A', 'B']
Array:
 [[1 4]
 [2 5]
 [3 6]]
In the output, we have the column names preserved in the column_names list, and the data converted to a NumPy array.
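The saved labels can also be reattached once the NumPy-side work is done. Here is a minimal sketch of that round trip; the index labels and the doubling step are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}, index=["x", "y", "z"])

# Save the labels before converting
columns = df.columns
index = df.index

array = df.to_numpy()
doubled = array * 2  # placeholder for some NumPy-side processing

# Reattach the saved labels to the processed array
restored = pd.DataFrame(doubled, columns=columns, index=index)
print(restored)
```

This only works as long as the processing keeps the array's shape and the order of its rows and columns.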
How to Handle Missing Values?
When converting a DataFrame to a NumPy array using DataFrame.to_numpy(), by default, Pandas keeps missing values in the ndarray using a representation that depends on the dtype, which is usually np.nan.
If you want to fill missing values with a specific value, you can use the na_value parameter of the to_numpy() method, as shown below:
df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, 6]
})

# convert dataframe to numpy array filling NaN with -1
array = df.to_numpy(na_value=-1)
print(array)
Output:
[[ 1.  4.]
 [ 2. -1.]
 [-1.  6.]]
In this example, we used the na_value parameter to replace all NaNs with -1 in the resulting ndarray.
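For comparison, the sketch below shows what happens without na_value, and checks that filling in Pandas first gives the same array. It reuses the data from the example above and assumes a float DataFrame, where missing values are NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [4, np.nan, 6]})

# Without na_value, the missing entries come through as NaN
raw = df.to_numpy()
n_missing = int(np.isnan(raw).sum())
print(n_missing)  # 2

# Filling in Pandas first gives the same array as passing na_value
via_fillna = df.fillna(-1).to_numpy()
via_na_value = df.to_numpy(na_value=-1)
print(np.array_equal(via_fillna, via_na_value))  # True
```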
Practical Examples of Using to_numpy()
There can be instances where you need to convert a DataFrame to a NumPy array for certain operations that are easier or more efficient in NumPy.
Machine Learning with Scikit-Learn
Suppose you are working on a project that predicts house prices based on features like the number of bedrooms, the size of the house, the location, and so on.
You can start off using Pandas to handle your data because it provides powerful data manipulation tools and works well with heterogeneously-typed data.
import pandas as pd

data = pd.read_csv('house_prices.csv')
Now, when you want to train a machine learning model on this data using scikit-learn, you can convert the DataFrame to NumPy arrays. This assumes the feature columns are already numeric; a text column such as the location would need to be encoded first.
# Split the data into features and target
X = data.drop('Price', axis=1)  # Features
y = data['Price']  # Target

# Convert DataFrame to NumPy array
X = X.to_numpy()
y = y.to_numpy()
Now you can use this data for training and testing in a scikit-learn model:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
In this example, the pandas DataFrame was converted to NumPy arrays before being passed to scikit-learn. Note that scikit-learn estimators also accept DataFrames directly and convert them to arrays internally, so the explicit conversion is optional.
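Since house_prices.csv is not available here, the following self-contained sketch walks through the same features and target split on synthetic data. The column names Bedrooms and Size and all values are hypothetical, and NumPy's least-squares solver is used as a plain stand-in for LinearRegression so that the sketch needs nothing beyond Pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Hypothetical, synthetic data standing in for house_prices.csv
data = pd.DataFrame({
    "Bedrooms": [2, 3, 3, 4, 5],
    "Size": [80, 100, 120, 150, 200],
})
data["Price"] = 50_000 * data["Bedrooms"] + 1_000 * data["Size"] + 20_000

# Keep the feature names, since the arrays will not carry them
feature_names = data.columns.drop("Price").tolist()
X = data.drop("Price", axis=1).to_numpy(dtype="float64")
y = data["Price"].to_numpy(dtype="float64")
print(X.shape, y.shape)

# Stand-in for LinearRegression: least squares with an intercept column
X_design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

for name, value in zip(feature_names + ["intercept"], coef):
    print(f"{name}: {value:.1f}")
```

Because the synthetic prices are an exact linear function of the features, the fitted coefficients should recover the numbers used to build them.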
Image Processing with OpenCV
Suppose you’re working on a computer vision project that classifies images of handwritten digits. The labels (the digits) and the image file names are stored in a CSV file.
import pandas as pd
data = pd.read_csv('image_labels.csv')
print(data.head())
Output:
   ImageName  Label
0   img1.png      7
1   img2.png      2
2   img3.png      1
3   img4.png      0
4   img5.png      4
Next, you load the images for further processing with OpenCV by iterating over the file names, and you convert the labels column to a NumPy array.
import cv2
import numpy as np

images = [cv2.imread(f'images/{name}', cv2.IMREAD_GRAYSCALE) for name in data['ImageName']]

# Convert list to NumPy array
images = np.array(images)

# Similarly, convert labels to NumPy array
labels = data['Label'].to_numpy()
This example illustrates how you start with Pandas for handling and inspecting data, but then need to convert to a NumPy array for image processing.
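One practical detail: cv2.imread returns None when a file cannot be read, so the list of images may contain gaps. The sketch below shows one possible way to keep images and labels aligned in that case. It does not call OpenCV; fake_imread is a hypothetical stand-in that mimics a failed read, and the 28x28 image size is an assumption.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "ImageName": ["img1.png", "img2.png", "img3.png"],
    "Label": [7, 2, 1],
})

def fake_imread(name):
    """Hypothetical stand-in for cv2.imread; fails for one file."""
    if name == "img2.png":
        return None  # mimics an unreadable or missing file
    return np.zeros((28, 28), dtype=np.uint8)

loaded = [fake_imread(name) for name in data["ImageName"]]

# Boolean mask marking the rows whose image actually loaded
ok = np.array([img is not None for img in loaded])

# Stack the same-sized images into one 3-D array and filter the labels to match
images = np.stack([img for img in loaded if img is not None])
labels = data["Label"].to_numpy()[ok]

print(images.shape, labels)
```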
Using NumPy’s Financial Functions
Suppose you work at a finance company and have a dataset containing information about several investment options with different annual interest rates and terms.
import pandas as pd

data = pd.read_csv('investments.csv')
print(data.head())
Output:
  Investment_Name  Annual_Interest_Rate  Term_in_Years
0    Investment A                  0.05              5
1    Investment B                  0.06              7
2    Investment C                  0.04              3
3    Investment D                  0.08             10
4    Investment E                  0.07              8
Let’s say you want to calculate the future value of a $1000 investment for each of these options.
You can use the fv function, which takes the rate, number of periods, payment, and present value as inputs and accepts NumPy arrays for each of them. Note that the financial functions were removed from NumPy itself in version 1.20 and now live in the separate numpy-financial package, which you can install with pip install numpy-financial.
import numpy_financial as npf

# Convert pandas DataFrame columns to NumPy arrays
rates = data['Annual_Interest_Rate'].to_numpy()
terms = data['Term_in_Years'].to_numpy()

# Constants
pv = -1000  # negative because it is an outgoing payment
pmt = 0

# Calculate the future value of each investment using numpy-financial's fv function
fv = npf.fv(rates, terms, pmt, pv)

for i, investment in enumerate(data['Investment_Name']):
    print(f"The future value of a $1000 investment in {investment} after {terms[i]} years is ${fv[i]:.2f}")
In this example, converting the Pandas DataFrame columns to NumPy arrays is driven by the array-based inputs that the financial functions work with.
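When there are no periodic payments (pmt = 0), the future value reduces to the compound-interest formula fv = -pv * (1 + rate) ** nper, which can be evaluated on the converted arrays with plain NumPy. The following is a minimal sketch under that assumption, using the first three sample rows from the table above in place of investments.csv.

```python
import numpy as np
import pandas as pd

# First three sample rows from the table above, in place of investments.csv
data = pd.DataFrame({
    "Investment_Name": ["Investment A", "Investment B", "Investment C"],
    "Annual_Interest_Rate": [0.05, 0.06, 0.04],
    "Term_in_Years": [5, 7, 3],
})

rates = data["Annual_Interest_Rate"].to_numpy()
terms = data["Term_in_Years"].to_numpy()

pv = -1000  # negative because it is an outgoing payment

# Compound interest with no periodic payments, vectorized over all rows
fv = -pv * (1 + rates) ** terms

for name, value in zip(data["Investment_Name"], fv):
    print(f"{name}: ${value:.2f}")
```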
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.