Convert Pandas DataFrame to NumPy array using to_numpy()

The DataFrame.to_numpy() function, provided by the Pandas library, offers a straightforward way to transform a DataFrame into a NumPy array.
It returns an ndarray (NumPy’s basic data structure), which can easily be manipulated using various NumPy library functions.

This is especially useful when you want to perform operations that are easier or faster to implement in NumPy compared to Pandas.



Why Convert from Pandas DataFrame to NumPy?

There are several reasons why you might want to convert a Pandas DataFrame to a NumPy array.

When it comes to numerical or mathematical operations, NumPy is often more efficient because its arrays store elements of a single data type in a contiguous block of memory, which allows fast vectorized computation.
Additionally, many machine learning libraries, such as Scikit-learn, are built around NumPy arrays and work with them natively.
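
As a rough illustration of the kind of whole-array arithmetic NumPy is designed for, the sketch below standardizes each column of a small DataFrame in one vectorized expression. The data and the column names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    'height_cm': [150.0, 160.0, 170.0],
    'weight_kg': [50.0, 60.0, 70.0],
})

values = df.to_numpy()

# Subtract each column's mean and divide by its standard deviation.
# Broadcasting applies the per-column statistics to every row.
standardized = (values - values.mean(axis=0)) / values.std(axis=0)

print(standardized.shape)
```

After this step every column of standardized has mean 0 and standard deviation 1.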

Later in this tutorial, you’ll see some practical uses and real-world examples of converting DataFrame into a NumPy array.


Syntax and parameters

The DataFrame.to_numpy() function is straightforward. Its syntax is as follows:

DataFrame.to_numpy(dtype=None, copy=False, na_value=<no_default>)


  • dtype is an optional parameter specifying the desired data type for the array. If not provided, Pandas picks a common dtype that can hold the values of all the DataFrame’s columns.
  • copy is a boolean flag which, when set to True, ensures that the returned array is a copy of the DataFrame’s data. The default value is False, which allows (but does not guarantee) that the array is a view on the DataFrame’s data.
  • na_value defines the value used to represent missing values in the array. If not specified, the default depends on dtype and on the dtypes of the DataFrame’s columns.
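
The three parameters can be tried side by side. The following is a minimal sketch on a small made-up DataFrame, not an exhaustive tour of the options:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one integer column and one float column with a gap
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, np.nan, 6.0]})

# dtype: request a specific element type for the whole array
as_float32 = df.to_numpy(dtype='float32')

# copy=True: the result is guaranteed not to share memory with df
independent = df.to_numpy(copy=True)
independent[0, 0] = 100  # does not change df

# na_value: replace missing entries while converting
filled = df.to_numpy(na_value=0.0)

print(as_float32.dtype)
print(df.loc[0, 'A'])
print(filled)
```

Use copy=True whenever you plan to modify the array in place and want the DataFrame to stay untouched.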


Convert Pandas DataFrame to NumPy array

To convert a Pandas DataFrame into a NumPy array, follow these steps.
Step 1: Import the required libraries
The first step involves importing Pandas and NumPy.

import pandas as pd
import numpy as np

Step 2: Create a DataFrame
Next, create a DataFrame:

df = pd.DataFrame({
    'A': [1.5, 2.3, 3.1],
    'B': [4.2, 5.8, 6.7]
})

You can create your DataFrame by loading the data using any of the following ways:

  • Read CSV using Python Pandas read_csv
  • Read JSON files using Python Pandas read_json
  • Read SQL Query/Table into DataFrame using Pandas read_sql
  • Read HTML tables using Pandas read_html function
  • Read Parquet files using Pandas read_parquet

Step 3: Convert DataFrame to NumPy Array
Now, convert the DataFrame to a NumPy array using the DataFrame.to_numpy() function. You can optionally specify the data type. In this case, let’s convert the data to ‘float64’.

array = df.to_numpy(dtype='float64')

Step 4: Print or process the resulting array
Lastly, display the resulting NumPy array to confirm the conversion. Evaluating array in an interactive session shows:



array([[1.5, 4.2],
       [2.3, 5.8],
       [3.1, 6.7]])

We have now successfully converted a Pandas DataFrame into a NumPy array of type ‘float64’.

Each row from the DataFrame corresponds to a row in the ndarray, preserving the original structure.
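
One way to check that the structure is preserved is to compare shapes and individual rows. This is a small sketch that reuses the example DataFrame from the steps above:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1.5, 2.3, 3.1],
    'B': [4.2, 5.8, 6.7]
})
array = df.to_numpy(dtype='float64')

# One array row per DataFrame row, one array column per DataFrame column
print(array.shape == df.shape)

# Row i of the array holds the same values as row i of the DataFrame
print(array[0])
print(df.iloc[0].tolist())
```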


Data Type Handling

The DataFrame.to_numpy() function provides a dtype argument that lets you specify the desired output data type.
If it is not specified, to_numpy() chooses a common data type that can represent every column.

However, this can sometimes lead to a dtype of ‘object’ if the DataFrame contains mixed data types, which might not be desirable, especially when you aim to perform mathematical operations on the resulting array.
Here is an example of specifying the dtype during conversion:

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})
array = df.to_numpy(dtype='str')


array([['1', 'a'],
       ['2', 'b'],
       ['3', 'c']], dtype='<U1')

Here, even though column ‘A’ contains integers, we’ve successfully converted the entire DataFrame into a NumPy array of dtype ‘str’.
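
If the goal is arithmetic rather than text, one possible approach is to keep only the numeric columns before converting. The sketch below uses select_dtypes for that; which columns to keep is an assumption about your data, not a rule.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': ['a', 'b', 'c']
})

# Without dtype, a mix of integers and strings falls back to object
mixed = df.to_numpy()
print(mixed.dtype)

# Keeping only the numeric columns gives an array suited to arithmetic
numeric = df.select_dtypes(include='number').to_numpy()
print(numeric.dtype.kind)
print(numeric.sum())
```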


Preserving Metadata

One important thing to remember when converting a Pandas DataFrame to a NumPy array is that metadata such as column names and indices are not preserved in the resulting NumPy array, as it’s a lower-level data structure.
However, you can separately store the column names in a variable before the conversion, like this:

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# store column names
column_names = df.columns.tolist()
array = df.to_numpy()
print("Column Names:", column_names)
print("Array:\n", array)


Column Names: ['A', 'B']
Array:
 [[1 4]
 [2 5]
 [3 6]]

In the output, we have the column names preserved in the column_names list, and the data converted to a NumPy array.
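
The saved labels can also be reattached after the NumPy-side work is done. The sketch below stores the index as well as the column names and rebuilds a DataFrame from the processed array. The row labels and the doubling step are hypothetical placeholders.

```python
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3], 'B': [4, 5, 6]},
    index=['x', 'y', 'z'],  # hypothetical row labels
)

# Save the labels before they are dropped by the conversion
column_names = df.columns.tolist()
index_labels = df.index.tolist()

array = df.to_numpy()
doubled = array * 2  # stand-in for some NumPy-side processing

# Reattach the saved labels to the processed values
restored = pd.DataFrame(doubled, columns=column_names, index=index_labels)
print(restored)
```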


How to Handle Missing Values?

When you convert a DataFrame with DataFrame.to_numpy(), missing values are kept by default, using the missing-value marker of the resulting dtype, which for floating-point data is np.nan.
If you want to fill missing values with a specific value, you can use the na_value parameter of the to_numpy() function, as shown below:

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, 6]
})

# convert dataframe to numpy array filling NaN with -1
array = df.to_numpy(na_value=-1)


array([[ 1.,  4.],
       [ 2., -1.],
       [-1.,  6.]])

In this example, we used the na_value parameter to replace all NaNs with -1 in the resulting ndarray.
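
For comparison, the sketch below shows the default behavior (NaNs stay in the array) and an alternative that fills the gaps in Pandas with fillna() before converting. For this float data both routes give the same values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan],
    'B': [4, np.nan, 6]
})

# Default conversion: missing entries remain NaN in the float array
default_array = df.to_numpy()
missing_count = int(np.isnan(default_array).sum())
print(missing_count)

# Filling in Pandas first gives the same values as na_value=-1
filled_first = df.fillna(-1).to_numpy()
filled_during = df.to_numpy(na_value=-1)
print(np.array_equal(filled_first, filled_during))
```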


Practical Examples of Using to_numpy()

There can be instances where you need to convert a DataFrame to a NumPy array for certain operations that are easier or more efficient in NumPy.

Machine Learning with Scikit-Learn

Suppose you are working on a project where you are predicting house prices based on features like the number of bedrooms, size of the house, location, etc.

You can start off using Pandas to handle your data because it provides powerful data manipulation tools and works well with heterogeneously-typed data.

import pandas as pd
data = pd.read_csv('house_prices.csv')

Now, when you want to train a machine learning model on this data using scikit-learn, you can convert the DataFrame to a NumPy array:

# Split the data into features and target
X = data.drop('Price', axis=1) # Features
y = data['Price'] # Target

# Convert DataFrame to NumPy array
X = X.to_numpy()
y = y.to_numpy()

Now you can use this data for training and testing in a scikit-learn model:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the model and fit it to the training data
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

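In this example, converting the Pandas DataFrame to NumPy arrays gives scikit-learn the plain numeric arrays its estimators operate on. Recent scikit-learn versions also accept DataFrames directly, so the explicit conversion is optional there.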
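
To see what the converted arrays make possible without any extra dependency, here is a NumPy-only sketch of the same kind of linear fit. It swaps LinearRegression for np.linalg.lstsq (ordinary least squares) and uses made-up data in place of house_prices.csv, so the column names and prices are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for house_prices.csv
data = pd.DataFrame({
    'Bedrooms': [2, 3, 4, 3, 5],
    'Size': [80.0, 120.0, 150.0, 100.0, 200.0],
})
# Prices generated from a known linear rule so the fit can be checked
data['Price'] = 50_000 + 10_000 * data['Bedrooms'] + 1_000 * data['Size']

X = data.drop('Price', axis=1).to_numpy(dtype='float64')
y = data['Price'].to_numpy(dtype='float64')

# Append a column of ones so the model can learn an intercept
X_design = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: coefficients for Bedrooms, Size, then the intercept
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(coef, 2))
```

Because the prices follow an exact linear rule, the fitted coefficients recover the values used to generate them.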

Image Processing with OpenCV

Suppose you’re working on a computer vision project where you’re classifying images of hand-written digits. The labels (the digits) and the image file names are stored in a CSV file.

import pandas as pd
data = pd.read_csv('image_labels.csv')


   ImageName  Label
0   img1.png      7
1   img2.png      2
2   img3.png      1
3   img4.png      0
4   img5.png      4

You have to load the images for further processing with OpenCV. To do this, you’d need to convert the relevant DataFrame columns to NumPy arrays.

import cv2
import numpy as np

images = [cv2.imread(f'images/{name}', cv2.IMREAD_GRAYSCALE) for name in data['ImageName']]

# Convert list to NumPy array
images = np.array(images)

# Similarly, convert labels to NumPy array
labels = data['Label'].to_numpy()

This example illustrates how you start with Pandas for handling and inspecting data, but then need to convert to a NumPy array for image processing.
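
Combining a list of images into one array only works cleanly when every image has the same shape. The sketch below shows the resulting batch layout using synthetic arrays in place of files read with OpenCV. The 28x28 size and the pixel values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical labels table; the images here are synthetic stand-ins
data = pd.DataFrame({
    'ImageName': ['img1.png', 'img2.png', 'img3.png'],
    'Label': [7, 2, 1],
})

# Pretend each file was read as a 28x28 grayscale image
images = [np.full((28, 28), i, dtype=np.uint8) for i in range(len(data))]

# np.stack requires every image to have the same shape
batch = np.stack(images)
labels = data['Label'].to_numpy()

print(batch.shape)
print(labels.shape)
```

The first axis of batch lines up with labels, so batch[i] and labels[i] describe the same image.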

Using NumPy’s Financial Functions

Suppose you work at a finance company and have a dataset containing information about several investment options with different annual interest rates and terms.

import pandas as pd
data = pd.read_csv('investments.csv')


   Investment_Name  Annual_Interest_Rate  Term_in_Years
0     Investment A                  0.05              5
1     Investment B                  0.06              7
2     Investment C                  0.04              3
3     Investment D                  0.08             10
4     Investment E                  0.07              8

Let’s say you want to calculate the future value of a $1000 investment for each of these options.

You can use the fv function, which takes the rate, number of periods, payment, and present value as inputs and accepts NumPy arrays for each. Note that np.fv was deprecated in NumPy 1.18 and removed in NumPy 1.20; the financial functions now live in the separate numpy-financial package (pip install numpy-financial).

import numpy_financial as npf

# Convert pandas DataFrame columns to NumPy arrays
rates = data['Annual_Interest_Rate'].to_numpy()
terms = data['Term_in_Years'].to_numpy()

# Constants
pv = -1000 # (it's negative as it's an outgoing payment)
pmt = 0

# Calculate future value of investment using the fv function from numpy-financial
fv = npf.fv(rates, terms, pmt, pv)

for i, investment in enumerate(data['Investment_Name']):
    print(f"The future value of a $1000 investment in {investment} after {terms[i]} years is ${fv[i]:.2f}")

In this example, the conversion from DataFrame columns to NumPy arrays is driven by the financial functions, which operate on arrays.
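
When there are no periodic payments, the future value reduces to plain compound interest, FV = PV × (1 + r)^n, which can be computed with NumPy arithmetic alone. The sketch below rebuilds the table shown above as an in-memory DataFrame (a stand-in for investments.csv) and applies that formula to all rows at once.

```python
import numpy as np
import pandas as pd

# Stand-in for investments.csv, using the rows shown above
data = pd.DataFrame({
    'Investment_Name': ['Investment A', 'Investment B', 'Investment C',
                        'Investment D', 'Investment E'],
    'Annual_Interest_Rate': [0.05, 0.06, 0.04, 0.08, 0.07],
    'Term_in_Years': [5, 7, 3, 10, 8],
})

rates = data['Annual_Interest_Rate'].to_numpy()
terms = data['Term_in_Years'].to_numpy()

# Compound interest with no periodic payments, applied element-wise
principal = 1000
future_values = principal * (1 + rates) ** terms

for name, value in zip(data['Investment_Name'], future_values):
    print(f"{name}: ${value:.2f}")
```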
