Python

Python pandas tutorial: Getting started with DataFrames

Pandas is an open source Python library which provides data analysis and manipulation in Python programming.

It’s a very promising library in data representation, filtering, and statistical programming. The most important piece in pandas is the DataFrame where you store and play with the data.


In this tutorial, you will learn what is the DataFrame, how to create it from different sources, how to export it to different outputs, and how to manipulate its data.

 

Install pandas

You can install pandas in Python by using pip. Run the following command in cmd:

pip install pandas

install pandas

Also, you can install pandas using conda like this:

conda install pandas

 

Read Excel file

You can read from an Excel file using the read_excel() method of pandas. For this, you need to import one more module called xlrd.

Install xlrd using pip:

pip install xlrd

install xlrd

The example below demonstrates how to read from an Excel sheet:

  1. We created an Excel sheet with the following contents:
    Excel sheet
  2. Import the pandas module:

import pandas

  1. We will pass the name of the Excel file and the sheet number we need to read data from to the read_excel() method.

pandas.read_excel('pandasExcel.xlsx', 'Sheet1')

The above snippet will generate the following output:

Read Excel file

If you check the type of the output using type keyword, it will give you the following result:

<class 'pandas.core.frame.DataFrame'>

It’s called a DataFrame! That is the basic unit of pandas that we are going to deal with till the end of the tutorial.

The DataFrame is a labeled 2 Dimensional structure where we can store data of different types. DataFrame is similar to a SQL table or an Excel spreadsheet.

 


Import CSV file

To read from a CSV file, you can use the read_csv() method of pandas.

Import the pandas module:

import pandas

Now call the read_csv() method as follows:

pandas.read_csv('Book1.csv')

Book1.csv has the following contents:

CSV file

The code will generate the following DataFrame:

Import CSV file

Read text file

We can also use the read_csv method of pandas to read from a text file; consider the following example:


The myFile.txt looks like the following:

Text file

The output of the above code will be:

Read text file

This text file is treated as a CSV file because we have comma separated elements. The file also may use another delimiter such as a semicolon, tab, etc.

Suppose we have a tab delimiter and the file looks like this:

Tab delimited file

When the delimiter is a tab, we will have the following output:

Read text file with tab delimiter

Since pandas has no idea about the delimiter, it translates the tab to \t.

To define the tab character as a delimiter, pass the delimiter argument like this:

pandas.read_csv('myFile.txt', delimiter='\t')

Now the output will be:

Specify tab delimiter

Looks correct now.

 


Read SQL

You can use the read_sql() method of pandas to read from an SQL database. This is demonstrated in the example below:


In this example, we connected to a SQLite3 database that has a table named “Employee”. Using the read_sql() method of pandas, then we passed a query and a connection object to the read_sql() method. The query fetches all the data in the table.

Our Employee table looks like the following:

SQLite3 table

When you run the above code, the output will be like the following:

Read SQL using pandas

 

Select columns

Suppose we have three columns in the Employee table like this:

SQLite3 table columns

To select columns from the table, we will pass the following query:

select Name, Job from Employee

The pandas code statement will be as follows:

pandas.read_sql('select Name, Job from Employee', con)

Use SQL to select data

We can also select a column from a table by accessing the data frame a. Consider the following example:


The result will be as follows:

Get columns

 

Select rows by value

First, we will create a DataFrame from which we will select rows.

To create a DataFrame, consider the code below:


In this code, we created a DataFrame with three columns and three rows using the DataFrame() method of pandas. The result will be the following:

Create a dataframe

To select a row based on value, run the following statement:

df.loc[df['name'] == 'Jason']

df.loc[] or DataFrame.loc[] is a Boolean array that can be used to access rows or columns by values or labels. In the above code, the row will be fetched where the name equals Jason.

The output will be:

Select rows by value from a dataframe

 


Select row by index

To select a row by its index, we can either use the slicing (:) operator or the df.loc[] array.

Consider the code below:


We created a DataFrame. Now let’s access a row using df.loc[]:

>>> df.loc[1]

Get dataframe row

As you can see, one row is fetched. We can do the same using the slicing operator as follows:

>>> df[1:2]

Dataframe slicing operator

 

Change column type

The Data type of a column can be changed by using astype() attribute of DataFrame. To check the data type of columns, we use the dtypes attribute of DataFrame.

>>> df.dtypes

The output will be:

Change pandas dataframe column type

Now to convert the data type from one to another:

>>> df.name = df.name.astype(str)

We fetched the column ‘name’ from our DataFrame and changed its data type from object to string.

Apply a function to columns/rows

To apply a function on a column or a row, you can use the apply() method of DataFrame.

Consider the following example:


We created a DataFrame and added values of integer type in the rows. To apply a function for example square root on the values, we will import the numpy module to use the sqrt function from it like this:


The output will be as follows:

Apply a function to columns

To apply the sum function, the code will be:

>>> df.apply(np.sum)

Apply sum function

To apply the function to a specific column, you can specify the column like this:

>>>df['A'].apply(np.sqrt)

 

Sort Values/sort by column

To sort values in a DataFrame, use the sort_values() method of the DataFrame.

Create a DataFrame with integer values:


Now to sort the values:

>>> df.sort_values(by=['A'])

The output will be:

Sort values

The sort_values() method has an attribute ‘by’ which is necessary. In the code above, the values are sorted by column A. To sort by multiple columns, the code will be:

>>> df.sort_values(by=['A', 'B'])

If you want to sort in descending order, set ascending attributed of set_values to False as follows:

>>> df.sort_values(by=['A'], ascending=False)

The output will be:

Sort in descending order

 


Drop/Remove duplicates

To drop duplicate rows from a DataFrame, use the drop_duplicates() method of the DataFrame.

Consider the following example:


Duplicated dataframe

Here we created a DataFrame with a duplicate row. To check if any duplicate rows are present in the DataFrame, use the duplicated() method of the DataFrame.

>>> df.duplicated()

The result will be:

Duplicate result

It can be seen that the last row is a duplicate. To drop or remove this row, run the following line of code:

>>> df.drop_duplicates()

Now the result will be:

Remove duplicates

 

Drop duplicates by column

Sometimes, we have data where the column values are the same and we wish to delete them. We can drop a row by column by passing the name of the column we need to delete.

For example, we have the following DataFrame:


Pandas dataframe with duplicate data

Here you can see that Jason is two times. If you want to remove duplicate by column, just pass the column name as follows:

>>> df.drop_duplicates(['name'])

The result will be like the following:

Drop duplicates by column

Delete a column

To delete an entire column or row, we can use the drop() method of the DataFrame by specifying the name of the column or row.

Consider the following example:

>>> df.drop(['job'], axis=1)

In this line of code, we are deleting the column named ‘job’. The axis argument is necessary here. If the axis value is 1 it means we want to delete columns, if axis value is 0 it means that row will be deleted. In axis values, 0 is for index and 1 is for columns.

The result will be:

Drop a dataframe column

 

Delete rows

We can use the drop() method to drop or delete a row by passing the index of the row.

Suppose we have the following DataFrame:


To drop a row with index 0 where the name is James, age is 18 and the job is Assistant, use the following code:

>>> df.drop([0])

Delete a row by index

Let’s create a DataFrame where the indexes are the names:


Non-labled dataframe

Now we can delete a row with a certain value. For example, if we want to delete a row where the name is Rogers, then the code will be:

>>> df.drop(['Rogers'])

The output will be:

Delete a row by value

You can also delete a range of row as:

>>> df.drop(df.index[[0, 1]])

This will delete rows from index 0 to 1 and one row left only since our DataFrame composed of 3 rows:

Delete a range of rows

If you want to delete the last row from the DataFrame and do not know what is the total number of rows then you can use the negative indexing as below:

>>> df.drop(df.index[-1])

-1 deletes the last row. Similarly -2 will delete last 2 rows and so on.

Drop the last row

 


Sum a column

You can use the sum() method of the DataFrame to sum the column items.

Suppose we have the following DataFrame:


Now to sum the items of column A, use the following line of code:

>>> df['A'].sum()

Sum a column

You can also use the apply() method of the DataFrame and pass in the sum method of numpy to sum the values.

Count unique values

To count unique values in a column, you can use the nunique() method of the DataFrame.

Suppose we have DataFrame as below:


To count the unique values in column A:

>>> df['A'].nunique()

Count unique values

As you can see, column A has only 2 unique values 23 and 12 and another 12 is a duplicate that’s why we have 2 in the output.

If you want to count all the values in a column, you can use the count() method as follows:

>>> df['A'].count()

Count column values

 

Subset rows

To select a subset of a DataFrame, You can use the square brackets.

For example, we have a DataFrame that contains some integers. We can select or subset a row like this:

df.[start:count]

The start point will be included in the subset but stop point is not included. For example, to select 3 rows starting from the first row, you will write:

>>> df[0:3]

The output will be:

Select subset of rows

That code means start from the first row which is 0 and select 3 rows.

Similarly, to select the first 2 rows, you will write:

>>> df[0:2]

Select 2 rows from a dataframe

To select or subset the last row, use the negative indexing as:

>>> df[-1:]

Select last row

Write to Excel

To write a DataFrame to an Excel sheet, we can use the to_excel() method.

To write to an Excel sheet, you have to open the sheet and to open an Excel sheet we will have to import openpyxl module.

Install openpyxl using pip:

pip install openpyxl

Install openpyxl

Consider the following example:


The Excel file will look like the following:

Write to Excel

 

Write to CSV

Similarly, to write a DataFrame to CSV, you can use the to_csv() method as in the following line of code.

>>> df.to_csv("pandasCSV.csv")

The output file will be like the following:

Write to CSV

 

Write to SQL

To write data to SQL, we can use the to_sql() method.

Consider the following example:


In this code, we created a connection with the sqlite3 database. Then we created a DataFrame with three rows and three columns.

Finally, we used the to_sql method of our DataFrame (df) and passed the name of the table where the data will be stored along with the connection object.

The SQL database will look like the following:

Write to SQL

 

Write to JSON

You can use the to_json() method of the DataFrame to write to a JSON file.

This is demonstrated in the example below:

>>> df.to_json("myJson.json")

In this line of code, the name of the JSON file is passed as an argument. The DataFrame will be stored in the JSON file. The file will have the following content:

Write to JSON

 

Write to HTML file

You can use the to_html() method of the DataFrame to create an HTML file with the DataFrame content.

Consider the following example:

>>> df.to_html("myhtml.html")

The result file will have the following content:

Write to html

When you open the HTML file in the browser, it will look like the following:

Html file output

Working with pandas is very easy. It’s like working with Excel sheets! pandas DataFrame is a very flexible piece library you can ever use.

I hope you find the tutorial useful. Keep coming back.

Mokhtar Ebrahim
I'm working as a Linux system administrator since 2010. I'm responsible for maintaining, securing, and troubleshooting Linux servers for multiple clients around the world. I love writing shell and Python scripts to automate my work.

Leave a Reply

Your email address will not be published. Required fields are marked *