
Python Pandas DataFrame tutorial

Data analysis has become an essential part of running a business. Because decisions increasingly depend on data, it is important to get that data into a form that can be analyzed in detail.

Pandas is an open-source Python library for performing various data analysis tasks. It helps in handling structured data through its well-known data structures. In this article, we will discuss DataFrames and how to use them in our program.

 

 

Why do we use Pandas?

Pandas is a prominent library for analyzing and handling large, complex structured data through DataFrames.

It helps with filtering, cleansing, and arranging data in a meaningful order. Messy real-world data can be sorted and cleaned using Pandas. Hence, it is one of the essential early-stage tools for data handling.

 

1D and 2D data structures in Pandas

There are two prevalent data structures that Pandas comes with for data analysis: Series and DataFrame. A Series is a one-dimensional labeled array that can hold data of any type (integer, string, float, Python objects, etc.), while a DataFrame is two-dimensional and size-mutable.

These data structures define how the data will get arranged for analysis through Pandas. Let us understand DataFrame in detail in this article.
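
To make the difference concrete, here is a minimal sketch that builds one of each structure and checks its number of dimensions. The variable names and sample values are made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, with an index label for each value
prices = pd.Series([50, 500, 5000], index=["a", "b", "c"])

# A DataFrame: two-dimensional, with row labels and column labels
menu = pd.DataFrame({"Price": [50, 500, 5000],
                     "Item": ["French Fry", "Pizza", "Chocolate Shake"]})

print(prices.ndim)  # 1
print(menu.ndim)    # 2

# Selecting a single column of a DataFrame gives back a Series
print(type(menu["Price"]))
```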

 

What is DataFrame?

A DataFrame is a size-mutable, two-dimensional data structure that can store heterogeneous values with labeled axes (rows and columns).

This two-dimensional data structure remains aligned in a tabular fashion through rows and columns like that of a spreadsheet or database table.

We can create DataFrames using Python list, dictionary, Pandas Series, NumPy array, files (CSV), etc. Let us now check how to create a DataFrame using all of these.

To create a DataFrame, we import the pandas module and call the pandas.DataFrame() constructor.

 

Creating a Pandas DataFrame from a Python List

We can use a nested list to create a DataFrame through Pandas. Each inner list will represent a row for the DataFrame. Here is a code snippet showing how to create a DataFrame using a list.

import pandas as pd
list1 = [
            [50, "French Fry", "Not spicy"],
            [500, "Pizza", "With Oregano"],
            [5000, "Chocolate Shake", "with choco crisps"]
            ]
df1 = pd.DataFrame(list1)
print(df1)

Output

This output shows how to create Pandas dataframe using list in Python
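
A DataFrame built from a nested list gets the default column labels 0, 1, 2 unless we name the columns ourselves. The sketch below shows both cases; the column names Price, Item, and Note are hypothetical and chosen only for this example.

```python
import pandas as pd

rows = [
    [50, "French Fry", "Not spicy"],
    [500, "Pizza", "With Oregano"],
    [5000, "Chocolate Shake", "with choco crisps"],
]

# Without `columns`, pandas labels the columns 0, 1, 2
df_default = pd.DataFrame(rows)
print(list(df_default.columns))  # [0, 1, 2]

# Hypothetical column names chosen for this example
df_named = pd.DataFrame(rows, columns=["Price", "Item", "Note"])
print(df_named)
```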

Creating a Pandas DataFrame from Dictionaries

We can also create a DataFrame using the Python dictionary. In this case, within the dictionary, we can provide the list of values where each list will represent the column values.

Here is a code snippet showing how to create a DataFrame using a dictionary containing lists.

import pandas as pd
dictDat = {
    'Scoville' : [50, 500, 5000],
    'Name' : ["French Fry", "Pizza", "Chocolate Shake"],
    'Feeling' : ["Not spicy", "With Oregano", "with choco crisps"]
}
df1 = pd.DataFrame(dictDat)
print(df1)

Output

This output shows how to create Pandas dataframe using dictionary in Python

Creating a Pandas DataFrame from NumPy Arrays

We can also use the NumPy array (ndarray) to create a DataFrame. Since NumPy is also a part of data analysis and a popular module for data science, data analysts and professionals prefer to use this means for creating DataFrames. We have to import the NumPy module and create a NumPy array first.

Then, we can call pandas.DataFrame() and pass the NumPy array to convert it to a DataFrame. Here is a code snippet showing how to create a DataFrame using a NumPy array.

import numpy as np
import pandas as pd
arry = np.array([[1, 2, 100],[2, 4, 100], [3, 8, 100]])
df1 = pd.DataFrame(arry, columns = ['a', 'b', 'c'])
print(df1)

Output

This output shows how to create Pandas dataframe using NumPy arrays in Python

 

Creating a Pandas DataFrame from Series

It is possible to create DataFrame from multiple Series. We have to create the Series from lists and then use those Series as dictionary values.

The dictionary object will then be passed to the pandas.DataFrame() for converting it to a DataFrame. Here is a code snippet showing how to implement it.

import pandas as pd
researcher = ['Karlos', 'Ray', 'Ramanujan', 'Dee']
papers = [211, 118, 97, 162]
# Creating two Series from lists
r_series = pd.Series(researcher)
paper_research = pd.Series(papers)
# Creating a dictionary through pandas Series objects as values
dframe = {'Researchers': r_series, 'Papers': paper_research}
# Creating DataFrame with the help of dictionary
result = pd.DataFrame(dframe)
print(result)

Output

This output shows how to create Pandas dataframe using series in Python

Creating a Pandas DataFrame from Files

Pandas supports different types of files for creating a DataFrame. Commonly used formats include CSV, JSON, Excel (XLS), and HTML.

The functions used to read them are read_csv(), read_json(), read_excel(), and read_html(), respectively. The most common file type is the CSV file.

Here is a code snippet showing how to load a file and display its data as a DataFrame.

import pandas as pd
pepperDataFrame = pd.read_csv('busses_emp.csv')
print(pepperDataFrame)

Output

This output shows how to create Pandas dataframe using a CSV file in Python
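
If you do not have a CSV file at hand, you can try read_csv() on in-memory text. The following is a minimal sketch under that assumption: the data is invented, io.StringIO stands in for a file on disk, and the sep parameter is set because this sample uses a pipe as the delimiter.

```python
import io
import pandas as pd

# Stand-in for a file on disk; the contents are invented for this example
csv_text = "name|route|age\nAsha|12A|34\nBen|7C|41\n"

# `sep` tells read_csv which delimiter separates the fields
emp_df = pd.read_csv(io.StringIO(csv_text), sep="|")
print(emp_df)
print(emp_df.dtypes)
```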

 

Adding columns to a DataFrame

There are different ways to add a new column to an existing DataFrame. These are:
Method 1: Using a list

We can create a new column by assigning a list to a new column label on the existing DataFrame. If the number of elements in the list does not match the number of rows, pandas raises an error. Here is a code snippet showing how to implement it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
# appending a new column name Designation with the values given as list.
designation = ['Tech Content Writer', 'Data Analyst', 'Software Developer', 'Cloud Architect']
df['Designation'] = designation
print(df)

Output

This output shows how to add columns in Pandas dataframe in Python
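
The length requirement mentioned for Method 1 can be checked with a short sketch. It assumes a four-row DataFrame and deliberately passes only two values, so pandas raises a ValueError and the column is not added.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Karl", "Iris", "Ray", "Dee"]})

try:
    # Only two values for a four-row DataFrame
    df["Designation"] = ["Tech Content Writer", "Data Analyst"]
except ValueError as err:
    error_message = str(err)
    print("Could not add column:", error_message)

# The DataFrame is unchanged
print(list(df.columns))
```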

Method 2: Using the insert() method

The insert() method adds a new column at a specific position in an existing DataFrame. Here is a code snippet showing how to implement it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
# Insert the new column at position 2 (before 'Qualification')
df.insert(2, "Designation", ['Tech Content Writer', 'Data Analyst', 'Software Developer', 'Cloud Architect'], allow_duplicates=True)
print(df)

Output

This output shows how to add columns in Pandas dataframe using insert() method in Python

Method 3: Using the assign() method

The assign() method returns a new DataFrame that contains the original columns plus the new one; the original DataFrame stays unchanged. Here is a code snippet showing how to implement it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
df2 = df.assign(Location=['Noida', 'Bangalore', 'Hyderabad', 'Jaipur'])
print(df2)

Output

This output shows how to add columns in Pandas dataframe using assign() method in Python

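Method 4: Using a dictionary

We can use a dictionary to look up the value of the new column for each row. Here, the dictionary maps each name to a location, and map() applies it to the Name column. Here is a code snippet showing how to implement it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
# Dictionary keyed by the values of the Name column.
# Assigning a dictionary directly to df['Location'] may behave differently
# across pandas versions, so we map it through the Name column instead.
location = {'Karl': 'Noida', 'Iris': 'Bangalore',
            'Ray': 'Hyderabad', 'Dee': 'Jaipur'}
df['Location'] = df['Name'].map(location)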
print(df)

Output

This output shows how to add columns in Pandas dataframe using dictionary in Python

 

Deleting Columns from a DataFrame

There are different ways to delete a column from a DataFrame. These are:
Method 1: Using the drop() method

We can use the DataFrame’s drop() method to remove columns by passing the column labels together with axis=1. Here is a code snippet showing how to implement it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
# Add a Location column from a list with one value per row
df['Location'] = ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']
print(df)
df.drop('Height', inplace = True, axis = 1)
print(df)

Output

This output shows how to delete columns in Pandas dataframe using drop() method in Python

Method 2: Using the del keyword

We can use the del keyword to delete a column. In this technique, we have to specify the DataFrame name, along with the column name (label). Here is a code snippet showing how to implement it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location':['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
print(df)
del df['Qualification']
print(df)

Output

This output shows how to delete columns in Pandas dataframe using del keyword in Python

Method 3: Using the pop() method

The pop() method removes a column from the DataFrame and returns the removed column. Here is a code snippet showing how to implement it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location':['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
print(df)
df.pop('Height')
print(df)

Output

This output shows how to delete columns in Pandas dataframe using pop() method in Python
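
The methods above differ in what they return and whether they change the original DataFrame. The following minimal sketch contrasts drop() without inplace=True, which returns a new DataFrame, with pop(), which changes the DataFrame and returns the removed column. The variable names trimmed and heights are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Karl", "Iris", "Ray", "Dee"],
                   "Height": [6.1, 6.0, 5.9, 5.6],
                   "Qualification": ["MS", "MBA", "MCom", "MCA"]})

# drop() without inplace=True leaves df untouched and returns a new DataFrame
trimmed = df.drop(columns=["Qualification"])
print(list(df.columns))       # still has all three columns
print(list(trimmed.columns))  # ['Name', 'Height']

# pop() removes the column from df and hands it back as a Series
heights = df.pop("Height")
print(heights.tolist())       # [6.1, 6.0, 5.9, 5.6]
print(list(df.columns))       # ['Name', 'Qualification']
```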

 

Iterating Over a Pandas DataFrame

It is essential to know how to iterate over the items of DataFrame. There are different methods and ways to loop over a Pandas DataFrame.

Each of them has a different mechanism for displaying or extracting the data, so you can pick whichever suits the task.

Method 1: Using a normal for loop

Iterating directly over a DataFrame yields its column labels, much like iterating over the keys of a dictionary. The code will look something like this.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location':['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for col_name in df:
    print(type(col_name))
    print(col_name)
    print('--------\n')

Output

This output shows how to iterate over Pandas dataframe in Python

Method 2: Using the items() method

The items() method of the DataFrame returns an iterator of (column name, Series) pairs. Using it, we can fetch every column name with its values in a for loop. Here is the code showing how to use it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location':['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for col_name, dat in df.items():
    print("column name:", col_name, "\n", dat)

Output

This output shows how to iterate over Pandas dataframe in Python

Method 3: Using the iterrows() method

The iterrows() method loops over the DataFrame row by row and yields each row as an (index, Series) pair. We can use it to fetch the index label and the data of each row. Here is the code showing how to use it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location':['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for g, r in df.iterrows():
    print(f"Index Value ->: {g}")
    print(f"{r}\n")

Output

This output shows how to iterate over Pandas dataframe in Python

Method 4: Using the itertuples() method

This method yields one named tuple per row. The first field of each tuple is the row's index label, followed by the values of that row. Here is the code showing how to use it.

import pandas as pd
d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
        'Height': [6.1, 6.0, 5.9, 5.6],
        'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location':['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for r in df.itertuples():
    print(type(r))
    print(r)
    print('--------')

Output

This output shows how to iterate over Pandas dataframe in Python
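
Because itertuples() yields named tuples, the values of a row can be read as attributes. Here is a minimal sketch, assuming column names that are valid Python identifiers; the variables tallest_name and tallest_height are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Karl", "Iris", "Ray", "Dee"],
                   "Height": [6.1, 6.0, 5.9, 5.6]})

tallest_name = None
tallest_height = float("-inf")

# index=False leaves the row label out of each tuple
for row in df.itertuples(index=False):
    # Columns whose names are valid Python identifiers become attributes
    if row.Height > tallest_height:
        tallest_name, tallest_height = row.Name, row.Height

print(tallest_name, tallest_height)  # Karl 6.1
```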

 

Indexing DataFrames

The pandas.DataFrame() constructor has an index parameter that takes a list of labels, one for each row of the DataFrame.

Here is the code to show how to implement indexing in a DataFrame.

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000],]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
print(df)

Output

This output shows how to index Pandas dataframe in Python
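
Once rows have labels, they can be looked up by label with loc[] or by integer position with iloc[]. The sketch below uses a shortened three-row version of the data above to show both.

```python
import pandas as pd

d = [["Karlos", 1, "CTO", 530000],
     ["Ray", 2, "CMO", 410000],
     ["Sue", 3, "Marketing Head", 370000]]
df = pd.DataFrame(d, columns=["Name", "Emp ID", "Designation", "Monthly Salary"],
                  index=["A", "B", "C"])

# Look up a row by its label
print(df.loc["B"])

# Look up the same row by its integer position
print(df.iloc[1])

# Look up a single cell by row label and column label
print(df.loc["C", "Designation"])  # Marketing Head
```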

 

Slicing DataFrames

Slicing a data structure means dividing it into smaller sub-parts to extract specific values from a wide range of values. Pandas allows different ways to extract sliced data from the DataFrame.

Method 1: Slicing out a single column

To extract a single column from the DataFrame, we can refer to the column label in different ways: through loc[], through square brackets, or as an attribute.

Here is a code snippet showing how to implement it.

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000],]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
# technique number 1
print(df.loc[:,"Designation"])
# technique number 2
print(df["Monthly Salary"])
# technique number 3
print(df.Name)

Output

This output shows how to slice a Pandas dataframe in Python

Method 2: Select and slice out multiple columns

We can slice multiple DataFrame columns by passing a list of column labels to dataframe.loc[] or to square brackets. Here is a code snippet showing how to implement it.

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000],]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
# technique number 1
print(df.loc[:,["Designation", "Monthly Salary"]])
# technique number 2
print(df[["Name", "Monthly Salary"]])

Output

This output shows how to slice a Pandas dataframe in Python
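
Rows can be sliced as well as columns. The following is a minimal sketch using the same employee data: loc[] slices by label and includes the end label, while iloc[] slices by position and excludes the end position.

```python
import pandas as pd

d = [["Karlos", 1, "CTO", 530000],
     ["Ray", 2, "CMO", 410000],
     ["Sue", 3, "Marketing Head", 370000],
     ["Bill", 4, "Security Head", 400000],
     ["Dee", 5, "CFO", 320000],
     ["Lee", 6, "IT Manager", 210000]]
df = pd.DataFrame(d, columns=["Name", "Emp ID", "Designation", "Monthly Salary"],
                  index=["A", "B", "C", "D", "E", "F"])

# Label-based slice: rows B through D (D is included), two columns
rows_b_to_d = df.loc["B":"D", ["Name", "Monthly Salary"]]
print(rows_b_to_d)

# Position-based slice: rows 0 and 1 (position 2 is excluded)
first_two = df.iloc[0:2]
print(first_two)
```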

 

Various Built-in DataFrame Methods

To ease the task of handling data within a DataFrame, Pandas allows us to implement various methods. These are:

  1. read_csv(): This popular function reads data from a CSV (Comma-Separated Values) file into a Pandas DataFrame so that we can analyze it. All you need to do is pass the path of your CSV file as a string. It can also read files that use other delimiters, such as a tab or a pipe (|), when you specify the delimiter through the sep parameter. Here is a code snippet showing how to use it.
    import pandas as pd
    busDf = pd.read_csv('busses_emp.csv')
    print(busDf)

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  2. head(): The head(n) method of a DataFrame returns the first ‘n’ rows of a dataset. If you do not pass any value, it returns the first five rows. To get more or fewer rows, specify the number of rows. Here is a code snippet showing how to use it.
    import pandas as pd
    d = [['Karlos', 1, 'CTO', 530000],
         ['Ray', 2, 'CMO', 410000],
         ['Sue', 3, 'Marketing Head', 370000],
         ['Bill', 4, 'Security Head', 400000],
         ['Dee', 5, 'CFO', 320000],
         ['Lee', 6, 'IT Manager', 210000],]
    df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                      index=['A', 'B', 'C', 'D', 'E', 'F'])
    # technique number 1
    print(df.head())
    print()
    # technique number 2
    print(df.head(3))

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  3. tail(): The tail(n) method of a DataFrame returns the last ‘n’ rows of a dataset. If you do not pass any value, it returns the last five rows. To get more or fewer rows, specify the number of rows. Here is a code snippet showing how to use it.
    import pandas as pd
    d = [['Karlos', 1, 'CTO', 530000],
         ['Ray', 2, 'CMO', 410000],
         ['Sue', 3, 'Marketing Head', 370000],
         ['Bill', 4, 'Security Head', 400000],
         ['Dee', 5, 'CFO', 320000],
         ['Lee', 6, 'IT Manager', 210000],]
    df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                      index=['A', 'B', 'C', 'D', 'E', 'F'])
    # technique number 1
    print(df.tail())
    print()
    # technique number 2
    print(df.tail(2))

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  4. describe(): This method generates descriptive statistics of the data in the DataFrame. With it, we can get a quick overview of the central tendency and dispersion of the numeric columns. Here is an example showing how it represents the basic statistical data.
    import pandas as pd
    d = [['Karlos', 1, 'CTO', 530000],
         ['Ray', 2, 'CMO', 410000],
         ['Sue', 3, 'Marketing Head', 370000],
         ['Bill', 4, 'Security Head', 400000],
         ['Dee', 5, 'CFO', 320000],
         ['Lee', 6, 'IT Manager', 210000],]
    df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                      index=['A', 'B', 'C', 'D', 'E', 'F'])
    print(df.describe())

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  5. memory_usage(): This method returns a pandas Series holding the memory use of each column (in bytes) of the DataFrame. Setting the deep parameter to True makes pandas inspect the contents of object columns (such as strings), which gives a more accurate figure for the memory occupied by every column. Here is an example that shows how to implement it.
    import pandas as pd
    d = [['Karlos', 1, 'CTO', 530000],
         ['Ray', 2, 'CMO', 410000],
         ['Sue', 3, 'Marketing Head', 370000],
         ['Bill', 4, 'Security Head', 400000],
         ['Dee', 5, 'CFO', 320000],
         ['Lee', 6, 'IT Manager', 210000],]
    df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
    print(df.memory_usage(deep = True))

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  6. to_datetime(): This function converts a Python object to datetime format. It can accept lists, floating point values, integers, Series, and DataFrames as its argument. We can use it for datasets that have time series values and dates within them. Here is an example that shows how to implement it.
    import pandas as pd
    d = pd.read_csv("datetimedata.csv")
    d["Date"] = pd.to_datetime(d["Date"])
    print(d.info())
    print(d)

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  7. value_counts(): This method returns a Series containing the number of times each unique value occurs. For example, if you have a dataset with information about 3,000 employees of an organization, value_counts() will identify and return the occurrences of each value in that data structure. When called on a DataFrame, as in the snippet here, it counts each unique row.
    import pandas as pd
    d = [['Karlos', 1, 'CTO', 530000],
         ['Ray', 2, 'CMO', 410000],
         ['Sue', 3, 'Marketing Head', 370000],
         ['Bill', 4, 'Security Head', 400000],
         ['Dee', 5, 'CFO', 320000],
         ['Lee', 6, 'IT Manager', 210000],]
    df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
    print(df.value_counts())

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  8. drop_duplicates(): If you want to remove duplicate rows from a DataFrame, this method can help. Its keep parameter lets you retain the first or the last occurrence of each duplicate. You can also set the inplace and ignore_index parameters. Here is a code snippet showing how to use it.
    import pandas as pd
    df = pd.DataFrame({
        'Brand': ['Zomato', 'Sugar', 'Zomato', 'Tata', 'Tata'],
        'Items': ['Noodles', 'Nail Polish', 'Noodles', 'Car', 'pack']
    })
    print(df.drop_duplicates())

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  9. groupby(): You can use the groupby() method to group DataFrame values by one or more columns & perform mathematical functions on them. It also helps in summarizing data in an easy-to-understand way. Here is a code snippet showing how to use it.
    import pandas as pd
    df = pd.read_csv("nba.csv")
    dat = df.groupby(['Team', 'Age'])
    print(dat.first())

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  10. merge(): This method joins two DataFrame objects on one or more key columns. The keys can share a name or, as in this snippet, be given separately through left_on and right_on. Here is a code snippet showing how to use it.
    import pandas as pd
    df1 = pd.DataFrame({'Company1': ['Item1', 'Item2', 'Item3', 'Item4'],
                        'val': [1, 2, 3, 5] })
    df2 = pd.DataFrame({'Company2': ['Item1', 'Item2', 'Item3', 'Item4'],
                        'val': [5, 6, 7, 8] })
    print(df1.merge(df2, left_on = 'Company1', right_on = 'Company2'))

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  11. sort_values(): As the name suggests, it will sort the DataFrame’s column values in increasing or decreasing order. Here is a code snippet showing how to implement it.
    import pandas as pd
    d = [['Karlos', 1, 'CTO', 530000],
         ['Ray', 2, 'CMO', 410000],
         ['Sue', 3, 'Marketing Head', 370000],
         ['Bill', 4, 'Security Head', 400000],
         ['Dee', 5, 'CFO', 320000],
         ['Lee', 6, 'IT Manager', 210000]]
    df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
    df.sort_values(by = 'Monthly Salary', inplace = True)
    print(df)

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  12. fillna(): Large datasets often contain many NaN (Not a Number) values. This method replaces all the NaN or missing values of the Series or DataFrame with more reasonable values. Here is a code snippet showing how to implement it.
    import pandas as pd
    d = [['Karlos', 1, 'CTO', 530000],
         ['Ray', 2, 'CMO', 410000],
         ['Sue', 3, 'Marketing Head'],
         ['Bill', 4, 'Security Head', 400000],
         ['Dee', 5, 'CFO', 320000],
         ['Lee', 6, 'IT Manager']]
    df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
    df.fillna(230000, inplace = True)
    print(df)

    Output

    This output shows various built-in methods of Pandas dataframe in Python

  13. read_json(): We often use JavaScript Object Notation (JSON) for exchanging data on the web, and application backends commonly return or accept it. If you want to convert JSON data to a DataFrame, Pandas provides the read_json() function to do so. Here is a code snippet showing how to implement it.
    import pandas as pd
    from io import StringIO
    # json string assigned (named json_str to avoid shadowing the built-in str)
    json_str = '{"column 1" : {"rowA":1,"rowB":2,"rowC":3}, "column 2" : {"rowA":"Karl","rowB":"Dee","rowC":"Ray"}}'
    # reading JSON data and converting it to DataFrame;
    # the string is wrapped in StringIO because recent pandas versions
    # deprecate passing a literal JSON string
    df = pd.read_json(StringIO(json_str))
    print(df)

    Output

    This output shows various built-in methods of Pandas dataframe in Python
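
The groupby() snippet in the list above depends on an external nba.csv file. As a self-contained alternative, here is a minimal sketch with invented data that groups by one column and aggregates another. The column names Team and Units are assumptions made for this example.

```python
import pandas as pd

# Invented sample data standing in for a real dataset
sales = pd.DataFrame({
    "Team": ["North", "North", "South", "South", "South"],
    "Units": [10, 15, 7, 3, 20],
})

# Sum and average of Units within each Team
summary = sales.groupby("Team")["Units"].agg(["sum", "mean"])
print(summary)
```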

 

Retrieving Labels from DataFrames

We can retrieve the labels of a DataFrame through two attributes: .index (to fetch the row labels) and .columns (to fetch the column labels).

Here are two code snippets showing how to implement them.

dataframe.index:

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.sort_values(by = 'Monthly Salary', inplace = True)
print(df.index)

Output

This output shows retrieving the row labels of the Pandas dataframe in Python

dataframe.columns:

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.sort_values(by = 'Monthly Salary', inplace = True)
print(df.columns)

Output

This output shows retrieving the column labels of the Pandas dataframe in Python
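
Both attributes return pandas Index objects, which can be converted to plain Python lists when needed. The minimal sketch below uses a three-row version of the data and also shows that sorting reorders the rows while each row keeps its original label.

```python
import pandas as pd

d = [["Karlos", 1, "CTO", 530000],
     ["Ray", 2, "CMO", 410000],
     ["Sue", 3, "Marketing Head", 370000]]
df = pd.DataFrame(d, columns=["Name", "Emp ID", "Designation", "Monthly Salary"])
df.sort_values(by="Monthly Salary", inplace=True)

# Row labels follow the sorted order: lowest salary first
row_labels = df.index.tolist()
print(row_labels)  # [2, 1, 0]

column_labels = df.columns.tolist()
print(column_labels)
```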

 

Pandas DataFrame Size

Often it becomes essential for us to check the size and dimension of the DataFrame.

That is where the attributes .size, .ndim, and .shape help: they return the total number of data values, the number of dimensions, and the overall shape (number of rows and columns) of the DataFrame.

Here is a program that shows how to use it.

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
print(df.size)
print(df.ndim)
print(df.shape)

Output

This output shows the size of the Pandas dataframe in Python

If you want to check the size of the DataFrame in terms of memory usage, you can do so using the memory_usage() method. Here is a program that shows how to use it.

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
print(df.memory_usage())

Output

This output shows the size of the Pandas dataframe in Python
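
The relationship between these attributes can be checked in a few lines. This minimal sketch uses the same six-row, four-column data: size equals rows times columns, and summing the Series returned by memory_usage() gives one total in bytes (the exact number depends on the platform and pandas version, so it is not shown here).

```python
import pandas as pd

d = [["Karlos", 1, "CTO", 530000],
     ["Ray", 2, "CMO", 410000],
     ["Sue", 3, "Marketing Head", 370000],
     ["Bill", 4, "Security Head", 400000],
     ["Dee", 5, "CFO", 320000],
     ["Lee", 6, "IT Manager", 210000]]
df = pd.DataFrame(d, columns=["Name", "Emp ID", "Designation", "Monthly Salary"])

rows, cols = df.shape
print(df.shape)                # (6, 4)
print(df.ndim)                 # 2
print(df.size == rows * cols)  # True

# Total memory in bytes, including the contents of string columns
total_bytes = df.memory_usage(deep=True).sum()
print(total_bytes)
```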

 

Filling Missing Data

Cleaning and managing data so that we get appropriate results often takes a lot of time. If we find missing data in a dataset, we might need to fill it with a value that does not distort the overall analysis.

Pandas provides data manipulation tools to fix missing data by dropping it or filling it with some other value. The various methods are:
Method 1: Using the fillna() method

This method is a powerful way to fill in all missing data. It fills every missing entry in the dataset with a specific value.

The value parameter sets the value used to fill the gaps. Setting the inplace parameter to True modifies the DataFrame itself instead of returning a new one. Here is a program showing how to use it.

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager']]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.fillna(value=0, inplace=True)
print(df)

Output

This output shows how to fill missing data in Pandas dataframe in Python
We can also fill missing values from neighboring rows using forward fill (ffill) and backward fill (bfill).

  • Forward fill replaces a missing value with the nearest valid value above it.
  • Backward fill replaces a missing value with the nearest valid value below it.

Using ffill:

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager']]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
# ffill() replaces fillna(method='ffill'), which recent pandas versions deprecate
df.ffill(inplace = True)
print(df)

Output

This output shows how to fill missing data in Pandas dataframe in Python
Using bfill:

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 367000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
# bfill() replaces fillna(method='bfill'), which recent pandas versions deprecate
df.bfill(inplace=True)
print(df)

Output

This output shows how to fill missing data in Pandas dataframe in Python
Method 2: Using the replace() method

With replace(), you can replace the NaN values of a specific column by calling replace() on that column and passing np.nan along with the value that should take its place. Here is a program showing how to use it.

import pandas as pd
import numpy as np
nums = {'Smartphone Model Number': [1008, np.nan, 2317, 1195, np.nan,
                             424, 1110, 10, np.nan, 90, 1120,
                             np.nan],
        'Series': ["A", np.nan, "V", "Ax", "G", "Pro",
                          "Promax", "Go Pro", "X", "Z", "S", "Y"]}
df = pd.DataFrame(nums, columns=['Smartphone Model Number'])

df['Smartphone Model Number'] = df['Smartphone Model Number'].replace(np.nan, 0)
print(df)

Output

This output shows how to fill missing data in Pandas dataframe in Python
Method 3: Using the interpolate() method

This method uses the existing values in the DataFrame to estimate the values for the missing rows.

By default, it performs linear interpolation. We can also set the limit_direction parameter to backward or forward. Here is a program showing how to use it.

import pandas as pd
import numpy as np
nums = {'Smartphone Model Number': [1008, np.nan, 2317, 1195, np.nan,
                             424, 1110, 10, np.nan, 90, 1120,
                             612],
        'Series': ["A", np.nan, "V", "Ax", "G", "Pro",
                          "Promax", "Go Pro", "X", "Z", "S", "Y"]}
df = pd.DataFrame(nums)
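# Interpolate only the numeric column; the text column 'Series' cannot be interpolated
df['Smartphone Model Number'] = df['Smartphone Model Number'].interpolate(method = 'linear', limit_direction = 'backward')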
print(df)

Output

This output shows how to fill missing data in Pandas dataframe in Python
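
Before filling anything, it often helps to see how much data is missing, and the text above also mentions dropping as an option. Here is a minimal sketch of both steps with invented data: isna().sum() counts the missing values per column, and dropna() returns a new DataFrame without the incomplete rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Karlos", "Ray", "Sue", "Lee"],
                   "Monthly Salary": [530000, 410000, np.nan, np.nan]})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
complete_rows = df.dropna()
print(complete_rows)
```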

 

Accessing and Modifying Data

We can access and modify a single value by passing the row label and column label to dataframe.loc[] and using the = operator to set the new value.

To change a single value by position instead, we can use dataframe.iloc[]. Here is a program showing both ways of doing it.

import pandas as pd
# Named scores to avoid shadowing the built-in dict
scores = {'DataScience': [86, 93, 78], 'MachineLearning': [80, 91, 96], 'DeepLearning': [70, 80, 93], 'Python': [93, 81, 91]}
df = pd.DataFrame(scores, index = ['2020', '2021', '2022'])
print(df)
print('\n\n')
print('Modifying the value of Machine learning for 2021 value:')
# Single .loc assignment instead of chained indexing, which may not update df
df.loc['2021', 'MachineLearning'] = 99
print(df)
print('\n\n')
print('Modify a single value through index:')
df.iloc[2, 1] = 84
print(df)

Output

This output shows how to access and modify data in Pandas dataframe in Python
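
For single values, pandas also offers at and iat, which are scalar-only counterparts of loc and iloc. A short sketch using a cut-down version of the data above:

```python
import pandas as pd

scores = {'DataScience': [86, 93, 78], 'MachineLearning': [80, 91, 96]}
df = pd.DataFrame(scores, index=['2020', '2021', '2022'])

# at: label-based access to one cell (row label, column name).
df.at['2021', 'MachineLearning'] = 99
# iat: position-based access to one cell (row position, column position).
df.iat[2, 1] = 84

# Read the same cells back through loc and iloc.
print(df.loc['2021', 'MachineLearning'])  # 99
print(df.iloc[2, 1])                      # 84
```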
We can also access and modify multiple values in a DataFrame by specifying the column name and assigning a list of values to it. One technique is using dataframe.assign(), which returns a new DataFrame containing the updated column.

The other way is to assign to the column directly, like this: dataframe[column_name] = list_of_values. This changes the original DataFrame. Here is a program showing both ways of doing it.

import pandas as pd
# Avoid naming the variable dict, which would shadow the built-in type
scores = {'DataScience': [86, 93, 78], 'MachineLearning': [80, 91, 96], 'DeepLearning': [70, 80, 93], 'Python': [93, 81, 91]}
df = pd.DataFrame(scores, index = ['2020', '2021', '2022'])
print(df)
print('Modifying all the values of Python using assign():')
# assign() returns a new DataFrame; df itself stays unchanged
df2 = df.assign(Python = [94, 80, 69])
print(df2)
print('Modifying all the values of Data Science by accessing using df[]:')
# Direct assignment updates df in place
df['DataScience'] = [97, 79, 83]
print(df)

Output

This output shows how to access and modify data in Pandas dataframe in Python
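
One detail worth checking is that assign() leaves the original DataFrame alone. It also accepts a callable, which allows a new column to be derived from existing ones. The sketch below uses a reduced version of the data, and the Total column is a hypothetical addition for illustration:

```python
import pandas as pd

scores = {'DataScience': [86, 93, 78], 'Python': [93, 81, 91]}
df = pd.DataFrame(scores, index=['2020', '2021', '2022'])

# assign() leaves df untouched and returns a modified copy.
df2 = df.assign(Python=[94, 80, 69])

# A callable receives the DataFrame, so a column can be computed from others.
df3 = df.assign(Total=lambda d: d['DataScience'] + d['Python'])

print(df['Python'].tolist())   # [93, 81, 91]
print(df2['Python'].tolist())  # [94, 80, 69]
print(df3['Total'].tolist())   # [179, 174, 169]
```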

 

Inserting and Deleting Data

There are different ways to insert and delete data in a DataFrame. We can insert or delete both rows and columns.

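We can insert a new row by building it as a Series and adding it with pd.concat(). Older code uses the DataFrame.append() method for this, but append() was removed in pandas 2.0. Here is a code snippet showing how to use it.

import pandas as pd
# The raw string (r'...') keeps the backslashes in the Windows path literal
df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
print(df)
print()
Mark = pd.Series(data = ['Mark', 'Berlin', 41, 89], index = df.columns, name = 108)
print(Mark)
print()
# Turn the Series into a one-row DataFrame and concatenate it to df
df = pd.concat([df, Mark.to_frame().T])
print(df)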

Output

This output shows how to insert and delete data in Pandas dataframe in Python

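We can delete a row by passing its index label to the drop() method. Here is a code snippet showing how to use it.

import pandas as pd
# The raw string (r'...') keeps the backslashes in the Windows path literal
df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')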
print(df)
print()
df = df.drop(labels = [107])
print(df)

Output

This output shows how to insert and delete data in Pandas dataframe in Python
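
Since dataset.csv is not reproduced here, the following sketch rebuilds the row insert and delete steps on a small stand-in DataFrame so that it can run on its own. The column names and the first two rows are assumptions. It also avoids append(), which is no longer available in pandas 2.0 and later, by adding the row through loc:

```python
import pandas as pd

# Hypothetical stand-in for dataset.csv; the column names are assumptions.
df = pd.DataFrame(
    {'name': ['Ann', 'Bob'], 'city': ['Paris', 'Rome'],
     'age': [30, 35], 'score': [88, 92]},
    index=[106, 107],
)

# Insert: assigning to a label that does not exist yet adds a new row.
df.loc[108] = ['Mark', 'Berlin', 41, 89]

# Delete: drop() returns a new DataFrame without the given row label.
df = df.drop(labels=[107])

print(df.index.tolist())  # [106, 108]
```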

Also, we can use the insert() method to insert a new column in an existing DataFrame. Here is a code snippet showing how to use it.

import pandas as pd
import numpy as np
# The raw string (r'...') keeps the backslashes in the Windows path literal
df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
print(df)
print()
df.insert(loc = 4, column = 'salary', value = np.array(['86K', "82K", "76K", "57K", "39K", "72K", "61K"]))
print(df)

Output

This output shows how to insert and delete data in Pandas dataframe in Python

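We can delete a column with the del keyword followed by the column selection. Here is a code snippet showing how to use it.

import pandas as pd
# The raw string (r'...') keeps the backslashes in the Windows path literal
df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')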
print(df)
print()
del df['age']
print(df)

Output

This output shows how to insert and delete data in Pandas dataframe in Python
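
The column operations can be tried without the CSV file as well. This sketch uses a made-up three-row DataFrame (names and values are assumptions) and adds two alternatives to del: drop(columns=...), which returns a copy, and pop(), which returns the removed column:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy'], 'age': [30, 35, 41]})

# insert() modifies df in place; loc is the zero-based column position.
df.insert(loc=1, column='salary', value=['86K', '82K', '76K'])
print(df.columns.tolist())  # ['name', 'salary', 'age']

# drop(columns=...) returns a new DataFrame and leaves df unchanged.
without_age = df.drop(columns=['age'])
print(without_age.columns.tolist())  # ['name', 'salary']

# pop() removes the column in place and returns it as a Series.
ages = df.pop('age')
print(ages.tolist())  # [30, 35, 41]
```

Note that the value passed to insert() must have the same length as the DataFrame.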

 

Applying Arithmetic Operations

A DataFrame also supports arithmetic operations such as addition, subtraction, multiplication, and division between its columns. The operation is applied element-wise, row by row. Here is a program showing how to use it.

import pandas as pd
d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 367000]]
df = pd.DataFrame(d, columns = ['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
print(df)
print()
print(df['Emp ID'] + df['Monthly Salary'])

Output

This output shows how to apply Arithmetic Operations in Pandas dataframe in Python
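
A column can also be combined with a scalar, and missing values propagate through the calculation unless told otherwise. The sketch below is an illustration under assumptions: the Annual Salary and With Bonus columns and the bonus amounts are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Karlos', 'Ray', 'Sue'],
    'Monthly Salary': [530000, 410000, np.nan],
})

# Operators work element-wise on a whole column; NaN stays NaN.
df['Annual Salary'] = df['Monthly Salary'] * 12

# Method forms such as add() accept fill_value, which stands in for a
# missing entry when the other operand has a value.
bonus = pd.Series([1000, 1000, 1000])
df['With Bonus'] = df['Monthly Salary'].add(bonus, fill_value=0)

print(df)
```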

 

Visualizing DataFrame values

Data visualization has become a separate domain in data analysis. It helps us see patterns in the data and interact better with the granular data extracted through various DataFrame operations.

Usually, we create visualizations using the Matplotlib library. Here is a program showing how to plot the two marks of each candidate in the dataset.

import pandas as pd
import matplotlib.pyplot as plt
# The raw string (r'...') keeps the backslashes in the Windows path literal
df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
df.plot()
plt.show()

Output

This output shows how to visualize DataFrame values in Pandas dataframe in Python
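
To experiment with plotting without the CSV file, the sketch below builds a small DataFrame of made-up marks (the names, subjects, and values are assumptions) and draws a bar chart. It selects the non-interactive Agg backend and writes the image to an in-memory buffer so that it also runs on a machine without a display:

```python
import io

import matplotlib
matplotlib.use('Agg')  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical marks for three candidates.
df = pd.DataFrame(
    {'Maths': [78, 85, 92], 'Science': [88, 79, 95]},
    index=['Ann', 'Bob', 'Cy'],
)

# Each column becomes one group of bars; the index labels the x-axis.
ax = df.plot(kind='bar', title='Marks per candidate')
ax.set_ylabel('Marks')
plt.tight_layout()

# Render to a buffer; pass a file name instead to write an image to disk.
buffer = io.BytesIO()
plt.savefig(buffer, format='png')
```

In an interactive session, the backend line can be dropped and plt.show() used in place of savefig().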

Conclusion

Data handling is a critical phase of data science, and one needs to hone it to extract granular insights proficiently.

For data wrangling, data cleansing, and other data management operations, one should know how to extract data from CSV or other file formats and how to store and handle it efficiently through data structures like DataFrames.

That is where Pandas provides the DataFrame to manage data on a massive scale effectively. It comes with various pre-defined methods and techniques that make data handling easy.

This article highlighted the significant methods, techniques, and data manipulation practices we can leverage while performing data analysis.
