Python Pandas DataFrame tutorial
Data analysis has become an essential aspect of every business. Since data has become the new oil, it is paramount to refine it into its most useful form to get granular insight into the business.
Pandas is an open-source Python library for performing various data analysis tasks. It helps in handling structured data through its well-known data structures. In this article, we will discuss DataFrames and how to use them in our programs.
Why do we use Pandas?
Pandas is a prominent library for data analysis, capable of handling large and complex structured data through DataFrames.
It helps in filtering data, cleansing it, arranging it in a meaningful order, and more. Messy real-world data gets sorted and cleaned using Pandas, making it one of the essential early-stage tools for data handling.
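As a quick taste of that workflow, here is a minimal sketch of filtering and sorting a small table (the column names and values are made up for illustration):

```python
import pandas as pd

# A tiny, made-up sales table
sales = pd.DataFrame({
    'item': ['Pen', 'Book', 'Bag', 'Lamp'],
    'price': [10, 250, 900, 400],
})

# Filtering: keep only the rows where price is above 100
expensive = sales[sales['price'] > 100]

# Arranging: sort the remaining rows by price
ordered = expensive.sort_values(by='price')
print(ordered)
```

The boolean expression inside the brackets selects rows, and sort_values() reorders them, which is the core of most cleanup pipelines.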
1D and 2D data structures in Pandas
There are two prevalent data structures that Pandas comes with for data analysis: Series and DataFrame. While a Series is a one-dimensional labeled array that can hold data of any type (integer, string, float, Python objects, etc.), a DataFrame is two-dimensional and size-mutable.
These data structures define how the data will get arranged for analysis through Pandas. Let us understand DataFrame in detail in this article.
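The contrast is quick to see in code; a minimal sketch:

```python
import pandas as pd

# A Series: one-dimensional, with an index of row labels
s = pd.Series([50, 500, 5000], index=['a', 'b', 'c'])

# A DataFrame: two-dimensional, with both row and column labels
df = pd.DataFrame({'Scoville': [50, 500, 5000]}, index=['a', 'b', 'c'])

print(s.ndim)   # 1
print(df.ndim)  # 2
```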
What is DataFrame?
A DataFrame is a size-mutable, two-dimensional data structure that can store heterogeneous values with labeled axes (rows and columns).
This two-dimensional data structure remains aligned in a tabular fashion through rows and columns like that of a spreadsheet or database table.
We can create DataFrames using Python list, dictionary, Pandas Series, NumPy array, files (CSV), etc. Let us now check how to create a DataFrame using all of these.
We have to use the Pandas module to create a DataFrame. The pandas.DataFrame() constructor creates the DataFrame object in Python.
Creating a DataFrame from a Python List
We can use a nested list to create a DataFrame through Pandas. Each inner list will represent a row for the DataFrame. Here is a code snippet showing how to create a DataFrame using a list.
import pandas as pd

list1 = [[50, "French Fry", "Not spicy"],
         [500, "Pizza", "With Oregano"],
         [5000, "Chocolate Shake", "with choco crisps"]]
df1 = pd.DataFrame(list1)
print(df1)
Output
Creating a DataFrame from Dictionaries
We can also create a DataFrame using the Python dictionary. In this case, within the dictionary, we can provide the list of values where each list will represent the column values.
Here is a code snippet showing how to create a DataFrame using a dictionary containing lists.
import pandas as pd

dictDat = {'Scoville': [50, 500, 5000],
           'Name': ["French Fry", "Pizza", "Chocolate Shake"],
           'Feeling': ["Not spicy", "With Oregano", "with choco crisps"]}
df1 = pd.DataFrame(dictDat)
print(df1)
Output
Creating a DataFrame from NumPy Arrays
We can also use a NumPy array (ndarray) to create a DataFrame. Since NumPy is a popular module for data science, analysts and professionals often prefer this route. We have to import the NumPy module and create a NumPy array first.
Then, we can pass the NumPy array to pandas.DataFrame() to convert it to a DataFrame. Here is a code snippet showing how to create a DataFrame using a NumPy array.
import numpy as np
import pandas as pd

arry = np.array([[1, 2, 100], [2, 4, 100], [3, 8, 100]])
df1 = pd.DataFrame(arry, columns=['a', 'b', 'c'])
print(df1)
Output
Creating a DataFrame from Series
It is possible to create a DataFrame from multiple Series. We have to create the Series from lists and then use those Series as dictionary values.
The dictionary object will then be passed to the pandas.DataFrame() for converting it to a DataFrame. Here is a code snippet showing how to implement it.
import pandas as pd

researcher = ['Karlos', 'Ray', 'Ramanujan', 'Dee']
papers = [211, 118, 97, 162]
# Creating two Series from lists
r_series = pd.Series(researcher)
paper_research = pd.Series(papers)
# Creating a dictionary with the pandas Series objects as values
dframe = {'Researchers': r_series, 'Papers': paper_research}
# Creating a DataFrame with the help of the dictionary
result = pd.DataFrame(dframe)
print(result)
Output
Creating a DataFrame from Files
Pandas supports creating DataFrames from different types of files, most commonly CSV, JSON, Excel (XLS/XLSX), and HTML.
The methods used to read them are read_csv(), read_json(), read_excel(), and read_html(), respectively. The most common file type is the CSV file.
Here is a code snippet showing how to open and display a file data as DataFrame:
import pandas as pd

pepperDataFrame = pd.read_csv('busses_emp.csv')
print(pepperDataFrame)
Output
Adding columns to a DataFrame
There are different ways to add a new column to an existing DataFrame. These are:
Method 1: Using a list
We can create a new column for an existing DataFrame through a list and assign it to the DataFrame object. If the number of elements in the list does not match the number of rows, pandas raises an error. Here is a code showing how to implement it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
# Appending a new column named Designation with the values given as a list
designation = ['Tech Content Writer', 'Data Analyst', 'Software Developer', 'Cloud Architect']
df['Designation'] = designation
print(df)
Output
Method 2: Using the insert() method
Another way to create and insert another column to an existing DataFrame is through the insert() method. Here is a code snippet showing how to implement it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
df.insert(2, "Designation",
          ['Tech Content Writer', 'Data Analyst', 'Software Developer', 'Cloud Architect'],
          True)
print(df)
Output
Method 3: Using the assign() method
Using this method, we can create a new DataFrame having a new column and assign it to an old DataFrame. Here is a code snippet showing how to implement it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
df2 = df.assign(Location=['Noida', 'Bangalore', 'Hyderabad', 'Jaipur'])
print(df2)
Output
Method 4: Using a dictionary
We can use an independent dictionary to add a new column to a Pandas DataFrame; the dictionary keys must align with the DataFrame's index labels. Here is a code snippet showing how to implement it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA']}
df = pd.DataFrame(d)
print(df)
print()
# The dictionary keys must match the DataFrame's index labels (0-3 here)
location = {0: 'Noida', 1: 'Bangalore', 2: 'Hyderabad', 3: 'Jaipur'}
df['Location'] = pd.Series(location)
print(df)
Output
Deleting Columns from a DataFrame
There are different ways to delete a column from a DataFrame. These are:
Method 1: Using the drop()
We can use the DataFrame’s drop() method to remove columns by specifying the column (label) names and the corresponding axis. Here is a code snippet showing how to implement it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location': ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
print(df)
df.drop('Height', inplace=True, axis=1)
print(df)
Output
Method 2: Using the del keyword
We can use the del keyword to delete a column. In this technique, we have to specify the DataFrame name, along with the column name (label). Here is a code snippet showing how to implement it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location': ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
print(df)
del df['Qualification']
print(df)
Output
Method 3: Using the pop() method
The pop() method removes a column from the DataFrame and returns that column as a Series. Here is a code snippet showing how to implement it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location': ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
print(df)
df.pop('Height')
print(df)
Output
Iterating Over a Pandas DataFrame
It is essential to know how to iterate over the items of a DataFrame. There are different methods and ways to loop over a Pandas DataFrame.
Each of them has a different mechanism for displaying or extracting the data, so data analysts and professionals can use whichever suits the task.
Method 1: Using a normal for loop
Iterating over the DataFrame directly yields the column names, much like iterating over a dictionary yields its keys. The code will look something like this.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location': ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for col_name in df:
    print(type(col_name))
    print(col_name)
    print('--------\n')
Output
Method 2: Using the items() method
We can use the items() method of the DataFrame, which iterates over the columns, yielding each column name together with its values as a Series. We can display them using a for loop. Here is the code showing how to use it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location': ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for col_name, dat in df.items():
    print("column name:", col_name, "\n", dat)
Output
Method 3: Using iterrows() method
The iterrows() method loops over the DataFrame row-wise, yielding each row's index together with the row's data as a Series. Here is the code showing how to use it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location': ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for g, r in df.iterrows():
    print(f"Index Value ->: {g}")
    print(f"{r}\n")
Output
Method 4: Using the itertuples() method
The itertuples() method returns each row of the DataFrame as a namedtuple, one row at a time, with the index label as the first field. Here is the code showing how to use it.
import pandas as pd

d = {'Name': ['Karl', 'Iris', 'Ray', 'Dee'],
     'Height': [6.1, 6.0, 5.9, 5.6],
     'Qualification': ['MS', 'MBA', 'MCom', 'MCA'],
     'Location': ['Noida', 'Bangalore', 'Hyderabad', 'Jaipur']}
df = pd.DataFrame(d)
for r in df.itertuples():
    print(type(r))
    print(r)
    print('--------')
Output
Indexing DataFrames
The pandas.DataFrame() constructor accepts an index parameter that takes a list of labels to use as the row index of the DataFrame.
Here is the code to show how to implement indexing in a DataFrame.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
print(df)
Output
Slicing DataFrames
Slicing a data structure means dividing the data structure into smaller sub-parts to extract specific values from a wide range of values. Pandas allow different ways to extract sliced data from the DataFrame.
Method 1: Slicing out a single column
To extract a single column from the DataFrame, we can refer to the column label in several different ways to slice the entire column out of the DataFrame.
Here is a code snippet showing how to implement it:
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
# technique number 1
print(df.loc[:, "Designation"])
# technique number 2
print(df["Monthly Salary"])
# technique number 3
print(df.Name)
Output
Method 2: Select and slice out multiple columns
There is another way to slice multiple DataFrame columns using the dataframe.loc[]. Here is a code snippet showing how to implement it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
# technique number 1
print(df.loc[:, ["Designation", "Monthly Salary"]])
# technique number 2
print(df[["Name", "Monthly Salary"]])
Output
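The techniques above slice columns; rows can be sliced in much the same way. A sketch using the same data, with loc (label-based, end-inclusive) and iloc (position-based, end-exclusive):

```python
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])

# Label-based slicing: rows 'B' through 'D', both ends included
print(df.loc['B':'D'])

# Position-based slicing: rows at positions 1 and 2 (end excluded)
print(df.iloc[1:3])
```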
Various Built-in DataFrame Methods
To ease the task of handling data within a DataFrame, Pandas allows us to implement various methods. These are:
- read_csv(): It is a popular method that extracts data from a CSV (Comma Separated Value) file and stores it in a Pandas DataFrame so we can use it and extract insight from it. All you need to do is pass the local path of your CSV file as a string. This method can also read files that use delimiters like tab or pipe (|) instead of the comma (,). Here is a code snippet showing how to use it.
import pandas as pd

busDf = pd.read_csv('busses_emp.csv')
print(busDf)
Output
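For a non-comma delimiter, pass it through the sep parameter. A self-contained sketch, using an in-memory buffer to stand in for a real pipe-delimited file:

```python
import io
import pandas as pd

# Pipe-delimited data, standing in for a file on disk
raw = "Name|Age\nKarl|41\nIris|35"
df = pd.read_csv(io.StringIO(raw), sep='|')
print(df)
```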
- head(): The head(n) function of Pandas DataFrame objects returns the first ‘n’ rows of a dataset. If you do not pass any value as a parameter, it returns the first five rows (top 5 rows) of the DataFrame. If you want more or fewer rows, specify the number. Here is a code showing you how to use it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
# technique number 1
print(df.head())
print()
# technique number 2
print(df.head(3))
Output
- tail(): The tail(n) function of Pandas DataFrame objects returns the last ‘n’ rows of a dataset. If you do not pass any value as a parameter, it returns the last five rows (bottom 5 rows) of the DataFrame. If you want more or fewer rows, specify the number. Here is a code showing you how to use it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
# technique number 1
print(df.tail())
print()
# technique number 2
print(df.tail(2))
Output
- describe(): This method generates descriptive statistics of the data in the DataFrame, such as count, mean, standard deviation, minimum, maximum, and quartiles, giving a quick overview of the dataset's distribution. Here is an example showing how it represents the basic statistical data.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'],
                  index=['A', 'B', 'C', 'D', 'E', 'F'])
print(df.describe())
Output
- memory_usage(): This method returns a Pandas Series holding the memory usage of each column (in bytes) of the DataFrame. If we set the deep parameter to True, we get the actual memory occupied by every column, including object data. Here is an example that shows how to implement it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
print(df.memory_usage(deep=True))
Output
- to_datetime(): This method converts a Python object to datetime format. It can accept lists, floating-point values, integers, Series, and DataFrames as its argument. We can use this robust method for datasets that have time series values and dates in them. Here is an example that shows how to implement it.
import pandas as pd

d = pd.read_csv("datetimedata.csv")
d["Date"] = pd.to_datetime(d["Date"])
print(d.info())
print(d)
Output
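Since the snippet above depends on an external CSV file, here is a self-contained sketch of the same conversion with made-up dates:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2022-01-05', '2022-02-10', '2022-03-15']})
# Convert the string column to datetime64 dtype
df['Date'] = pd.to_datetime(df['Date'])
# Datetime columns unlock the .dt accessor
print(df['Date'].dt.year)
```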
- value_counts(): This method returns a Series containing the count of each unique value in the data. For example, if you have a dataset containing information about 3,000 employees of an organization, value_counts() will identify and return the number of occurrences of each distinct value.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
print(df.value_counts())
Output
- drop_duplicates(): If you want to remove duplicate rows from a DataFrame, this method can help. Through the keep parameter, it lets you retain either the first or the last occurrence of each duplicate; you can also set the inplace and ignore_index parameters. Here is a code snippet showing how to use it.
import pandas as pd

df = pd.DataFrame({
    'Brand': ['Zomato', 'Sugar', 'Zomato', 'Tata', 'Tata'],
    'Items': ['Noodles', 'Nail Polish', 'Noodles', 'Car', 'pack']
})
print(df.drop_duplicates())
Output
- groupby(): You can use the groupby() method to group DataFrame values by one or more columns and perform mathematical functions on the groups. It also helps in summarizing data in an easy-to-understand way. Here is a code snippet showing how to use it.
import pandas as pd

df = pd.read_csv("nba.csv")
dat = df.groupby(['Team', 'Age'])
print(dat.first())
Output
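The snippet above needs an external nba.csv file; here is a self-contained sketch of grouping and aggregating with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A', 'A', 'B', 'B'],
    'Points': [10, 20, 5, 15],
})

# Sum the points within each team
totals = df.groupby('Team')['Points'].sum()
print(totals)
```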
- merge(): This method merges two Series or DataFrame objects on a common field. Here is a code snippet showing how to use it.
import pandas as pd

df1 = pd.DataFrame({'Company1': ['Item1', 'Item2', 'Item3', 'Item4'],
                    'val': [1, 2, 3, 5]})
df2 = pd.DataFrame({'Company2': ['Item1', 'Item2', 'Item3', 'Item4'],
                    'val': [5, 6, 7, 8]})
print(df1.merge(df2, left_on='Company1', right_on='Company2'))
Output
- sort_values(): As the name suggests, it will sort the DataFrame’s column values in increasing or decreasing order. Here is a code snippet showing how to implement it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.sort_values(by='Monthly Salary', inplace=True)
print(df)
Output
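For decreasing order, pass ascending=False to sort_values(); a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Ray', 'Karlos', 'Dee'],
                   'Monthly Salary': [410000, 530000, 320000]})

# Sort salaries from highest to lowest
df.sort_values(by='Monthly Salary', ascending=False, inplace=True)
print(df)
```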
- fillna(): Large datasets often contain lots of NaN (Not a Number) values. This method replaces all the NaN or missing values of a Series or DataFrame with a more reasonable value. Here is a code snippet showing how to implement it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager']]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.fillna(230000, inplace=True)
print(df)
Output
- read_json(): We often use JavaScript Object Notation (JSON) for exchanging data on the web, for example when communicating with or fetching data from the backend of an application. If you want to read JSON data and convert it to a DataFrame, Pandas provides the read_json() method. Here is a code snippet showing how to implement it.
import io
import pandas as pd

# JSON data assigned to a string (wrapped in StringIO, since passing a
# literal JSON string to read_json() is deprecated in newer pandas)
json_str = '{"column 1" : {"rowA":1,"rowB":2,"rowC":3}, "column 2" : {"rowA":"Karl","rowB":"Dee","rowC":"Ray"}}'
# reading the JSON data and converting it to a DataFrame
df = pd.read_json(io.StringIO(json_str))
print(df)
Output
Retrieving Labels from DataFrames
There are two attributes for retrieving the labels of a DataFrame: .index (to fetch the row labels) and .columns (to fetch the column labels).
Here are two code snippets showing how to implement them. dataframe.index:
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.sort_values(by='Monthly Salary', inplace=True)
print(df.index)
Output
dataframe.columns:
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.sort_values(by='Monthly Salary', inplace=True)
print(df.columns)
Output
Pandas DataFrame Size
Often it becomes essential to check the size and dimensions of a DataFrame.
That is where the attributes .size, .ndim, and .shape help: they return the total number of data values, the number of dimensions, and the shape (number of rows and columns) of the DataFrame, respectively.
Here is a code that shows how to use it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
print(df.size)
print(df.ndim)
print(df.shape)
Output
If you want to check the size of the DataFrame in terms of memory usage, you can do so using the memory_usage() method. Here is a code that shows how to use it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head', 370000],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 210000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
print(df.memory_usage())
Output
Filling Missing Data
Cleaning and managing data so we can get appropriate results often takes a lot of time. If we find missing data in a dataset, we might need to fill it with some value that does not hamper the overall analysis.
Pandas provide data manipulation tools and techniques to fix missing data via dropping or filling them with some other value. The various methods are:
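Dropping is the simpler of the two options: dropna() removes any row (or, with axis=1, any column) that contains a missing value. A small sketch with made-up data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Karl', 'Iris', 'Ray'],
                   'Salary': [530000, np.nan, 410000]})

# Drop every row that has at least one NaN
cleaned = df.dropna()
print(cleaned)
```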
Method 1: Using the fillna() method
This method is a powerful way to fill in all missing data. It iterates over all the rows of the dataset and fills the gaps with a specific value.
If you pass a value through the value parameter, it fills all the gaps with that value. To modify the DataFrame in place, set the inplace parameter to True. Here is a code showing how to use it.
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager']]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
df.fillna(value=0, inplace=True)
print(df)
Output
Instead of a fixed value, we can also forward-fill or backward-fill (in newer pandas versions, the ffill() and bfill() methods replace the deprecated method parameter of fillna()).
- Forward-filling (ffill) fills each missing value with the nearest value above it.
- Backward-filling (bfill) fills each missing value with the nearest value below it.
Using ffill:
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager']]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
# fillna(method='ffill') is deprecated in newer pandas; use ffill() instead
df.ffill(inplace=True)
print(df)
Output
Using bfill:
import pandas as pd

d = [['Karlos', 1, 'CTO', 530000],
     ['Ray', 2, 'CMO', 410000],
     ['Sue', 3, 'Marketing Head'],
     ['Bill', 4, 'Security Head', 400000],
     ['Dee', 5, 'CFO', 320000],
     ['Lee', 6, 'IT Manager', 367000]]
df = pd.DataFrame(d, columns=['Name', 'Emp ID', 'Designation', 'Monthly Salary'])
# fillna(method='bfill') is deprecated in newer pandas; use bfill() instead
df.bfill(inplace=True)
print(df)
Output
Method 2: Using the replace() method
With replace(), you can replace the NaN values of a specific column: select the column by name, call replace(), and pass np.nan along with the value that should replace it. Here is a code showing how to use it.
import pandas as pd
import numpy as np

nums = {'Smartphone Model Number': [1008, np.nan, 2317, 1195, np.nan, 424,
                                    1110, 10, np.nan, 90, 1120, np.nan],
        'Series': ["A", np.nan, "V", "Ax", "G", "Pro",
                   "Promax", "Go Pro", "X", "Z", "S", "Y"]}
df = pd.DataFrame(nums, columns=['Smartphone Model Number'])
df['Smartphone Model Number'] = df['Smartphone Model Number'].replace(np.nan, 0)
print(df)
Output
Method 3: Using interpolate() method
This method uses the existing values in the DataFrame to estimate and fill in the missing values.
By default it performs linear interpolation. We can also set the limit_direction parameter to backward or forward. Here is a code showing how to use it.
import pandas as pd
import numpy as np

nums = {'Smartphone Model Number': [1008, np.nan, 2317, 1195, np.nan, 424,
                                    1110, 10, np.nan, 90, 1120, 612],
        'Series': ["A", np.nan, "V", "Ax", "G", "Pro",
                   "Promax", "Go Pro", "X", "Z", "S", "Y"]}
df = pd.DataFrame(nums)
df.interpolate(method='linear', limit_direction='backward', inplace=True)
print(df)
Output
Accessing and Modifying Data
We can access and modify a single value by specifying the column and row labels within square brackets and assigning a new value with the = operator.
We can also change a single value by position using dataframe.iloc[].
import pandas as pd

scores = {'DataScience': [86, 93, 78],
          'MachineLearning': [80, 91, 96],
          'DeepLearning': [70, 80, 93],
          'Python': [93, 81, 91]}
df = pd.DataFrame(scores, index=['2020', '2021', '2022'])
print(df)
print('\n\n')
print('Modifying the MachineLearning value for 2021:')
df.loc['2021', 'MachineLearning'] = 99
print(df)
print('\n\n')
print('Modifying a single value through the positional index:')
df.iloc[2, 1] = 84
print(df)
Output
We can also access and modify multiple values in a DataFrame by specifying the column name and assigning a list of values to it. One technique is using dataframe.assign(), which returns a new DataFrame with the updated column.
The other way is assigning directly, like this: dataframe[column_name] = list_of_values. Here is a code showing both ways of doing it.
import pandas as pd

scores = {'DataScience': [86, 93, 78],
          'MachineLearning': [80, 91, 96],
          'DeepLearning': [70, 80, 93],
          'Python': [93, 81, 91]}
df = pd.DataFrame(scores, index=['2020', '2021', '2022'])
print(df)
print('Modifying all the values of Python using assign():')
df2 = df.assign(Python=[94, 80, 69])
print(df2)
print('Modifying all the values of DataScience by accessing df[]:')
df['DataScience'] = [97, 79, 83]
print(df)
Output
Inserting and Deleting Data
There are different ways to insert and delete data from a DataFrame. We can insert or delete new rows or new columns to handle a DataFrame’s data.
Using append
We can insert a new row by building it as a Series and concatenating it with pd.concat() (the older DataFrame.append() method was removed in pandas 2.0). Here is a code snippet showing how to do it.
import pandas as pd

df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
print(df)
print()
mark = pd.Series(data=['Mark', 'Berlin', 41, 89], index=df.columns, name=108)
print(mark)
print()
# DataFrame.append() was removed in pandas 2.0; concatenate instead
df = pd.concat([df, mark.to_frame().T])
print(df)
Output
Using insert
Also, we can use the insert() method to insert a new column in an existing DataFrame. Here is a code snippet showing how to use it.
import pandas as pd
import numpy as np

df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
print(df)
print()
df.insert(loc=4, column='salary',
          value=np.array(['86K', '82K', '76K', '57K', '39K', '72K', '61K']))
print(df)
Output
Delete row
You can delete a row using the drop method like this:
import pandas as pd

df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
print(df)
print()
df = df.drop(labels=[107])
print(df)
Output
Delete column
Similarly, you can delete a column with the del keyword:
import pandas as pd

df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
print(df)
print()
del df['age']
print(df)
Output
Visualizing DataFrame values
Data visualizations have become a separate domain in data analysis. It helps in visualizing data and better interacting with the granular data extracted through various DataFrame operations.
Usually, we perform data visualizations using the Matplotlib library. Here is a code showing how to plot the numeric columns of the dataset.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\Users\Gaurav\dataset.csv')
df.plot()
plt.show()
Output
Conclusion
Data handling in data science has become a critical phase that one needs to hone to extract granular insights with proficiency.
For data wrangling, data cleansing, and other data management operations, one should know how to extract data from CSV or other file formats and store and handle them efficiently through data structures like DataFrames.
That is where Pandas provide the DataFrame to manage data on a massive scale effectively. It comes with various pre-defined methods and techniques that make data handling easy.
This article highlighted the significant methods, techniques, and manipulating practices we can leverage while performing data analysis.
Mokhtar is the founder of LikeGeeks.com. He has worked as a Linux system administrator since 2010. He is responsible for maintaining, securing, and troubleshooting Linux servers for multiple clients around the world. He loves writing shell and Python scripts to automate his work.