Read CSV using Python Pandas read_csv

In this tutorial, we will focus on the read_csv function, which allows us to read CSV files and convert them into a Pandas DataFrame.

The read_csv function is more than just a CSV file loading tool. It is equipped with a multitude of parameters to handle a wide range of scenarios that might occur in a CSV file.

 

 

What is Pandas read_csv?

Whether it’s handling skipping unnecessary rows, getting specific columns, defining column data types, parsing date columns, handling blank lines, or even dealing with null values, read_csv has got you covered.

You’ll learn how to read different CSV files in different ways, so let’s get started.

 

Specifying the Data Source

The first argument for pandas read_csv is a string path to the CSV file we want to read. This string could be a URL, a file path, or any valid string name representing the source of the data.
Let’s load from a CSV file some telecom company data.

df = pd.read_csv('telecom_data.csv')
print(df.head())

Output:

   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure  PhoneService
0        7590    Male              0      No         No       1            Yes
1        5575    Male              0      No         No      34             No
2        3668  Female              0      No         No       2            Yes
3        7795  Female              0      No         No      45             No
4        9237  Female              0      No         No       2            Yes

Here, we use the read_csv function of pandas to load the data from our CSV file, and then we print the first 5 rows using the .head() function.

 

Customizing Delimiters

Sometimes, data may not be separated by commas (the default separator) in the CSV files. The ‘sep’ parameter allows you to specify the delimiter that separates the data.

I converted the sample file to a tab-separated to demonstrate how to handle files with different separators and it looks like this:

CustomerID	Gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService
7590	Male	0	No	No	1	Yes
5575	Male	0	No	No	34	No
3668	Female	0	No	No	2	Yes
7795	Female	0	No	No	45	No
9237	Female	0	No	No	2	Yes

Now, let’s read using read_csv:

df = pd.read_csv('telecom_data.tsv', sep='\t')
print(df.head())

Output:

   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0        7590    Male              0      No         No       1          Yes
1        5575    Male              0      No         No      34           No
2        3668  Female              0      No         No       2          Yes
3        7795  Female              0      No         No      45           No
4        9237  Female              0      No         No       2          Yes

In this case, ‘\t’ (tab) is used as the delimiter. If the delimiter is not specified correctly, pandas will interpret the entire row as a single column.

 

Selecting Data Headers

By default, pandas infer the column names from the first line of the file. But sometimes, you might want to use a different row as your headers. You can achieve this by setting the ‘header’ parameter.

df = pd.read_csv('telecom_data.csv', header=1)
print(df.head())

Output:

   7590    Male  0   No  No  1   Yes
0  5575    Male  0   No  No 34   No
1  3668  Female  0   No  No  2  Yes
2  7795  Female  0   No  No 45   No
3  9237  Female  0   No  No  2  Yes
4  9305    Male  0  Yes  No  8  Yes

Now, pandas will infer the column names from the second line (indexing starts from 0), not the first line.

For files with multi-index column names, you can pass a list to the header parameter to specify the rows to use.

data = StringIO("Header1,Header2,Header3\nSubHeader1,SubHeader2,SubHeader3\nA,B,C\n1,2,3")
df = pd.read_csv(data, header=[0,1])
print(df)

Output:

  Header1 SubHeader1  Header2 SubHeader2  Header3 SubHeader3
0       A          B        C          1        2          3

Note: if you have data in the first row, remember to set header=0 otherwise, it will be treated as column names.

 

Setting the DataFrame Index

The ‘index_col’ parameter specifies the column to use as the row labels of the DataFrame. It can either be given as string name or column index.

df = pd.read_csv('telecom_data_2.csv', index_col='customer_id')
print(df.head())

Output:

             gender  age  monthly_charges  churn
customer_id                                    
1              Male   34             56.2     No
2            Female   45             74.6    Yes
3            Female   23             66.7     No
4              Male   51             86.4    Yes
5            Female   30             54.6     No

Here, the ‘customer_id’ column is used as the index of the DataFrame.

 

Skipping Rows

The ‘skiprows’ parameter allows you to skip certain rows. This can be a list of integers specifying the row numbers, or a callable function that is evaluated against the row indices, returning true if the row should be skipped.

df = pd.read_csv('telecom_data_2.csv', skiprows=[1,2,3])
print(df.head())

Output:

   customer_id  gender  age  monthly_charges churn
0            4    Male   51             86.4   Yes
1            5  Female   30             54.6    No
2            6    Male   45             78.5   Yes
3            7  Female   25             56.6    No
4            8    Male   35             87.6   Yes

In this case, rows 1, 2, and 3 (2nd, 3rd, and 4th in human count) are skipped.

Conditional Skip

You can also pass a callable function to skip specific rows where the index matched the criteria:

df = pd.read_csv('telecom_data.csv', skiprows=lambda i: i % 2 == 0)
print(df)

Output:

   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0        5575    Male              0      No         No      34           No
1        7795  Female              0      No         No      45           No

As you can see, the DataFrame now contains only the odd rows because we skip the even rows.

All this without loading the entire file and consuming the memory.

Note that you can use the callable function against the row indices only, not the row values.

 

Limiting Rows to Read

You can use the ‘nrows’ parameter to specify how many rows to read from the start of the data.

df = pd.read_csv('telecom_data_2.csv', nrows=10)
print(df)

Output:

   customer_id  gender  age  monthly_charges churn
0            1    Male   34             56.2    No
1            2  Female   45             74.6   Yes
2            3  Female   23             66.7    No
3            4    Male   51             86.4   Yes
4            5  Female   30             54.6    No
5            6    Male   45             78.5   Yes
6            7  Female   25             56.6    No
7            8    Male   35             87.6   Yes
8            9  Female   28             54.8    No
9           10    Male   50             82.4   Yes

Here, we are reading only the first 10 rows from the file.

 

Get specific Columns

The usecols argument helps to get a subset of columns from the CSV file. This can be a sequence of strings (the names of the columns) or integers (the column indices).

df = pd.read_csv('telecom_data_2.csv', usecols=['customer_id', 'age'])
print(df.head())

Output:

   customer_id  age
0            1   34
1            2   45
2            3   23
3            4   51
4            5   30

Here, we’re only loading the ‘customer_id’ and ‘age’ columns from the CSV file.

 

Providing Column Names

If the CSV file does not contain headers, or you want to replace the existing headers, you can provide a list of column names using the ‘names’ parameter.

df = pd.read_csv('telecom_data.csv', names=['ID', 'Gender', 'Age', 'Charges', 'Churn'])
print(df.head())

Here, new column names are passed, i.e., ‘ID’, ‘Gender’, ‘Age’, ‘Charges’, ‘Churn’ which we will use as the column names.

 

Handling CSV files with quotes

Occasionally, CSV files have fields with quotes. By default, pandas considers double quotes as the standard. However, if a file uses another character as quotes, you can set it with the ‘quotechar’ parameter.

df = pd.read_csv('telecom_data.csv', quotechar="'")

This code doesn’t change the output but it handles the loading of the file when the fields are enclosed with single quotes.

 

Controlling Data Types

The ‘dtype’ parameter allows you to specify data types for different columns. This can be a dictionary where the keys can either be integers or column labels, and the values are NumPy data types.

df = pd.read_csv('telecom_data.csv', dtype={'customer_id': 'int64', 'age': 'float64'})
print(df.dtypes)

Output:

customer_id         int64
gender             object
age               float64
monthly_charges    float64
churn              object
dtype: object

Here, the ‘customer_id’ column is forced to have integer (int64) dtype, while the ‘age’ column is specified to have a floating point (float64) dtype.

Note: specifying data types plays an important role in terms of performance. It can speed up loading large datasets.

 

Handle Boolean data

If your data contains Boolean values represented as words (like ‘Yes’/’No’ or ‘Y/N’), you can map them to Boolean values using ‘true_values’ and ‘false_values’ parameters.

df = pd.read_csv('telecom_data.csv', true_values=['Yes'], false_values=['No'])
print(df.head())

Output:

   CustomerID  Gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService
0        7590    Male              0    False       False       1          True
1        5575    Male              0    False       False      34         False
2        3668  Female              0    False       False       2          True
3        7795  Female              0    False       False      45         False
4        9237  Female              0    False       False       2          True

 

This code will interpret ‘Yes’ and ‘No’ as Boolean values (True and False) in the entire DataFrame.

 

Parsing Date Columns

You can use the ‘parse_dates’ parameter to specify the columns that should be parsed as dates.

df = pd.read_csv('telecom_data.csv', parse_dates=['start_date'])
print(df.dtypes)

Output:

customer_id                 int64
gender                     object
age                         int64
monthly_charges           float64
churn                      object
start_date         datetime64[ns]
dtype: object

Here, the ‘start_date’ column is parsed as a datetime object.

 

Handling non-US date formats

The ‘dayfirst’ parameter is used to tell pandas to interpret the date as DD/MM (True) or MM/DD (False). By default, it’s set to False.

df = pd.read_csv('telecom_data.csv', parse_dates=['Activation_date'], dayfirst=True)

Output:

   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService ActivationDate
0        7590    Male              0      No         No       1          Yes     2019-07-12
1        5575    Male              0      No         No      34           No     2019-03-11
2        3668  Female              0      No         No       2          Yes     2019-11-20
3        7795  Female              0      No         No      45           No     2019-04-17
4        9237  Female              0      No         No       2          Yes     2019-07-19

 

This code doesn’t change the output but it will interpret the ‘Activation_date’ as DD/MM.

 

Removing unwanted spaces

The ‘skipinitialspace’ parameter, when set to True, will skip spaces after the delimiter.

df = pd.read_csv('telecom_data.csv', skipinitialspace=True)

 

Reading zipped CSV files

The ‘compression’ parameter allows you to read compressed CSV files directly. It can be set to ‘gzip’, ‘zip’, ‘bz2’, ‘xz’, or ‘infer’. If set to ‘infer’, pandas will guess the compression based on the file extension.

df = pd.read_csv('telecom_data.csv.zip', compression='zip')
print(df.head())

This will read the compressed CSV file and return the first five rows. It behaves exactly like reading an uncompressed file, so the output will be identical to the one from the pd.read_csv('telecom_data.csv') line.

 

Reading CSV files with different encoding

If your CSV file is in a different encoding, you can specify the encoding with the ‘encoding’ parameter.

df = pd.read_csv('telecom_data.csv', encoding='Windows-1256')

This will read the CSV file with ‘Windows-1256’ encoding.

 

Reading Large CSV Files

If your CSV file is too large to fit into memory, you can read the file in chunks.

chunks = pd.read_csv('telecom_data.csv', chunksize=10000)
for chunk in chunks:
    print(chunk.head())

This code reads the data in chunks of 10000 rows at a time and prints the first 5 rows of each chunk.

 

Handling blank lines

The ‘skip_blank_lines’ parameter specifies whether or not blank lines should be ignored. The read_csv function sets this parameter to True by default. If False, then treat them as NaN.

Consider the following sample data:

CustomerID,Gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService

7590,Male,0,No,No,1,Yes

5575,Male,0,No,No,34,No

3668,Female,0,No,No,2,Yes

7795,Female,0,No,No,45,No

9237,Female,0,No,No,2,Yes

Let’s see what we will get if we read this data:

df = pd.read_csv('telecom_data.csv')

Output:

   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0        7590    Male              0      No         No       1          Yes
1        5575    Male              0      No         No      34           No
2        3668  Female              0      No         No       2          Yes
3        7795  Female              0      No         No      45           No
4        9237  Female              0      No         No       2          Yes

And here what you’ll get if you set it to True:

df = pd.read_csv('telecom_data.csv', skip_blank_lines=False)

Output:

   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0         NaN     NaN            NaN     NaN        NaN     NaN          NaN
1      7590.0    Male            0.0      No         No     1.0          Yes
2         NaN     NaN            NaN     NaN        NaN     NaN          NaN
3      5575.0    Male            0.0      No         No    34.0           No
4         NaN     NaN            NaN     NaN        NaN     NaN          NaN
5      3668.0  Female            0.0      No         No     2.0          Yes
6         NaN     NaN            NaN     NaN        NaN     NaN          NaN
7      7795.0  Female            0.0      No         No    45.0           No
8         NaN     NaN            NaN     NaN        NaN     NaN          NaN
9      9237.0  Female            0.0      No         No     2.0          Yes

 

Handle NaN values

You can use ‘na_values’ to specify additional strings to recognize as NaN.

By default, the following values are interpreted as NaN: ‘#N/A’, ‘NULL’, ‘null’, ‘N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘’, ‘NA’, ‘NaN’, ‘n/a’, ‘nan’.

df = pd.read_csv('telecom_data.csv', na_values=['No Data', 'Missing'], keep_default_na=False, na_filter=True)

This code will also consider ‘No Data’ and ‘Missing’ as NaN in addition to the default NaN values.

 

Memories with read_csv

I remember one particularly challenging project during my work as a Python developer for a major telecom company in Egypt.

We were working on a comprehensive project that involved the analysis of massive customer data. We had to handle a multitude of datasets stored in CSV format, each file exceeding the gigabyte mark.

We were not just dealing with large files, but extremely large ones – think terabytes.

Initially, we tried to load these enormous CSV files using the traditional pd.read_csv() method, but we quickly hit a roadblock.

Our machines, even with high-end specifications, were unable to load such large datasets into memory. This bottleneck threatened to halt our project progress, and it was evident that we needed a smarter solution.

That’s when I digged deeper into the Pandas library and discovered the chunksize parameter in the pd.read_csv() function.

The chunksize parameter allows pandas to read a file chunk by chunk, rather than trying to load the entire file at once.

With chunksize, we were able to read the data in manageable chunks – we found the sweet spot around 10,000 rows per chunk – effectively bypassing the memory issue.

Our Python scripts, once crashing due to the sheer volume of data, were now running smoothly, processing these gigabyte-sized files without a hitch.

However, another obstacle emerged soon after. Our customer data had date fields in the ‘DD-MM-YYYY’ format, which is common in Egypt, but it’s the opposite of the ‘MM-DD-YYYY’ format typically used in the US.

When pandas read our CSV files, it incorrectly parsed the dates, swapping the days and months. Consequently, we ended up with data stating customers had subscribed to services on the 13th month of the year!

Fortunately, the read_csv function came to our rescue yet again. It includes a dayfirst parameter, which when set to True, signals pandas to interpret the date in the ‘DD-MM-YYYY’ format. Using this parameter, we were able to accurately parse the dates.

It was a straightforward solution, but it made a huge difference to our project.

We went from having nearly 13% of our date data incorrectly parsed to achieving near-perfect accuracy.

And another obstacle was when I was working on another CSV file from our telecom database that contained customers’ feedback.

When I tried to read this file using the standard pd.read_csv() function, I was met with the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 12: invalid continuation byte

This error occurs because the default encoding used by pandas’ read_csv function is ‘utf-8’, but some of the CSV files aren’t.

Here’s a simplified sample data to illustrate the problem:

id,feedback
1,"السرعة بطيئة جدا"
2,"التغطية ضعيفة في منطقتي"

Just some negative feedback if you want a translation 🙂

To solve this issue, I had to specify the correct encoding of the file.

df = pd.read_csv('customer_feedback.csv', encoding='Windows-1256')
print(df)

Problem solved!

So, by correctly specifying the encoding, we can read CSV files that aren’t UTF-8.

 

F.A.Q

Q: Can I load the first column of a CSV file as the index in Python using pandas.read_csv?

A: Yes, using the “index_col” parameter in the pandas.read_csv function, you can specify the first column to be used as the index of the resulting DataFrame.

Just set “index_col=0” while calling the function, and it will load the first column as the index.

Q: How do I read a CSV file without headers using pandas.read_csv in Python?

A: To read a CSV file without headers using pandas.read_csv, you can set the “header” parameter to None.

df = pd.read_csv('telecom_data.csv', header=None)

This will make the function read the data without considering the first row as column names and default column names will be generated as integers starting from 0.

Q: Is it possible to specify custom column names when reading a CSV file using pandas.read_csv?

A: Yes, you can specify custom column names while reading a CSV file using the pandas.read_csv function.

To do this, use the “names” parameter and pass a list of strings representing the column names.

This will override the column names present in the CSV file or generate new names for a file without headers.

Q: Can I read a CSV file from a URL using the pandas.read_csv function in Python?

A: Yes, you can read a CSV file from a URL directly using the pandas.read_csv function.

import pandas as pd
url = "https://websiteurl/csv_file.csv"
df = pd.read_csv(url)

Instead of providing a local file name as the input, simply pass the URL as a string to the function, and it will load the data from the CSV file found at that URL into a DataFrame.

Q: How do I handle missing values while reading a CSV file using pandas.read_csv in Python?

A: You can handle missing values while reading a CSV file using the pandas.read_csv function by specifying the “na_values” parameter.

Pass a list of values or a dictionary with specific per-column na values that should be interpreted as NaN. The function will replace these values with NaN when loading the data into a DataFrame.

Q: What is the default behavior of the pandas.read_csv function in Python when column names are not specified?

A: The default behavior of the pandas.read_csv function when column names are not specified is to infer the column names from the first row of the file if a header is present.

If the file does not have a header, then default column names will be generated as integers starting from 0.

Q: Can I read specific rows of a CSV file using Python and pandas.read_csv?

A: Yes, you can read specific rows of a CSV file using the pandas.read_csv function by using the “skiprows” or “nrows” parameters.

The “skiprows” parameter takes a list of rows to be skipped while reading the file, whereas the “nrows” parameter specifies the number of rows to be read starting from the first data row after the header.


Leave a Reply

Your email address will not be published. Required fields are marked *