Read CSV using Python Pandas read_csv
In this tutorial, we will focus on the read_csv function, which allows us to read CSV files and convert them into a Pandas DataFrame.
The read_csv function is more than just a CSV file loading tool. It is equipped with a multitude of parameters to handle a wide range of scenarios that might occur in a CSV file.
- 1 What is Pandas read_csv?
- 2 Specifying the Data Source
- 3 Customizing Delimiters
- 4 Selecting Data Headers
- 5 Setting the DataFrame Index
- 6 Skipping Rows
- 7 Limiting Rows to Read
- 8 Get specific Columns
- 9 Providing Column Names
- 10 Handling CSV files with quotes
- 11 Controlling Data Types
- 12 Handle Boolean data
- 13 Parsing Date Columns
- 14 Handling non-US date formats
- 15 Removing unwanted spaces
- 16 Reading zipped CSV files
- 17 Reading CSV files with different encoding
- 18 Reading Large CSV Files
- 19 Handling blank lines
- 20 Handle NaN values
- 21 Memories with read_csv
- 22 F.A.Q
What is Pandas read_csv?
Whether it’s skipping unnecessary rows, getting specific columns, defining column data types, parsing date columns, handling blank lines, or even dealing with null values, read_csv has got you covered.
You’ll learn how to read different CSV files in different ways, so let’s get started.
Specifying the Data Source
The first argument for pandas read_csv is a string path to the CSV file we want to read. This string could be a URL, a file path, or any valid string name representing the source of the data.
Let’s load some telecom company data from a CSV file.
import pandas as pd

df = pd.read_csv('telecom_data.csv')
print(df.head())
Output:
   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0        7590    Male              0      No         No       1          Yes
1        5575    Male              0      No         No      34           No
2        3668  Female              0      No         No       2          Yes
3        7795  Female              0      No         No      45           No
4        9237  Female              0      No         No       2          Yes
Here, we use the read_csv function of pandas to load the data from our CSV file, and then we print the first 5 rows using the .head() function.
Customizing Delimiters
Sometimes, data may not be separated by commas (the default separator) in the CSV files. The ‘sep’ parameter allows you to specify the delimiter that separates the data.
I converted the sample file to a tab-separated version to demonstrate how to handle files with different separators. It looks like this:
CustomerID  Gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService
7590        Male    0              No       No          1       Yes
5575        Male    0              No       No          34      No
3668        Female  0              No       No          2       Yes
7795        Female  0              No       No          45      No
9237        Female  0              No       No          2       Yes
Now, let’s read it using read_csv:
df = pd.read_csv('telecom_data.tsv', sep='\t')
print(df.head())
Output:
   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0        7590    Male              0      No         No       1          Yes
1        5575    Male              0      No         No      34           No
2        3668  Female              0      No         No       2          Yes
3        7795  Female              0      No         No      45           No
4        9237  Female              0      No         No       2          Yes
In this case, ‘\t’ (tab) is used as the delimiter. If the delimiter is not specified correctly, pandas will interpret the entire row as a single column.
Selecting Data Headers
By default, pandas infers the column names from the first line of the file. But sometimes you might want to use a different row as your headers. You can achieve this by setting the ‘header’ parameter.
df = pd.read_csv('telecom_data.csv', header=1)
print(df.head())
Output:
   7590    Male  0   No  No   1  Yes
0  5575    Male  0   No  No  34   No
1  3668  Female  0   No  No   2  Yes
2  7795  Female  0   No  No  45   No
3  9237  Female  0   No  No   2  Yes
4  9305    Male  0  Yes  No   8  Yes
Now, pandas will infer the column names from the second line (indexing starts from 0), not the first line.
For files with multi-index column names, you can pass a list to the header parameter to specify the rows to use.
from io import StringIO

data = StringIO("Header1,Header2,Header3\nSubHeader1,SubHeader2,SubHeader3\nA,B,C\n1,2,3")
df = pd.read_csv(data, header=[0,1])
print(df)
Output:
     Header1    Header2    Header3
  SubHeader1 SubHeader2 SubHeader3
0          A          B          C
1          1          2          3
Note: if your file has no header row and the first line is data, set header=None; otherwise, that first line will be treated as column names.
Setting the DataFrame Index
The ‘index_col’ parameter specifies the column to use as the row labels of the DataFrame. It can be given either as a string name or as a column index.
df = pd.read_csv('telecom_data_2.csv', index_col='customer_id')
print(df.head())
Output:
             gender  age  monthly_charges churn
customer_id
1              Male   34             56.2    No
2            Female   45             74.6   Yes
3            Female   23             66.7    No
4              Male   51             86.4   Yes
5            Female   30             54.6    No
Here, the ‘customer_id’ column is used as the index of the DataFrame.
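Since ‘index_col’ also accepts a position, you could get the same result by passing the column’s index instead of its name (a minimal sketch, assuming ‘customer_id’ is the first column in the file):

# Use the first column (position 0) as the DataFrame index
df = pd.read_csv('telecom_data_2.csv', index_col=0)
print(df.head())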
Skipping Rows
The ‘skiprows’ parameter allows you to skip certain rows. It can be a list of integers specifying the row numbers to skip, or a callable that is evaluated against each row index and returns True if the row should be skipped.
df = pd.read_csv('telecom_data_2.csv', skiprows=[1,2,3])
print(df.head())
Output:
   customer_id  gender  age  monthly_charges churn
0            4    Male   51             86.4   Yes
1            5  Female   30             54.6    No
2            6    Male   45             78.5   Yes
3            7  Female   25             56.6    No
4            8    Male   35             87.6   Yes
In this case, rows 1, 2, and 3 are skipped; since the header occupies row 0, these are the first three data rows of the file.
Conditional Skip
You can also pass a callable to skip rows whose index matches some criterion:
df = pd.read_csv('telecom_data.csv', skiprows=lambda i: i % 2 != 0)
print(df)
Output:
   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0        5575    Male              0      No         No      34           No
1        7795  Female              0      No         No      45           No
As you can see, the DataFrame now contains only the rows from even-numbered lines, because the callable skips every odd line index (the header at line 0 is kept).
All this happens without keeping the skipped rows in memory.
Note that the callable receives only the row index, not the row values.
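If you need to drop rows based on their values rather than their position, one simple approach is to read the file first and then filter the DataFrame. A minimal sketch, using the tenure column from the earlier samples:

# Read everything, then keep only customers with more than 12 months of tenure
df = pd.read_csv('telecom_data.csv')
df = df[df['tenure'] > 12]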
Limiting Rows to Read
You can use the ‘nrows’ parameter to specify how many rows to read from the start of the data.
df = pd.read_csv('telecom_data_2.csv', nrows=10)
print(df)
Output:
   customer_id  gender  age  monthly_charges churn
0            1    Male   34             56.2    No
1            2  Female   45             74.6   Yes
2            3  Female   23             66.7    No
3            4    Male   51             86.4   Yes
4            5  Female   30             54.6    No
5            6    Male   45             78.5   Yes
6            7  Female   25             56.6    No
7            8    Male   35             87.6   Yes
8            9  Female   28             54.8    No
9           10    Male   50             82.4   Yes
Here, we are reading only the first 10 rows from the file.
Get specific Columns
The ‘usecols’ parameter lets you load only a subset of columns from the CSV file. It can be a sequence of strings (column names) or integers (column indices).
df = pd.read_csv('telecom_data_2.csv', usecols=['customer_id', 'age'])
print(df.head())
Output:
   customer_id  age
0            1   34
1            2   45
2            3   23
3            4   51
4            5   30
Here, we’re only loading the ‘customer_id’ and ‘age’ columns from the CSV file.
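Because usecols also accepts integers, the same columns can be selected by position (a quick sketch, assuming ‘customer_id’ and ‘age’ are the first and third columns of the file):

# Select columns 0 and 2 by position instead of by name
df = pd.read_csv('telecom_data_2.csv', usecols=[0, 2])
print(df.head())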
Providing Column Names
If the CSV file does not contain headers, or you want to replace the existing headers, you can provide a list of column names using the ‘names’ parameter.
df = pd.read_csv('telecom_data_2.csv', names=['ID', 'Gender', 'Age', 'Charges', 'Churn'])
print(df.head())
Here, we pass new column names: ‘ID’, ‘Gender’, ‘Age’, ‘Charges’, and ‘Churn’. If the file already has a header row, also pass header=0 so that the old header line is not read in as a data row.
Handling CSV files with quotes
Occasionally, CSV files have fields enclosed in quotes. By default, pandas treats the double quote as the quote character. If a file uses a different character, you can set it with the ‘quotechar’ parameter.
df = pd.read_csv('telecom_data.csv', quotechar="'")
This doesn’t change the output here, but it allows the file to load correctly when fields are enclosed in single quotes.
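To see it in action, here is a self-contained sketch with made-up in-memory data, where a single-quoted field containing a comma stays in one column:

from io import StringIO

# The comma inside the quoted field is not treated as a separator
data = StringIO("id,comment\n1,'good service, fast'\n2,'weak coverage'")
df = pd.read_csv(data, quotechar="'")
print(df)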
Controlling Data Types
The ‘dtype’ parameter allows you to specify data types for different columns. It takes a dictionary whose keys are column labels (or indices) and whose values are data types (NumPy dtypes or their string names).
df = pd.read_csv('telecom_data_2.csv', dtype={'customer_id': 'int64', 'age': 'float64'})
print(df.dtypes)
Output:
customer_id          int64
gender              object
age                float64
monthly_charges    float64
churn               object
dtype: object
Here, the ‘customer_id’ column is forced to have integer (int64) dtype, while the ‘age’ column is specified to have a floating point (float64) dtype.
Note: specifying data types plays an important role in performance; it can speed up loading large datasets and reduce memory usage.
Handle Boolean data
If your data contains Boolean values represented as words (like ‘Yes’/‘No’ or ‘Y’/‘N’), you can map them to Boolean values using the ‘true_values’ and ‘false_values’ parameters.
df = pd.read_csv('telecom_data.csv', true_values=['Yes'], false_values=['No'])
print(df.head())
Output:
   CustomerID  Gender  SeniorCitizen  Partner  Dependents  tenure  PhoneService
0        7590    Male              0    False       False       1          True
1        5575    Male              0    False       False      34         False
2        3668  Female              0    False       False       2          True
3        7795  Female              0    False       False      45         False
4        9237  Female              0    False       False       2          True
This code will interpret ‘Yes’ and ‘No’ as Boolean values (True and False) in the entire DataFrame.
Parsing Date Columns
You can use the ‘parse_dates’ parameter to specify the columns that should be parsed as dates.
df = pd.read_csv('telecom_data.csv', parse_dates=['start_date'])
print(df.dtypes)
Output:
customer_id                 int64
gender                     object
age                         int64
monthly_charges           float64
churn                      object
start_date         datetime64[ns]
dtype: object
Here, the ‘start_date’ column is parsed as a datetime object.
Handling non-US date formats
The ‘dayfirst’ parameter is used to tell pandas to interpret the date as DD/MM (True) or MM/DD (False). By default, it’s set to False.
df = pd.read_csv('telecom_data.csv', parse_dates=['ActivationDate'], dayfirst=True)
print(df.head())
Output:
   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService ActivationDate
0        7590    Male              0      No         No       1          Yes     2019-07-12
1        5575    Male              0      No         No      34           No     2019-03-11
2        3668  Female              0      No         No       2          Yes     2019-11-20
3        7795  Female              0      No         No      45           No     2019-04-17
4        9237  Female              0      No         No       2          Yes     2019-07-19
With dayfirst=True, pandas interprets the ‘ActivationDate’ values as day-first: for example, 12/07/2019 is parsed as 12 July 2019 rather than 7 December 2019.
Removing unwanted spaces
The ‘skipinitialspace’ parameter, when set to True, will skip spaces after the delimiter.
df = pd.read_csv('telecom_data.csv', skipinitialspace=True)
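To see the effect, here is a self-contained sketch with made-up in-memory data where a space follows each comma:

from io import StringIO

# Without skipinitialspace=True, every value would keep its leading space
data = StringIO("id, gender, age\n1, Male, 34\n2, Female, 45")
df = pd.read_csv(data, skipinitialspace=True)
print(df.columns.tolist())  # ['id', 'gender', 'age']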
Reading zipped CSV files
The ‘compression’ parameter allows you to read compressed CSV files directly. It can be set to ‘gzip’, ‘zip’, ‘bz2’, ‘xz’, or ‘infer’. If set to ‘infer’ (the default), pandas will guess the compression based on the file extension.
df = pd.read_csv('telecom_data.csv.zip', compression='zip')
print(df.head())
This will read the compressed CSV file and return the first five rows. It behaves exactly like reading an uncompressed file, so the output will be identical to the one from the pd.read_csv('telecom_data.csv') line.
Reading CSV files with different encoding
If your CSV file is in a different encoding, you can specify the encoding with the ‘encoding’ parameter.
df = pd.read_csv('telecom_data.csv', encoding='Windows-1256')
This will read the CSV file with ‘Windows-1256’ encoding.
Reading Large CSV Files
If your CSV file is too large to fit into memory, you can read the file in chunks.
chunks = pd.read_csv('telecom_data.csv', chunksize=10000)

for chunk in chunks:
    print(chunk.head())
This code reads the data in chunks of 10000 rows at a time and prints the first 5 rows of each chunk.
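In practice, you usually process each chunk and keep only what you need. Here is a minimal sketch that filters every chunk and combines the matching rows, using the PhoneService column from the earlier samples:

# Only the filtered rows are accumulated, never the whole file at once
chunks = pd.read_csv('telecom_data.csv', chunksize=10000)
with_phone = pd.concat(chunk[chunk['PhoneService'] == 'Yes'] for chunk in chunks)
print(len(with_phone))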
Handling blank lines
The ‘skip_blank_lines’ parameter specifies whether blank lines should be ignored. It is True by default; if set to False, blank lines are read in as rows of NaN values.
Consider the following sample data, which contains a blank line before each record:
CustomerID,Gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService

7590,Male,0,No,No,1,Yes

5575,Male,0,No,No,34,No

3668,Female,0,No,No,2,Yes

7795,Female,0,No,No,45,No

9237,Female,0,No,No,2,Yes
Let’s see what we get if we read this data with the default settings:
df = pd.read_csv('telecom_data.csv')
print(df)
Output:
   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0        7590    Male              0      No         No       1          Yes
1        5575    Male              0      No         No      34           No
2        3668  Female              0      No         No       2          Yes
3        7795  Female              0      No         No      45           No
4        9237  Female              0      No         No       2          Yes
And here is what you’ll get if you set it to False:
df = pd.read_csv('telecom_data.csv', skip_blank_lines=False)
print(df)
Output:
   CustomerID  Gender  SeniorCitizen Partner Dependents  tenure PhoneService
0         NaN     NaN            NaN     NaN        NaN     NaN          NaN
1      7590.0    Male            0.0      No         No     1.0          Yes
2         NaN     NaN            NaN     NaN        NaN     NaN          NaN
3      5575.0    Male            0.0      No         No    34.0           No
4         NaN     NaN            NaN     NaN        NaN     NaN          NaN
5      3668.0  Female            0.0      No         No     2.0          Yes
6         NaN     NaN            NaN     NaN        NaN     NaN          NaN
7      7795.0  Female            0.0      No         No    45.0           No
8         NaN     NaN            NaN     NaN        NaN     NaN          NaN
9      9237.0  Female            0.0      No         No     2.0          Yes
Handle NaN values
You can use ‘na_values’ to specify additional strings to recognize as NaN.
By default, the following values are interpreted as NaN: ‘#N/A’, ‘NULL’, ‘null’, ‘N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘’, ‘NA’, ‘NaN’, ‘n/a’, ‘nan’.
df = pd.read_csv('telecom_data.csv', na_values=['No Data', 'Missing'])
This code will also consider ‘No Data’ and ‘Missing’ as NaN in addition to the default NaN values.
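The ‘na_values’ parameter also accepts a dictionary, so you can restrict the extra NaN markers to specific columns. A minimal sketch, using the column names from the telecom_data_2.csv samples above:

# 'Missing' counts as NaN only in churn; 'No Data' only in age
df = pd.read_csv('telecom_data_2.csv',
                 na_values={'churn': ['Missing'], 'age': ['No Data']})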
Memories with read_csv
I remember one particularly challenging project during my work as a Python developer for a major telecom company in Egypt.
We were working on a comprehensive project that involved the analysis of massive customer data. We had to handle a multitude of datasets stored in CSV format, each file exceeding the gigabyte mark.
We were not just dealing with large files, but extremely large ones – think terabytes.
Initially, we tried to load these enormous CSV files using the traditional pd.read_csv() method, but we quickly hit a roadblock.
Our machines, even with high-end specifications, were unable to load such large datasets into memory. This bottleneck threatened to halt our project progress, and it was evident that we needed a smarter solution.
That’s when I dug deeper into the Pandas library and discovered the chunksize parameter in the pd.read_csv() function.
The chunksize parameter allows pandas to read a file chunk by chunk, rather than trying to load the entire file at once.
With chunksize, we were able to read the data in manageable chunks – we found the sweet spot around 10,000 rows per chunk – effectively bypassing the memory issue.
Our Python scripts, once crashing due to the sheer volume of data, were now running smoothly, processing these gigabyte-sized files without a hitch.
However, another obstacle emerged soon after. Our customer data had date fields in the ‘DD-MM-YYYY’ format, which is common in Egypt, but it’s the opposite of the ‘MM-DD-YYYY’ format typically used in the US.
When pandas read our CSV files, it incorrectly parsed the dates, swapping the days and months. Consequently, we ended up with data stating customers had subscribed to services on the 13th month of the year!
Fortunately, the read_csv function came to our rescue yet again. It includes a dayfirst parameter which, when set to True, signals pandas to interpret the date in the ‘DD-MM-YYYY’ format. Using this parameter, we were able to accurately parse the dates.
It was a straightforward solution, but it made a huge difference to our project.
We went from having nearly 13% of our date data incorrectly parsed to achieving near-perfect accuracy.
Another obstacle came up when I was working on a CSV file from our telecom database that contained customers’ feedback.
When I tried to read this file using the standard pd.read_csv() function, I was met with the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 12: invalid continuation byte
This error occurs because the default encoding used by pandas’ read_csv function is ‘utf-8’, but some CSV files aren’t encoded in UTF-8.
Here’s a simplified sample data to illustrate the problem:
id,feedback
1,"السرعة بطيئة جدا"
2,"التغطية ضعيفة في منطقتي"
Just some negative feedback, in case you want a translation 🙂 (roughly: “The speed is very slow” and “The coverage is weak in my area”).
To solve this issue, I had to specify the correct encoding of the file.
df = pd.read_csv('customer_feedback.csv', encoding='Windows-1256')
print(df)
Problem solved!
So, by correctly specifying the encoding, we can read CSV files that aren’t UTF-8.
F.A.Q
Q: Can I load the first column of a CSV file as the index in Python using pandas.read_csv?
A: Yes, using the “index_col” parameter in the pandas.read_csv function, you can specify the first column to be used as the index of the resulting DataFrame.
Just set “index_col=0” while calling the function, and it will load the first column as the index.
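For example:

df = pd.read_csv('telecom_data.csv', index_col=0)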
Q: How do I read a CSV file without headers using pandas.read_csv in Python?
A: To read a CSV file without headers using pandas.read_csv, you can set the “header” parameter to None.
df = pd.read_csv('telecom_data.csv', header=None)
This will make the function read the data without considering the first row as column names and default column names will be generated as integers starting from 0.
Q: Is it possible to specify custom column names when reading a CSV file using pandas.read_csv?
A: Yes, you can specify custom column names while reading a CSV file using the pandas.read_csv function.
To do this, use the “names” parameter and pass a list of strings representing the column names.
This will override the column names present in the CSV file or generate new names for a file without headers.
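For example, assuming the file has an existing header row (header=0 tells pandas to discard it instead of reading it as data):

df = pd.read_csv('telecom_data_2.csv', header=0, names=['ID', 'Gender', 'Age', 'Charges', 'Churn'])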
Q: Can I read a CSV file from a URL using the pandas.read_csv function in Python?
A: Yes, you can read a CSV file from a URL directly using the pandas.read_csv function.
import pandas as pd

url = "https://websiteurl/csv_file.csv"
df = pd.read_csv(url)
Instead of providing a local file name as the input, simply pass the URL as a string to the function, and it will load the data from the CSV file found at that URL into a DataFrame.
Q: How do I handle missing values while reading a CSV file using pandas.read_csv in Python?
A: You can handle missing values while reading a CSV file using the pandas.read_csv function by specifying the “na_values” parameter.
Pass a list of values or a dictionary with specific per-column na values that should be interpreted as NaN. The function will replace these values with NaN when loading the data into a DataFrame.
Q: What is the default behavior of the pandas.read_csv function in Python when column names are not specified?
A: The default behavior of the pandas.read_csv function when column names are not specified is to infer the column names from the first row of the file if a header is present.
If the file does not have a header, then default column names will be generated as integers starting from 0.
Q: Can I read specific rows of a CSV file using Python and pandas.read_csv?
A: Yes, you can read specific rows of a CSV file using the pandas.read_csv function by using the “skiprows” or “nrows” parameters.
The “skiprows” parameter takes a list of rows to be skipped while reading the file, whereas the “nrows” parameter specifies the number of rows to be read starting from the first data row after the header.
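Combining the two, you can read a slice from the middle of a file. A minimal sketch that keeps the header and reads data rows 101 through 110:

# Skip data rows 1-100 (line 0 is the header), then read the next 10 rows
df = pd.read_csv('telecom_data.csv', skiprows=range(1, 101), nrows=10)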