Read JSON from URL using Pandas read_json & requests
In this tutorial, you’ll learn how to use Pandas to read JSON directly from a URL into a DataFrame.
You’ll learn how to make GET and POST requests to retrieve JSON data, handle authentication and redirects, deal with paginated responses, and more.
Get JSON from URL using read_json
To retrieve JSON data from a URL, you can use Pandas' `read_json` method.
This function simplifies the process of converting JSON content directly into a Pandas DataFrame.
Here’s how you can use `pandas.read_json` to load data from a URL:
```python
import pandas as pd

json_url = 'http://example.com/data.json'
df = pd.read_json(json_url)
```
Output:
```
   user_id  call_duration  data_consumption  bill_amount
0    12345            300              2.50         35.0
1    23456            220              1.75         28.0
2    34567            450              3.00         45.0
```
The output is a DataFrame with columns like `user_id`, `call_duration`, `data_consumption`, and `bill_amount`.
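Since the URL above is only a placeholder, here is a way to see the same parsing behavior locally: `read_json` accepts any file-like object, so this sketch feeds it an in-memory JSON string shaped like the response above.

```python
import io
import pandas as pd

# The example URL is hypothetical, so we parse an in-memory JSON string
# shaped like that response instead of fetching a live URL
raw = '''[
    {"user_id": 12345, "call_duration": 300, "data_consumption": 2.50, "bill_amount": 35.0},
    {"user_id": 23456, "call_duration": 220, "data_consumption": 1.75, "bill_amount": 28.0}
]'''
df = pd.read_json(io.StringIO(raw))
print(df.shape)  # (2, 4)
```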
Get JSON Using GET Request
The previous example was direct, but JSON responses are not always available at a direct URL.
Before using `pandas.read_json`, you can send an HTTP GET request to the URL.
This is useful when you need to handle the nuances of web requests, such as authentication, headers, or managing response status codes.
The `requests` library in Python is used for making HTTP requests.
Here is how you can make a GET request to retrieve JSON data:
```python
import io
import requests
import pandas as pd

json_url = 'http://example.com/data.json'
response = requests.get(json_url)

# Check if the request was successful
if response.status_code == 200:
    # Wrap the response text in a file-like object for read_json
    df = pd.read_json(io.StringIO(response.text))
else:
    print(f"Failed to retrieve data: {response.status_code}")
```
If there was an issue with the request, you would see an error message with the corresponding HTTP status code.
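As an alternative to checking `status_code` by hand, `requests` provides `raise_for_status()`, which raises an `HTTPError` for 4xx/5xx responses. A minimal sketch, using a hand-built `Response` object so no network call is needed:

```python
import requests

# raise_for_status() turns 4xx/5xx status codes into exceptions;
# here we simulate a failed request with a hand-built Response
response = requests.Response()
response.status_code = 404

try:
    response.raise_for_status()
    print("Request succeeded")
except requests.HTTPError as exc:
    print(f"Request failed: {exc}")
```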
Get JSON From POST Request
The `requests` library also allows you to make POST requests.
Here’s how to make a POST request and read the JSON response:
```python
import io
import requests
import pandas as pd

json_url = 'http://example.com/data.json'
post_data = {
    'apikey': 'YOUR_API_KEY',
    'other_param': 'value'
}

response = requests.post(json_url, data=post_data)

if response.status_code == 200:
    df = pd.read_json(io.StringIO(response.text))
else:
    print(f"Failed to retrieve data: {response.status_code}")
```
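Note that `data=` sends the payload as form-encoded fields. Many JSON APIs expect a JSON request body instead, which `requests` supports via the `json=` argument: it serializes the dict and sets the `Content-Type` header for you. A sketch using a prepared request (nothing is actually sent; the URL and payload are placeholders):

```python
import requests

# json= serializes the dict to a JSON body and sets the Content-Type header;
# preparing the request lets us inspect it without making a network call
payload = {'apikey': 'YOUR_API_KEY', 'other_param': 'value'}
prepared = requests.Request('POST', 'http://example.com/data.json', json=payload).prepare()

print(prepared.headers['Content-Type'])  # application/json
print(prepared.body)
```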
Handle Redirects
When making HTTP requests, it’s possible to encounter redirects.
The `requests` library handles redirects by default for GET requests.
Here’s how you can handle redirects:
```python
import io
import requests
import pandas as pd

json_url = 'http://example.com/data.json'

# allow_redirects=True follows redirects (the default behavior for GET)
response = requests.get(json_url, allow_redirects=True)
final_url = response.url

# response.history is non-empty if the request was redirected
if response.history:
    print(f"Request was redirected from {json_url} to {final_url}")

if response.status_code == 200:
    # Parse the content already fetched instead of requesting final_url again
    df = pd.read_json(io.StringIO(response.text))
else:
    print(f"Failed to retrieve data: {response.status_code}")
```
The `response.history` list contains the response objects that were created in order to complete the request; it is empty if no redirection occurred. In the code, if a redirect happens, the script reports the initial and final URLs.
Handle Authentication
Accessing secured data often requires authentication, which involves sending credentials or tokens as part of the HTTP request.
With the `requests` library, you can pass the necessary headers or query parameters with your GET request to authenticate and retrieve the data.
Here is how to handle authentication with a GET request:
```python
import io
import requests
import pandas as pd

json_url = 'http://example.com/secure_data.json'

# Authentication via a header
headers = {
    'Authorization': 'Bearer YOUR_ACCESS_TOKEN'
}

# Or via a query parameter, depending on the API
params = {
    'access_token': 'YOUR_ACCESS_TOKEN'
}

response = requests.get(json_url, headers=headers, params=params)

if response.status_code == 200:
    df = pd.read_json(io.StringIO(response.text))
else:
    print(f"Failed to retrieve data: {response.status_code}")
```
In the code, the `headers` dictionary passes the Authorization token as a Bearer token (a common method for API authentication). Alternatively, the `params` dictionary shows how the token could be passed as a query parameter, depending on the API's requirements; in practice you would send the token one way or the other, not both.
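Some APIs use HTTP Basic authentication rather than bearer tokens. For that, `requests` can build the Authorization header for you via the `auth=` argument. A sketch with placeholder credentials, using a prepared request so nothing is actually sent:

```python
import requests
from requests.auth import HTTPBasicAuth

# auth= encodes the credentials into a Basic Authorization header;
# the URL, username, and password here are placeholders
prepared = requests.Request(
    'GET', 'http://example.com/secure_data.json',
    auth=HTTPBasicAuth('user', 'secret'),
).prepare()

print(prepared.headers['Authorization'])
```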
Handling Paginated JSON Responses
Dealing with paginated JSON responses is a common scenario when working with APIs.
Pagination involves splitting the data into discrete chunks, which can be accessed page by page.
Here’s how to handle pagination with your requests:
```python
import io
import requests
import pandas as pd

json_url = 'http://example.com/data.json'

frames = []
page = 1
per_page = 100
has_more_pages = True

while has_more_pages:
    # Update the query parameters for pagination
    params = {'page': page, 'per_page': per_page}
    response = requests.get(json_url, params=params)

    if response.status_code == 200:
        current_page_df = pd.read_json(io.StringIO(response.text))
        if current_page_df.empty:
            has_more_pages = False
        else:
            frames.append(current_page_df)
            page += 1
    else:
        print(f"Failed to retrieve data: {response.status_code}")
        break

# DataFrame.append was removed in pandas 2.0; collect pages and concat once
df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(df.head())
```
The output of this code would display the first 5 rows of the concatenated DataFrame, which contains data from all the pages you’ve looped through.
The loop continues until a page returns no data, indicating that there are no more pages to fetch.
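The loop above needs a live API to run, so here is the same page-until-empty control flow with a stubbed fetch function standing in for `requests.get` (the page data is made up):

```python
import pandas as pd

# Hypothetical API responses: two pages of records, then an empty page
PAGES = [[{'user_id': 1}], [{'user_id': 2}], []]

def fetch_page(page):
    # Stub that plays the role of requests.get(...).json()
    return PAGES[page - 1]

frames = []
page = 1
while True:
    records = fetch_page(page)
    if not records:
        break  # an empty page means there is nothing more to fetch
    frames.append(pd.DataFrame(records))
    page += 1

df = pd.concat(frames, ignore_index=True)
print(len(df))  # 2
```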
Streaming Large JSON Data
When working with very large datasets, it’s more memory-efficient to stream the data than to load it all at once.
Let’s explore how to stream JSON data:
```python
import io
import requests
import pandas as pd

json_url = 'http://example.com/large_data.json'

# A session reuses the underlying connection across requests
session = requests.Session()

chunks = []
with session.get(json_url, stream=True) as response:
    if response.status_code == 200:
        # Stream the response line by line; assumes one JSON object per line
        for line in response.iter_lines():
            if line:
                chunks.append(pd.read_json(io.StringIO(line.decode()), lines=True))
    else:
        print(f"Failed to retrieve data: {response.status_code}")

df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
print(df.shape)
```
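If the endpoint serves newline-delimited JSON (NDJSON), pandas can also do the chunking itself with `read_json(lines=True, chunksize=...)`. A sketch on an in-memory buffer standing in for the response stream:

```python
import io
import pandas as pd

# An in-memory NDJSON buffer standing in for a streamed response body
ndjson = '{"user_id": 1}\n{"user_id": 2}\n{"user_id": 3}\n'

# chunksize makes read_json return an iterator of DataFrames
reader = pd.read_json(io.StringIO(ndjson), lines=True, chunksize=2)
df = pd.concat(reader, ignore_index=True)
print(df.shape)  # (3, 1)
```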
Rate Limiting
When interacting with APIs, it’s common to encounter rate limits that cap the number of requests you can make in a given period.
Here’s how to deal with rate limiting:
```python
import io
import time
import requests
import pandas as pd

json_url = 'http://example.com/data.json'

REQUEST_LIMIT_PER_MINUTE = 30
delay = 60 / REQUEST_LIMIT_PER_MINUTE

frames = []
requests_made = 0

# Timestamp of when the last request was made
last_request_time = time.time()

# Make multiple requests in a loop (assuming you need to make 100)
for i in range(100):
    # Calculate how much time passed since the last request
    time_passed = time.time() - last_request_time

    # Wait if not enough time has passed to stay under the limit
    if time_passed < delay:
        time.sleep(delay - time_passed)

    response = requests.get(json_url)
    requests_made += 1
    last_request_time = time.time()

    if response.status_code == 200:
        frames.append(pd.read_json(io.StringIO(response.text)))
    else:
        print(f"Request #{requests_made} failed with status code: {response.status_code}")
        break  # Exit the loop if a request fails

df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
print(f"Total requests made: {requests_made}")
print(df.head())
```
The `time.sleep()` function adds a delay between requests, which helps you avoid exceeding the rate limit.
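A fixed delay works when you know the limit in advance. Many APIs instead answer HTTP 429 with a `Retry-After` header telling you exactly how long to wait. A sketch of a small helper that honors it (the status codes and header values here are made up):

```python
# Honor a 429 response's Retry-After header when present, otherwise fall
# back to a fixed delay; inputs are hypothetical, not from a live API
def seconds_to_wait(status_code, headers, default_delay=2.0):
    if status_code == 429 and 'Retry-After' in headers:
        return float(headers['Retry-After'])
    return default_delay

print(seconds_to_wait(429, {'Retry-After': '5'}))  # 5.0
print(seconds_to_wait(200, {}))  # 2.0
```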
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.