Parsing XML Files into DataFrames using Pandas read_xml
The `read_xml` function in Pandas is used to read XML (eXtensible Markup Language) files and convert them into DataFrames.
Let’s dive into practical examples to understand how it works.
- 1 Pandas read_xml Syntax
- 2 Read from Various Sources
- 3 Choosing an XML parser
- 4 Specify Column Names
- 5 Handling Compressed XML Files
- 6 Advanced Storage Connection Options
- 7 Using XPath Expressions to Select Data
- 8 Working with XML Namespaces
- 9 Reading Only Element Values
- 10 Reading Only Attribute Values
- 11 Specify Column Data Types
- 12 Custom Conversion Functions for Specific Columns
- 13 Parse Specific Columns as Dates
- 14 Inferring and Enforcing dtypes
- 15 Transform XML Using XSLT Stylesheets
- 16 Stream Parsing: Load Large XML Files
- 17 Resource
Pandas read_xml Syntax
The basic syntax for `pandas.read_xml` is:
```python
pandas.read_xml(
    path_or_buffer,
    xpath="./*",
    namespaces=None,
    elems_only=False,
    attrs_only=False,
    names=None,
    encoding="utf-8",
    parser="lxml",
    stylesheet=None,
    compression="infer",
    storage_options=None,
    dtype_backend=None,
    **kwargs,
)
```
The key parameters are:
- path_or_buffer: The path to the XML file, URL, or file-like object.
- xpath: XPath expression to select specific portions of the XML to be parsed.
- namespaces: A dictionary containing XML namespaces for refining selections using XPath.
- elems_only: If `True`, only elements' text values are parsed. Defaults to `False`, in which case both elements and attributes are parsed.
- attrs_only: If `True`, only attribute values are parsed. Defaults to `False`, in which case both elements and attributes are parsed.
- names: A list of column names for the resulting DataFrame.
- encoding: Encoding type for the XML. Defaults to UTF-8.
- parser: XML parser to use (`lxml` or `etree`).
- stylesheet: Path to an XSLT stylesheet file to transform the XML data before parsing.
- compression: The compression type (‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None). If ‘infer’, the file extension is used to determine the compression type.
- storage_options: Extra options for storage connection if needed.
- dtype_backend: Backend to use for the resulting dtypes (`"numpy_nullable"` or `"pyarrow"`).
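As a quick illustration of how several of these parameters combine in one call, here's a minimal sketch (the XML snippet and the `nm`/`ag` tag names are invented, and the standard-library `etree` parser is chosen so the sketch runs without `lxml`):

```python
import pandas as pd
from io import StringIO

# Hypothetical two-row document with deliberately terse tag names
xml_data = """<data>
  <row><nm>John</nm><ag>28</ag></row>
  <row><nm>Jane</nm><ag>24</ag></row>
</data>"""

df = pd.read_xml(
    StringIO(xml_data),
    xpath=".//row",          # select the repeating element
    names=["name", "age"],   # relabel the terse nm/ag columns
    parser="etree",          # stdlib parser, no lxml required
)
print(df)
```

Here `xpath` picks the repeating `row` element and `names` relabels its columns in document order.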
Read from Various Sources
Whether you have an XML file on your local disk, a URL that returns XML data, or a file-like object, `read_xml` can read XML from all of these sources.
Read from a Local XML File
Let’s start with the most common scenario: reading from a local XML file.
```python
import pandas as pd

xml_data = """<data>
  <row>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>"""

with open("shapes.xml", "w") as file:
    file.write(xml_data)

df = pd.read_xml("shapes.xml")
print(df)
```
Output:
```
      shape  degrees  sides
0    square      360    4.0
1  triangle      180    3.0
```
Here, we've written a simple XML string to a file named `shapes.xml` and then read it into a DataFrame using `read_xml`.
Read from a URL
Sometimes your data is hosted online. In that case, `read_xml` can read directly from a URL.
```python
url = "https://example.com/data.xml"
df = pd.read_xml(url)
```
Read from File-like Objects
If you have a file-like object (e.g., a response stream or an in-memory buffer), here's how to read it:
```python
from io import StringIO

xml_data = """<data>
  <row>
    <name>John</name>
    <age>28</age>
  </row>
  <row>
    <name>Jane</name>
    <age>24</age>
  </row>
</data>"""

data_io = StringIO(xml_data)
df = pd.read_xml(data_io)
print(df)
```
Output:
```
   name  age
0  John   28
1  Jane   24
```
In this example, we simulate a file-like object using `StringIO` and then read the XML data from it into a DataFrame.
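Binary buffers work the same way. As a small sketch (same hypothetical data), a `BytesIO` object, such as the body of an HTTP response, can be passed directly:

```python
import pandas as pd
from io import BytesIO

xml_bytes = b"""<data>
  <row><name>John</name><age>28</age></row>
  <row><name>Jane</name><age>24</age></row>
</data>"""

# BytesIO mimics a binary file object, e.g. a downloaded response body
df = pd.read_xml(BytesIO(xml_bytes), parser="etree")
print(df)
```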
Choosing an XML parser
Pandas' `read_xml` provides the flexibility to choose among different XML parsers. The parser you choose can impact performance, so it's worth understanding the options available.
The lxml Parser
By default, `read_xml` uses the `lxml` parser, which is efficient and suitable for large XML files.
```python
df = pd.read_xml("shapes.xml", parser="lxml")
print(df)
```
Output:
```
      shape  degrees  sides
0    square      360    4.0
1  triangle      180    3.0
```
This code explicitly specifies the `lxml` parser, although it is the default even if you don't specify it.
The etree Parser
Another option is the `etree` parser from Python's standard library.
```python
df = pd.read_xml("shapes.xml", parser="etree")
print(df)
```
Output:
```
      shape  degrees  sides
0    square      360    4.0
1  triangle      180    3.0
```
Benchmark Parsers
For this demonstration, I'll first generate a large XML file, then benchmark the `read_xml()` function with both parsers.
We’ll generate a large XML file with 1 million entries to compare the performance.
```python
import pandas as pd
import random
import time

# Step 1: Generate a large XML file
num_entries = 1000000
shapes = ["triangle", "square", "pentagon", "hexagon"]

xml_data = "<data>"
for _ in range(num_entries):
    shape = random.choice(shapes)
    xml_data += (
        f"<polygon><shape>{shape}</shape>"
        f"<degrees>{random.randint(100, 400)}</degrees>"
        f"<sides>{random.randint(3, 6)}</sides></polygon>"
    )
xml_data += "</data>"

with open("large_sample.xml", "w") as f:
    f.write(xml_data)

file_path = "large_sample.xml"

# Benchmark the lxml parser
start_time_lxml = time.time()
df_lxml = pd.read_xml(file_path, parser="lxml")
lxml_duration = time.time() - start_time_lxml

# Benchmark the etree parser
start_time_etree = time.time()
df_etree = pd.read_xml(file_path, parser="etree")
etree_duration = time.time() - start_time_etree

print(lxml_duration)
print(etree_duration)
```
Output:
```
44.94610905647278
18.623760223388672
```
In this run, the `etree` parser was considerably faster, though relative performance can vary with file structure and library versions.
Specify Column Names
Consider an XML file named `persons.xml` with the following content:
```xml
<data>
  <row>
    <nm>John</nm>
    <ag>28</ag>
  </row>
  <row>
    <nm>Jane</nm>
    <ag>24</ag>
  </row>
</data>
```
Notice that the column tags are abbreviated (`nm` for name and `ag` for age). If you want to give these columns more descriptive names when reading the XML into a DataFrame, use the `names` parameter.
```python
df = pd.read_xml("persons.xml", names=["name", "age"])
print(df)
```
Output:
```
   name  age
0  John   28
1  Jane   24
```
Handling Compressed XML Files
Pandas' `read_xml` can read compressed XML files directly, with no need for manual decompression.
Read GZIP Compressed XML
Let's say you have an XML file compressed with GZIP, named `data.xml.gz`.
To read this into a DataFrame:
```python
df = pd.read_xml("data.xml.gz", compression="gzip")
print(df)
```
Read BZIP2 Compressed XML
If your XML file is compressed using BZIP2, e.g., `data.xml.bz2`:
```python
df = pd.read_xml("data.xml.bz2", compression="bz2")
print(df)
```
Read ZIP Compressed XML
For XML files within a ZIP archive, say `data.xml.zip`:
```python
df = pd.read_xml("data.xml.zip", compression="zip")
print(df)
```
Auto-detecting Compression
In many cases, you don't even have to specify the compression type. `read_xml` auto-detects the compression from file extensions like `.gz`, `.bz2`, and `.zip`.
```python
df = pd.read_xml("data.xml.gz")
print(df)
```
This code works seamlessly, even without the `compression` parameter, because of the `.gz` file extension.
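As a self-contained sketch (file name invented), you can verify the round trip yourself by writing a GZIP-compressed file and reading it back, letting `read_xml` infer the compression from the `.gz` extension:

```python
import gzip
import pandas as pd

xml_data = """<data>
  <row><shape>square</shape><degrees>360</degrees></row>
  <row><shape>triangle</shape><degrees>180</degrees></row>
</data>"""

# Write the XML compressed with gzip ("wt" opens it in text mode)
with gzip.open("shapes.xml.gz", "wt", encoding="utf-8") as f:
    f.write(xml_data)

# Compression is inferred from the .gz extension
df = pd.read_xml("shapes.xml.gz", parser="etree")
print(df)
```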
Advanced Storage Connection Options
The `storage_options` parameter of `read_xml` allows you to pass extra options required to connect to storage systems such as cloud storage.
Read from S3 Bucket with Authentication
Suppose your XML data is stored in an S3 bucket, and accessing it requires credentials. Here's how to use `storage_options`:
```python
s3_path = "s3://your_bucket_name/path_to_file.xml"
df = pd.read_xml(
    s3_path,
    storage_options={"key": "your_access_key", "secret": "your_secret_key"},
)
print(df)
```
Using with Other Storage Systems
`storage_options` is not limited to S3. You can use it to pass connection parameters required by various storage systems, such as GCS (Google Cloud Storage), Azure Blob Storage, and others.
For example, connecting to GCS will look something like this:
```python
gcs_path = "gcs://your_bucket_name/path_to_file.xml"
df = pd.read_xml(gcs_path, storage_options={"token": "your_gcs_token"})
print(df)
```
Using XPath Expressions to Select Data
XPath is a powerful querying language that allows you to deal with complex XML structures where you need to extract specific sections or attributes.
The `xpath` parameter of the `read_xml` function allows you to utilize XPath expressions.
Basic XPath Selection
Consider an XML file `products.xml` with the following structure:
```xml
<products>
  <product category="electronics">
    <name>Smartphone</name>
    <price>500</price>
  </product>
  <product category="books">
    <name>Python Guide</name>
    <price>30</price>
  </product>
</products>
```
To select only the products in the “electronics” category:
```python
df = pd.read_xml("products.xml", xpath="//product[@category='electronics']")
print(df)
```
Output:
```
      category        name  price
0  electronics  Smartphone    500
```
Extracting Specific Elements
You can further refine your selection. To extract only the names of products in the “electronics” category:
```python
df = pd.read_xml("products.xml", xpath="//product[@category='electronics']")
df_name = df["name"]
print(df_name)
```
Output:
```
0    Smartphone
Name: name, dtype: object
```
Using Functions in XPath
XPath provides various functions to make your selections more dynamic. For example, to select products that are priced above 100:
```python
df = pd.read_xml("products.xml", xpath="//product[number(price)>100]")
print(df)
```
Output:
```
      category        name  price
0  electronics  Smartphone    500
```
Working with XML Namespaces
XML namespaces are used to differentiate between elements with the same name but defined in different XML vocabularies.
They’re crucial for avoiding naming conflicts in XML documents.
Consider the following XML, saved as `data.xml`, which uses namespaces:
```xml
<root xmlns:product="http://www.example.com/product"
      xmlns:price="http://www.example.com/price">
  <product:item>
    <product:name>Laptop</product:name>
    <price:value>1000</price:value>
  </product:item>
  <product:item>
    <product:name>Mouse</product:name>
    <price:value>20</price:value>
  </product:item>
</root>
```
Notice the `xmlns:product` and `xmlns:price` attributes defining the XML namespaces for product and price, respectively.
Parse XML with Namespaces in Pandas
To parse XML with namespaces, we need to provide them to the `read_xml` function.
```python
namespaces = {
    "product": "http://www.example.com/product",
    "price": "http://www.example.com/price",
}

df = pd.read_xml("data.xml", xpath="//product:item", namespaces=namespaces)
print(df)
```
Output:
```
     name  value
0  Laptop   1000
1   Mouse     20
```
In the code, we provide a dictionary of namespaces via the `namespaces` parameter.
This ensures that Pandas can correctly interpret the XML namespaces and extract the relevant data.
XPath with Namespaces
When querying elements within a namespace, you’ll use the prefix you’ve defined:
```python
namespaces = {
    "product": "http://www.example.com/product",
    "price": "http://www.example.com/price",
}

df = pd.read_xml("data.xml", xpath="//product:item", namespaces=namespaces)
product_names = df["name"]
print(product_names)
```
Output:
```
0    Laptop
1     Mouse
Name: name, dtype: object
```
This code specifically extracts the product names using the `product` namespace prefix.
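A related gotcha worth noting: a default namespace (`xmlns="…"` with no prefix) still namespaces every element, so you must bind a prefix of your own choosing in the `namespaces` dict and use it in the XPath. A sketch with an invented namespace URI:

```python
import pandas as pd
from io import StringIO

# Default namespace: no prefix appears in the document itself
xml_data = """<root xmlns="http://www.example.com/catalog">
  <item><name>Laptop</name><value>1000</value></item>
  <item><name>Mouse</name><value>20</value></item>
</root>"""

# Bind any prefix (here "doc") to the default namespace URI
namespaces = {"doc": "http://www.example.com/catalog"}
df = pd.read_xml(
    StringIO(xml_data),
    xpath=".//doc:item",
    namespaces=namespaces,
    parser="etree",
)
print(df)
```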
Reading Only Element Values
The `read_xml` function provides the `elems_only` parameter to read only element values, excluding any attributes.
Consider this XML, saved as `products.xml`:
```xml
<products>
  <item type="electronics">
    <name>Laptop</name>
    <price>1000</price>
  </item>
  <item type="accessory">
    <name>Mouse</name>
    <price>20</price>
  </item>
</products>
```
Each `item` element has a `type` attribute along with child elements for the product name and price.
Parse XML Ignoring Attributes
To read only the element values and ignore attributes like `type`, set `elems_only=True`:
```python
df = pd.read_xml("products.xml", elems_only=True)
print(df)
```
Output:
```
     name  price
0  Laptop   1000
1   Mouse     20
```
For comparison, here's what happens when you don't specify `elems_only`:
```python
df_default = pd.read_xml("products.xml")
print(df_default)
```
Output:
```
          type    name  price
0  electronics  Laptop   1000
1    accessory   Mouse     20
```
In the default behavior, both the attribute `type` and the element values `name` and `price` are included in the resulting DataFrame.
Reading Only Attribute Values
The `attrs_only` parameter of the `read_xml` function allows you to extract just the attribute values from the XML.
Recall our `products.xml`:
```xml
<products>
  <item type="electronics">
    <name>Laptop</name>
    <price>1000</price>
  </item>
  <item type="accessory">
    <name>Mouse</name>
    <price>20</price>
  </item>
</products>
```
Here, we have the `type` attribute within each `item` element, along with child elements for the product name and price.
Parse XML Focusing on Attributes
To read only attributes and ignore element values, set `attrs_only=True`:
```python
df = pd.read_xml("products.xml", attrs_only=True)
print(df)
```
Output:
```
          type
0  electronics
1    accessory
```
And this is what you'll get if you don't specify `attrs_only`:
```python
df_default = pd.read_xml("products.xml")
print(df_default)
```
Output:
```
          type    name  price
0  electronics  Laptop   1000
1    accessory   Mouse     20
```
By default, both the `type` attribute and the element values `name` and `price` are extracted.
Specify Column Data Types
The `dtype` parameter of the `read_xml` function lets you control the data types of the resulting columns.
By default, `read_xml` tries to infer the appropriate types:
```python
df_default = pd.read_xml("products.xml")
print(df_default.dtypes)
```
Output:
```
type     object
name     object
price     int64
dtype: object
```
The `price` column is correctly inferred as an integer (`int64`).
Manually Setting Data Types
If you want to specify column data types explicitly, you can use the `dtype` parameter:
```python
df = pd.read_xml("products.xml", dtype={"price": "float64", "name": "string"})
print(df.dtypes)
```
Output:
```
type             object
name     string[python]
price           float64
dtype: object
```
Here, we've set the `price` column to be a float (`float64`) and the `name` column to be a string type (`string`).
Custom Conversion Functions for Specific Columns
With the `converters` parameter of `read_xml`, you can define custom functions to transform specific columns during parsing.
Let's consider an XML named `sales.xml` with the following content:
```xml
<sales>
  <transaction>
    <product>Laptop</product>
    <amount>$1000.00</amount>
    <date>10th Sep 2023</date>
  </transaction>
  <transaction>
    <product>Mouse</product>
    <amount>$20.00</amount>
    <date>11th Sep 2023</date>
  </transaction>
</sales>
```
The `amount` values have dollar signs, and the `date` values have ordinal indicators, making them non-standard for automatic type inference.
Define Custom Conversion Functions
For our purposes, we’ll create two functions:
- `convert_currency`: Removes the dollar sign and converts to a float.
- `convert_date`: Parses the non-standard date format.
```python
import pandas as pd
from datetime import datetime

def convert_currency(value):
    return float(value.replace("$", ""))

def convert_date(value):
    return datetime.strptime(value, "%dth %b %Y")
```
Using converters in read_xml
Now, apply these custom functions using the `converters` parameter:
```python
df = pd.read_xml(
    "sales.xml",
    converters={
        "amount": convert_currency,
        "date": convert_date,
    },
)
print(df)
```
Output:
```
  product  amount       date
0  Laptop  1000.0 2023-09-10
1   Mouse    20.0 2023-09-11
```
The `amount` column is now of type float, and the `date` column is a datetime object, ready for further analysis.
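One caveat with the converter above: `"%dth %b %Y"` matches only the literal "th", so values like "1st Sep 2023" or "2nd Sep 2023" would raise a `ValueError`. A more forgiving converter (a sketch, not part of the original example) strips any ordinal suffix before parsing:

```python
import re
from datetime import datetime

def convert_date_any_ordinal(value):
    # Drop st/nd/rd/th after the day number, then parse normally
    cleaned = re.sub(r"(\d{1,2})(st|nd|rd|th)", r"\1", value)
    return datetime.strptime(cleaned, "%d %b %Y")

print(convert_date_any_ordinal("1st Sep 2023"))   # 2023-09-01 00:00:00
print(convert_date_any_ordinal("22nd Sep 2023"))  # 2023-09-22 00:00:00
```

This drop-in replacement can be passed to `converters` exactly like `convert_date` above.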
Parse Specific Columns as Dates
While you can use custom converters as shown earlier, for standard date formats the `parse_dates` parameter of `read_xml` provides a more straightforward mechanism.
Consider an XML named `events.xml`:
```xml
<events>
  <event>
    <name>Conference A</name>
    <date>2023-09-10</date>
  </event>
  <event>
    <name>Workshop B</name>
    <date>2023-09-12</date>
  </event>
</events>
```
The `date` values are in the standard `YYYY-MM-DD` format.
Parse Dates Using parse_dates
To ensure the `date` column is parsed as a datetime object, use the `parse_dates` parameter:
```python
df = pd.read_xml("events.xml", parse_dates=["date"])
print(df.dtypes)
```
Output:
```
name            object
date    datetime64[ns]
dtype: object
```
Handling Multiple Date Columns
If your XML contains multiple date columns, you can easily parse all of them by providing a list:
```python
df = pd.read_xml("events.xml", parse_dates=["date", "another_date_column"])
```
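For a runnable sketch (the layout and the second date column are invented for illustration), here is an XML document with two date columns parsed in one call:

```python
import pandas as pd
from io import StringIO

# Hypothetical schedule with a start and an end date per event
xml_data = """<events>
  <event>
    <name>Conference A</name>
    <start>2023-09-10</start>
    <end>2023-09-12</end>
  </event>
  <event>
    <name>Workshop B</name>
    <start>2023-10-01</start>
    <end>2023-10-02</end>
  </event>
</events>"""

df = pd.read_xml(StringIO(xml_data), parse_dates=["start", "end"], parser="etree")
print(df.dtypes)
```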
Inferring and Enforcing dtypes
While Pandas can infer dtypes, sometimes you want more control or specificity.
The `dtype_backend` parameter of the `read_xml` function provides this level of control by offering backend-specific dtype inference.
By default, Pandas uses its own dtype inference mechanism. Let's save our XML data to a file named `sample.xml`:
```xml
<data>
  <entry>
    <value>1</value>
  </entry>
  <entry>
    <value>1.5</value>
  </entry>
</data>
```
Reading the file, you might observe:
```python
import pandas as pd

df = pd.read_xml("sample.xml")
print(df.dtypes)
```
Output:
```
value    float64
dtype: object
```
Pandas inferred the `value` column as `float64`. What if this is not the desired type?
Using the dtype_backend Parameter
The `dtype_backend` parameter allows you to select a backend (`"numpy_nullable"` or `"pyarrow"`) for dtype inference:
```python
df_nullable = pd.read_xml("sample.xml", dtype_backend="numpy_nullable")
df_arrow = pd.read_xml("sample.xml", dtype_backend="pyarrow")

print("NumPy Nullable Backend:", df_nullable.dtypes)
print("pyarrow Backend:", df_arrow.dtypes)
```
Output:
```
NumPy Nullable Backend: value    Float64
dtype: object
pyarrow Backend: value    double[pyarrow]
dtype: object
```
With the `pyarrow` backend, the `value` column is inferred as `double[pyarrow]`, which can be preferable when working with Arrow-backed data.
Transform XML Using XSLT Stylesheets
Extensible Stylesheet Language Transformations (XSLT) is a powerful tool for transforming XML documents.
The `stylesheet` parameter of the `read_xml` function lets you achieve this by applying an XSLT stylesheet.
Suppose you have an XML file, `employees.xml`:
```xml
<employees>
  <employee>
    <name>John</name>
    <role>Developer</role>
  </employee>
  <employee>
    <name>Mary</name>
    <role>Designer</role>
  </employee>
</employees>
```
Create an XSLT Stylesheet
Suppose you want to transform the XML to keep only the employee names. The XSLT for this transformation could look like:
```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="role"/>
</xsl:stylesheet>
```
Save this stylesheet as `transform.xslt`.
Apply the XSLT Stylesheet Using stylesheet
To use the transformation when reading the XML data:
```python
df = pd.read_xml("employees.xml", stylesheet="transform.xslt")
print(df)
```
Output:
```
   name
0  John
1  Mary
```
As seen, only the employee names are retained in the resulting DataFrame.
Stream Parsing: Load Large XML Files
Processing large XML files can be resource-intensive, especially when loading the entire document into memory.
Stream parsing processes XML elements as they’re encountered, without loading the entire file into memory.
Pandas provides this capability via the `iterparse` parameter of the `read_xml` function.
Using iterparse
Suppose you have a substantial XML file named `bigdata.xml`:
```xml
<data>
  <polygon>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </polygon>
  <polygon>
    <shape>circle</shape>
    <degrees>360</degrees>
  </polygon>
  <polygon>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </polygon>
</data>
```
Instead of reading the entire file at once, use `iterparse` to handle it incrementally:
```python
df = pd.read_xml("bigdata.xml", iterparse={"polygon": ["shape", "degrees", "sides"]})
print(df.head())
```
Output:
```
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0
```
The `iterparse` argument specifies the repeating element to parse and which of its child elements or attributes to extract.
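The streaming idea itself is easy to see with the standard library's `xml.etree.ElementTree.iterparse`, which is roughly the machinery this mode builds on. A stdlib-only sketch (tiny invented document) that collects shape names without holding the whole tree in memory:

```python
import xml.etree.ElementTree as ET
from io import StringIO

xml_data = """<data>
  <polygon><shape>square</shape><degrees>360</degrees></polygon>
  <polygon><shape>triangle</shape><degrees>180</degrees></polygon>
</data>"""

shapes = []
# iterparse yields each element as its closing tag is seen
for event, elem in ET.iterparse(StringIO(xml_data), events=("end",)):
    if elem.tag == "polygon":
        shapes.append(elem.findtext("shape"))
        elem.clear()  # discard the processed element to keep memory flat

print(shapes)  # ['square', 'triangle']
```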
Benchmark iterparse
Let's compare performance with and without `iterparse` when reading a large XML file. Here's the plan:
- Generate a large XML file.
- Measure the time taken to read the XML file without `iterparse`.
- Measure the time taken to read the XML file with `iterparse`.
```python
import pandas as pd
import random
import time

# Step 1: Generate a large XML file
num_entries = 1000000
shapes = ["triangle", "square", "pentagon", "hexagon"]

xml_data = "<data>"
for _ in range(num_entries):
    shape = random.choice(shapes)
    xml_data += (
        f"<polygon><shape>{shape}</shape>"
        f"<degrees>{random.randint(100, 400)}</degrees>"
        f"<sides>{random.randint(3, 6)}</sides></polygon>"
    )
xml_data += "</data>"

with open("large_sample.xml", "w") as f:
    f.write(xml_data)

# Step 2: Measure time without iterparse
start_time = time.time()
df1 = pd.read_xml("large_sample.xml")
without_iterparse_time = time.time() - start_time
print(f"Time without iterparse: {without_iterparse_time:.4f} seconds")

# Step 3: Measure time with iterparse
start_time = time.time()
df2 = pd.read_xml("large_sample.xml", iterparse={"polygon": ["shape", "degrees", "sides"]})
with_iterparse_time = time.time() - start_time
print(f"Time with iterparse: {with_iterparse_time:.4f} seconds")
```
Output:
```
Time without iterparse: 42.4067 seconds
Time with iterparse: 35.7642 seconds
```
As you can see, `iterparse` parsed the large XML file faster, and that gap was measured on just 1 million entries.
Resource
https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html
Mokhtar is the founder of LikeGeeks.com. He is a seasoned technologist and accomplished author, with expertise in Linux system administration and Python development. Since 2010, Mokhtar has built an impressive career, transitioning from system administration to Python development in 2015. His work spans large corporations to freelance clients around the globe. Alongside his technical work, Mokhtar has authored some insightful books in his field. Known for his innovative solutions, meticulous attention to detail, and high-quality work, Mokhtar continually seeks new challenges within the dynamic field of technology.