Export XML to CSV using Python Pandas

In this tutorial, you’ll learn how to export XML to CSV using Pandas in Python.

The following examples will provide different scenarios where XML data varying in complexity from basic structures to deeply nested and attribute-rich formats, are transformed into CSV files using Pandas.

We’ll use Pandas read_xml() to read the XML content and Pandas to_csv() to export the data to CSV format.

 

 

Basic XML to CSV

This example deals with a straightforward XML file where data is organized in a simple hierarchical structure.

Sample XML Content:

<records>
  <record>
    <id>1</id>
    <name>John Doe</name>
    <email>john@example.com</email>
  </record>
  <record>
    <id>2</id>
    <name>Jane Doe</name>
    <email>jane@example.com</email>
  </record>
</records>

Code:

import pandas as pd
df = pd.read_xml('sample_data.xml')
df.to_csv('output.csv', index=False)

Output:

id,name,email
1,John Doe,john@example.com
2,Jane Doe,jane@example.com

 

 

Nested XML Elements

You can use the xml.etree.ElementTree module to parse the XML file.

It iterates over each node, extracts the relevant data, and stores it in a dictionary. Finally, the dictionary is converted into a DataFrame and exported to a CSV file.

Sample XML Content:

<employees>
  <employee>
    <id>101</id>
    <name>Emily</name>
    <contact>
      <email>emily@example.com</email>
      <phone>1234567890</phone>
    </contact>
  </employee>
  <employee>
    <id>102</id>
    <name>James</name>
    <contact>
      <email>james@example.com</email>
      <phone>0987654321</phone>
    </contact>
  </employee>
</employees>

Code:

import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()

# Extract the data from XML and flatten it into a dictionary
data = []
for employee in root.findall('employee'):
    employee_data = {
        'id': employee.find('id').text,
        'name': employee.find('name').text,
        'email': employee.find('contact/email').text,
        'phone': employee.find('contact/phone').text
    }
    data.append(employee_data)
df = pd.DataFrame.from_records(data)
df.to_csv('nested_output.csv', index=False)

Output:

id,name,email,phone
101,Emily,emily@example.com,1234567890
102,James,james@example.com,0987654321

 

XML with Attributes

This example focuses on XML files where elements have attributes.

Sample XML Content:

<products>
  <product id="p001">
    <name>Widget</name>
    <price>19.99</price>
  </product>
  <product id="p002">
    <name>Gadget</name>
    <price>29.99</price>
  </product>
</products>

Code:

import pandas as pd
df = pd.read_xml('input.xml', xpath='/products/product')
df.to_csv('output.csv', index=False)

Output:

id,name,price
p001,Widget,19.99
p002,Gadget,29.99

 

XML with Mixed Content

This example covers an XML file with a mix of attributes and text content within elements.

Sample XML Content:

<inventory>
  <item type="electronic">
    <name>Laptop</name>
    <quantity>10</quantity>
    <price>1200</price>
  </item>
  <item type="furniture">
    <name>Chair</name>
    <quantity>50</quantity>
    <price>100</price>
  </item>
</inventory>

Code:

import pandas as pd
df = pd.read_xml('data.xml', xpath='//item')
df.to_csv('output.csv', index=False)

Output:

type,name,quantity,price
electronic,Laptop,10,1200
furniture,Chair,50,100

 

XML with Complex Hierarchies

This example demonstrates handling an XML file with complex hierarchies and multiple levels of nested elements.

Sample XML Content:

<hospital>
  <ward name="Cardiology">
    <patient>
      <name>Edward</name>
      <diagnosis>Arrhythmia</diagnosis>
      <treatment>Medication</treatment>
    </patient>
    <patient>
      <name>Nancy</name>
      <diagnosis>Myocarditis</diagnosis>
      <treatment>Rest</treatment>
    </patient>
  </ward>
  <ward name="Neurology">
    <patient>
      <name>Oliver</name>
      <diagnosis>Migraine</diagnosis>
      <treatment>Medication</treatment>
    </patient>
  </ward>
</hospital>

Code:

import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
ward_names = []
patient_names = []
diagnoses = []
treatments = []
for ward in root.findall('ward'):
    ward_name = ward.get('name')
    for patient in ward.findall('patient'):
        ward_names.append(ward_name)
        patient_names.append(patient.find('name').text)
        diagnoses.append(patient.find('diagnosis').text)
        treatments.append(patient.find('treatment').text)
data = {'ward_name': ward_names, 'patient_name': patient_names, 'diagnosis': diagnoses, 'treatment': treatments}
df = pd.DataFrame(data)
df.to_csv('output.csv', index=False)

Output:

ward_name,patient_name,diagnosis,treatment
Cardiology,Edward,Arrhythmia,Medication
Cardiology,Nancy,Myocarditis,Rest
Neurology,Oliver,Migraine,Medication

 

Exporting XML with Namespace to CSV

Handling XML files with namespaces can be a bit tricky, as they add another layer of complexity.

This example demonstrates how to process an XML file with namespaces and export it to a CSV format using Pandas.

Sample XML Content:

<ns:company xmlns:ns="http://www.example.com/ns">
  <ns:employee>
    <ns:id>001</ns:id>
    <ns:name>Lisa</ns:name>
    <ns:role>Developer</ns:role>
  </ns:employee>
  <ns:employee>
    <ns:id>002</ns:id>
    <ns:name>Tom</ns:name>
    <ns:role>Designer</ns:role>
  </ns:employee>
</ns:company>

Code:

import pandas as pd
namespaces = {
    'ns': 'http://www.example.com/ns'
}
df = pd.read_xml('namespace_data.xml', xpath='//ns:employee', namespaces=namespaces)
df.to_csv('namespace_output.csv', index=False)

Output:

id,name,role
1,Lisa,Developer
2,Tom,Designer

 

Exporting Specific XML Elements to CSV

Sometimes, you may only need to export specific elements from an XML file to a CSV, instead of the entire dataset.

Sample XML Content:

<catalog>
  <product>
    <name>Widget A</name>
    <category>Tools</category>
    <price>19.99</price>
  </product>
  <product>
    <name>Gadget B</name>
    <category>Electronics</category>
    <price>29.99</price>
  </product>
  <product>
    <name>Device C</name>
    <category>Gadgets</category>
    <price>39.99</price>
  </product>
</catalog>

Code:

import pandas as pd
df = pd.read_xml('catalog_data.xml', xpath='//product')
selected_columns = df[['name', 'price']]
selected_columns.to_csv('selected_elements_output.csv', index=False)

Here we specified the columns we want to export “name” and “price”.

Leave a Reply

Your email address will not be published. Required fields are marked *