Transforming XML: XML Pipelines

XML transformation is the process of converting XML data from one format to another, typically for presentation or for integration with other systems. The output can be HTML, CSV, XML with a different structure, or another text format.

These transformations enable us to select, filter, sort, and reorganize XML data according to specific requirements. This is crucial for extracting relevant information from large XML documents.

Transformations also allow XML data to be converted into other formats, making it more versatile and usable across different systems and applications.

XML is primarily designed for storing and transporting structured data, but it's not inherently formatted for human readability or presentation. Transformations allow us to convert XML into more user-friendly formats such as HTML for web display.

The two primary languages for performing XML transformations are:

  1. XSLT (eXtensible Stylesheet Language Transformations):

    • XSLT is designed specifically for transforming XML documents into other formats like HTML, XML, or plain text.

    • It operates based on a set of rules defined in an XSLT stylesheet, which specifies how elements and attributes in the XML document should be transformed into the desired output format.

    • XSLT uses XPath to navigate through the XML structure and apply transformation rules defined in templates.

  2. XQuery:

    • XQuery is a query and functional programming language designed for querying and manipulating XML data.

    • It can also be used for transforming XML data into different formats similar to XSLT, but it is more geared towards querying and extracting data from XML documents.

    • XQuery plays a role for XML similar to the one SQL plays for relational data: its FLWOR expressions (for, let, where, order by, return) support complex querying, filtering, sorting, and transformation of XML data.

Both XSLT and XQuery are powerful tools in XML processing:

  • XSLT is best suited for tasks where the primary goal is to transform XML into structured formats like HTML, often used in web development and document generation.

  • XQuery, on the other hand, is used more for querying XML data and extracting specific information, often integrated into XML database systems and used in scenarios where data extraction and transformation are needed.
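As a minimal sketch of the XSLT side, the snippet below applies a tiny stylesheet with lxml, the library used later in this tutorial. The catalogue document and its element names are invented for illustration; the XPath predicate does the filtering and the template writes the output.

```python
from lxml import etree

# Hypothetical input document, invented for this illustration.
xml_doc = etree.XML(
    "<catalogue>"
    "<item price='5'>Pen</item>"
    "<item price='12'>Book</item>"
    "</catalogue>"
)

# A stylesheet with one template: XPath selects the items costing more
# than 10, and xsl:value-of writes each selected item's text to the output.
stylesheet = etree.XML("""
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="catalogue/item[@price &gt; 10]">
      <xsl:value-of select="."/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)   # compile the stylesheet
result = str(transform(xml_doc))     # run it and serialise the result
print(result)
```

Only the item priced above 10 reaches the output, which shows the division of labour: XPath selects, templates emit.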


Transforming XML Data to HTML Using XSLT and Flask

A Practical Guide

Let's look at a practical example: using XSLT to transform some sample XML from the New York Philharmonic Orchestra's concert archive, available in their GitHub repository here:

https://raw.githubusercontent.com/nyphilarchive/PerformanceHistory/master/Programs/xml/1842-43_TO_1910-11.xml

In this example, I'll guide you through setting up a Python Flask project, installing the necessary tools, and writing the XSLT code to display concert information in a user-friendly HTML format. By the end of this tutorial, you'll be able to extend and customise the application to explore and present complex XML data in other formats.

This tutorial is designed for macOS, but students using Windows or Linux can follow along with minimal adjustments.

Start by opening your terminal and creating a new project directory for the application:

mkdir concerts && cd concerts

Next, create and activate a virtual environment:

python -m venv venv
source venv/bin/activate   
# On Windows, use `venv\Scripts\activate`

Install Flask and lxml

First make sure you are using the latest version of pip (pip install --upgrade pip), then install the necessary packages:

pip install Flask lxml

Create Directory Structure and Download XML Data

Now create two subdirectories called templates and data, where your XSLT file and XML data will be stored:

mkdir templates data

Next, download the XML file from the New York Philharmonic Orchestra archive. We'll put this file in our data folder:

curl -o data/concerts.xml https://raw.githubusercontent.com/nyphilarchive/PerformanceHistory/master/Programs/xml/1842-43_TO_1910-11.xml
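Before writing the stylesheet, it can help to confirm which elements the downloaded file actually contains. The following is a minimal sketch using Python's standard library. The summarise helper is a name introduced here, and it assumes the concerts sit in program elements, as the stylesheet in this tutorial does.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def summarise(root):
    """Count <program> elements and tally the tags of their direct children."""
    programs = list(root.iter("program"))
    child_tags = Counter(child.tag for program in programs for child in program)
    return len(programs), child_tags


# Only runs when the file from the previous step is present.
xml_path = Path("data/concerts.xml")
if xml_path.exists():
    count, tags = summarise(ET.parse(xml_path).getroot())
    print(count, "programs")
    for tag, n in tags.most_common():
        print(f"  {tag}: {n}")
```

The tag tally shows which child elements every program has and which appear more than once per program, which is useful to know when choosing XPath expressions.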

Create the XSLT File

Next, create a new file table.xsl inside the templates directory:

touch templates/table.xsl

Add the following content to table.xsl. This XSL (Extensible Stylesheet Language) file transforms an XML document into an HTML document. The comments explain each part:

<?xml version="1.0" encoding="UTF-8"?>
<!-- The XML declaration defines the XML version and the character encoding used. -->

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <!-- This is the root element of the XSLT stylesheet. 
         It declares the XSL namespace and specifies that this is version 1.0 of XSLT. -->

    <xsl:output method="html" indent="yes" />
    <!-- This element defines the output method as HTML and specifies that the output should be indented for readability. -->

    <xsl:template match="/">
        <!-- This template matches the root node of the XML document.
             It is the starting point for the transformation. -->

        <html>
            <head>
                <title>NY Philharmonic Concerts</title>
                <!-- The title of the HTML document. -->
            </head>
            <body>
                <h1>NY Philharmonic Concerts</h1>
                <!-- The main heading of the HTML document. -->

                <table border="1">
                    <!-- Creates an HTML table with a border. -->

                    <tr>
                        <th>Season</th>
                        <th>Orchestra</th>
                        <th>Date</th>
                        <th>Venue</th>
                        <th>Time</th>
                        <!-- Table headers for the concert data. -->
                    </tr>

                    <xsl:for-each select="//program">
                        <!-- Iterates over each <program> element in the XML document. -->

                        <tr>
                            <!-- Creates a new table row for each <program> element. -->

                            <td><xsl:value-of select="season"/></td>
                            <!-- Adds a table cell with the value of the <season> element. -->

                            <td><xsl:value-of select="orchestra"/></td>
                            <!-- Adds a table cell with the value of the <orchestra> element. -->

                            <td><xsl:value-of select="concertInfo/Date"/></td>
                            <!-- Adds a table cell with the value of the <Date> element inside <concertInfo>. -->

                            <td><xsl:value-of select="concertInfo/Venue"/></td>
                            <!-- Adds a table cell with the value of the <Venue> element inside <concertInfo>. -->

                            <td><xsl:value-of select="concertInfo/Time"/></td>
                            <!-- Adds a table cell with the value of the <Time> element inside <concertInfo>. -->

                        </tr>
                    </xsl:for-each>
                </table>
            </body>
        </html>
    </xsl:template>
</xsl:stylesheet>

Summary of the XSLT File

The provided XSLT file transforms XML data related to concerts into an HTML table format. Here's a detailed breakdown of its components and functionality:

  1. XML Declaration:

     <?xml version="1.0" encoding="UTF-8"?>
    
    • Specifies the version of XML and the character encoding used.
  2. XSLT Stylesheet Root Element:

     <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    
    • Declares the document as an XSLT stylesheet.

    • Defines the XSL namespace and specifies XSLT version 1.0.

  3. Output Method Declaration:

     <xsl:output method="html" indent="yes" />
    
    • Specifies that the output should be in HTML format.

    • Indicates that the output should be indented for better readability.

  4. Template for Root Node:

     <xsl:template match="/">
    
    • Matches the root node of the XML document.

    • Serves as the starting point for the transformation.

  5. HTML Document Structure:

     <html>
         <head>
             <title>NY Philharmonic Concerts</title>
         </head>
         <body>
             <h1>NY Philharmonic Concerts</h1>
             <table border="1">
                 <tr>
                     <th>Season</th>
                     <th>Orchestra</th>
                     <th>Date</th>
                     <th>Venue</th>
                     <th>Time</th>
                 </tr>
                 ...
             </table>
         </body>
     </html>
    
    • Constructs the basic HTML structure including the <html>, <head>, and <body> tags.

    • Sets the document title to "NY Philharmonic Concerts".

    • Adds a main heading (<h1>) with the same title.

    • Creates an HTML table with a border.

  6. Table Headers:

     <tr>
         <th>Season</th>
         <th>Orchestra</th>
         <th>Date</th>
         <th>Venue</th>
         <th>Time</th>
     </tr>
    
    • Defines the headers for the table columns: Season, Orchestra, Date, Venue, and Time.
  7. Iteration Over XML Data:

     <xsl:for-each select="//program">
    
    • Iterates over each <program> element in the XML document.
  8. Table Rows for Each Program:

     <tr>
         <td><xsl:value-of select="season"/></td>
         <td><xsl:value-of select="orchestra"/></td>
         <td><xsl:value-of select="concertInfo/Date"/></td>
         <td><xsl:value-of select="concertInfo/Venue"/></td>
         <td><xsl:value-of select="concertInfo/Time"/></td>
     </tr>
    
    • For each <program> element:

      • Creates a new table row (<tr>).

      • Adds table cells (<td>) for the values selected by the XPath expressions season, orchestra, concertInfo/Date, concertInfo/Venue, and concertInfo/Time.

      • Uses <xsl:value-of> to extract and insert the text content of the specified XML elements.

This XSLT file is designed to take an XML document containing concert information and transform it into a well-structured HTML table. It processes each <program> element in the XML, extracting relevant details such as season, orchestra, date, venue, and time, and displaying them in a tabular format on a web page. This setup ensures that concert data is presented clearly and accessibly to users.
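One detail worth knowing: in XSLT 1.0, xsl:value-of outputs only the first node when its select expression matches several. If a program contains more than one concertInfo element, the table above therefore shows only the first date, venue, and time. The sketch below demonstrates this on a small invented program (the dates are hypothetical) and shows one possible way to list every date instead. The run_cell helper is introduced here purely for the demonstration.

```python
from lxml import etree

# Hypothetical program with two performances; the dates are invented.
sample = etree.XML(
    "<programs><program>"
    "<concertInfo><Date>1842-12-07</Date></concertInfo>"
    "<concertInfo><Date>1843-02-18</Date></concertInfo>"
    "</program></programs>"
)


def run_cell(cell_body):
    """Wrap cell_body in a stylesheet that loops over //program, then run it."""
    stylesheet = etree.XML(
        '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">'
        '<xsl:output method="text"/>'
        '<xsl:template match="/">'
        '<xsl:for-each select="//program">' + cell_body + '</xsl:for-each>'
        '</xsl:template>'
        '</xsl:stylesheet>'
    )
    return str(etree.XSLT(stylesheet)(sample))


# As in table.xsl: value-of on a multi-node selection keeps only the first node.
first_only = run_cell('<xsl:value-of select="concertInfo/Date"/>')

# One alternative: loop over every Date and join the values with "; ".
all_dates = run_cell(
    '<xsl:for-each select="concertInfo/Date">'
    '<xsl:if test="position() &gt; 1">; </xsl:if>'
    '<xsl:value-of select="."/>'
    '</xsl:for-each>'
)

print(first_only)
print(all_dates)
```

Whether to show only the first performance or all of them is a presentation choice; the nested xsl:for-each is just one option.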


Create the Flask Application

Create a file named app.py in the root of your concerts directory:

touch app.py

Add the following content to app.py

from flask import Flask
import lxml.etree as ET

# Create a Flask application instance
app = Flask(__name__)

# Define the route for the root URL
@app.route('/')
def home():
    # Paths to the XML data and the XSLT stylesheet,
    # relative to the directory the app is started from
    xml_path = 'data/concerts.xml'
    xsl_path = 'templates/table.xsl'

    # Parse the XML and XSL files
    xml_tree = ET.parse(xml_path)
    xsl_tree = ET.parse(xsl_path)

    # Create an XSLT transformer
    transform = ET.XSLT(xsl_tree)

    # Apply the transformation
    result_tree = transform(xml_tree)

    # Serialise the result and return it as the HTTP response.
    # The output is already complete HTML, so it is returned directly
    # rather than passed through render_template_string, which would
    # treat any {{ ... }} or {% ... %} in the concert data as Jinja syntax.
    return str(result_tree)

# Run the Flask application
if __name__ == '__main__':
    app.run(debug=True, port=8088)

Explanation of app.py

  • Imports:

      from flask import Flask
      import lxml.etree as ET
    
    • Flask is imported from the flask module to create the web server.

    • lxml.etree is imported as ET to handle XML parsing and XSLT transformations.

  • Flask Application Instance:

      app = Flask(__name__)
    
    • An instance of the Flask application is created.

  • Route Definition:

      @app.route('/')
      def home():
    
    • The home function is defined to handle requests to the root URL (/).

  • Load and Parse XML and XSL Files:

      xml_path = 'data/concerts.xml'
      xsl_path = 'templates/table.xsl'
      xml_tree = ET.parse(xml_path)
      xsl_tree = ET.parse(xsl_path)
    
    • The paths to the XML and XSL files are defined. They are relative, so the application must be started from the concerts directory.

    • ET.parse is used to parse the XML and XSL files into tree structures.

  • Create XSLT Transformer and Apply Transformation:

      transform = ET.XSLT(xsl_tree)
      result_tree = transform(xml_tree)
    
    • An XSLT transformer is created using the parsed XSL tree.

    • The transformation is applied to the XML tree, resulting in a new tree structure (result_tree).

  • Return the Result:

      return str(result_tree)
    
    • The transformed result is serialised to a string and returned as the HTTP response. Flask sends a returned string as an HTML page, so no template rendering is needed. Passing the output through render_template_string would make Jinja interpret any {{ ... }} or {% ... %} that happens to occur in the concert data.

  • Run the Flask Application:

      if __name__ == '__main__':
          app.run(debug=True, port=8088)
    
    • The Flask application is started, listening on port 8088 with debug mode enabled.
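The home function above parses and compiles table.xsl on every request. One possible refinement, sketched here under the assumption that the stylesheet does not change while the server runs, is to cache the compiled transformer. The helper names load_transformer and render_html are introduced for this sketch and are not part of Flask or lxml.

```python
from functools import lru_cache

from lxml import etree


@lru_cache(maxsize=None)
def load_transformer(xsl_path):
    """Parse and compile the stylesheet at xsl_path; repeat calls reuse the result."""
    return etree.XSLT(etree.parse(xsl_path))


def render_html(xml_path, xsl_path="templates/table.xsl"):
    """Transform the XML file with the cached stylesheet and return the HTML text."""
    transform = load_transformer(xsl_path)
    return str(transform(etree.parse(xml_path)))
```

Inside the view, the body would then reduce to return render_html('data/concerts.xml'). If the app runs under a multi-threaded server, it is worth checking lxml's notes on threading before sharing one transformer between threads.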


Run Your Flask Application

Make sure you are in the concerts directory with the virtual environment activated, then start the application:

python app.py

Finally, open your web browser and go to http://localhost:8088 to see the transformed HTML.

You should see the NY Philharmonic concerts listed in tabular format.


Transforming XML Data to HTML Using XQuery and Flask

Below is a tutorial for creating a Flask application that uses XQuery to transform XML data and display it as an HTML table. The example assumes the same XML structure as before, with details about concerts by the NY Philharmonic.

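As before, create a project directory, this time called flask_xquery, and activate a virtual environment. lxml supports XPath and XSLT but has no XQuery engine, so this example assumes SaxonC-HE as the XQuery processor, installed through its Python package saxonche (pip install Flask saxonche). Any other XQuery 3.1 processor could be substituted. Your project directory should look something like this (static/styles.css is optional and is not used in this example):

flask_xquery/
├── app.py
├── static/
│   └── styles.css
└── data/
    └── concerts.xml

Place your concerts.xml data in the data/concerts.xml file.

Create the Flask Application

Create the app.py file to handle the route. Instead of using a separate .xq file, we will embed the XQuery directly in the Python module.

from flask import Flask
from saxonche import PySaxonProcessor  # SaxonC-HE, assumed here as the XQuery engine

app = Flask(__name__)

# Define the XQuery directly within Python.
# The query runs against the context item, which is set to concerts.xml below.
XQUERY = '''
xquery version "3.1";
declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";
declare option output:method "html";

<html>
    <head>
        <title>NY Philharmonic Concerts</title>
    </head>
    <body>
        <h1>NY Philharmonic Concerts</h1>
        <table border="1">
            <tr>
                <th>Season</th>
                <th>Orchestra</th>
                <th>Date</th>
                <th>Venue</th>
                <th>Time</th>
            </tr>
            {
                for $program in //program
                return
                    <tr>
                        <td>{ $program/season/string() }</td>
                        <td>{ $program/orchestra/string() }</td>
                        <td>{ $program/concertInfo[1]/Date/string() }</td>
                        <td>{ $program/concertInfo[1]/Venue/string() }</td>
                        <td>{ $program/concertInfo[1]/Time/string() }</td>
                    </tr>
            }
        </table>
    </body>
</html>
'''

@app.route('/')
def index():
    # Run the query with SaxonC-HE, using concerts.xml as the context item
    with PySaxonProcessor(license=False) as proc:
        xquery_processor = proc.new_xquery_processor()
        source = proc.parse_xml(xml_file_name='data/concerts.xml')
        xquery_processor.set_context(xdm_item=source)
        xquery_processor.set_query_content(XQUERY)

        # The output:method option makes Saxon serialise the result as HTML
        html_result = xquery_processor.run_query_to_string()

    return html_result

if __name__ == '__main__':
    app.run(debug=True)

XQuery Definition: The XQuery is defined as a string (XQUERY) at module level. It queries concerts.xml for concert details and formats them into an HTML table. Each cell uses string() so that only the text is inserted rather than a copy of the source element, and concertInfo[1] picks the first performance of each program, matching what the XSLT version displays. The <results> wrapper is not needed, because the query returns the <html> element directly.

Execution: lxml cannot execute XQuery, so the query is run by SaxonC-HE. The XML file is parsed and set as the context item, the query string is compiled and run, and the serialised HTML (html_result) is returned to Flask. Treat the saxonche calls as a sketch and check them against the SaxonC documentation for your installed version.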
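To see what the for ... return expression in the query does, it may help to read it next to a plain Python loop. The sketch below is not XQuery and does not use an XQuery engine. It is an analogue written with the standard library, and the program_rows name is introduced here for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def program_rows(root):
    """Python analogue of: for $program in //program return <tr>...</tr>"""
    rows = []
    for program in root.iter("program"):  # for $program in //program
        cells = [
            program.findtext("season", default=""),
            program.findtext("orchestra", default=""),
            # findtext returns the first match, like concertInfo[1] in XPath
            program.findtext("concertInfo/Date", default=""),
            program.findtext("concertInfo/Venue", default=""),
            program.findtext("concertInfo/Time", default=""),
        ]
        # return <tr><td>...</td>...</tr>, escaping text for safe HTML output
        rows.append("<tr>" + "".join(f"<td>{escape(c)}</td>" for c in cells) + "</tr>")
    return rows
```

The XQuery version expresses the same iteration declaratively and builds real XML nodes instead of strings, so escaping is handled by the serialiser rather than by hand.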

Run the Flask Application

Run your Flask application from the terminal:

python app.py

Resources