XML Parsers

This tutorial describes the basic concepts of XML parsing.

All modern browsers have a built-in XML parser that can be used to read and manipulate XML. The traditional techniques for parsing XML files are:

         1. Simple API for XML (SAX)

         2. Document Object Model (DOM)

We use Java to demonstrate the SAX and DOM APIs. Each is described in turn below.

1. SAX Parser:

SAX is an event-driven, serial-access parser API for XML that provides a mechanism for reading data from an XML document. It is a popular alternative to the Document Object Model (DOM). SAX parses an XML file step by step, which makes it well suited to large XML files. The parser fires an event whenever it encounters an opening tag (with its attributes), a closing tag, or character data, and the application handles each event as it arrives. SAX is recommended for parsing large XML files in Java because it does not need to load the whole file into memory; it reads the file in small parts. Java provides support for SAX, so any XML file can be parsed in Java with a SAX parser.
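
To make the event sequence concrete, here is a minimal sketch that records the events a SAX parser fires for a small in-memory document. It assumes the JAXP parser bundled with the JDK rather than any particular parser library; the class name SaxEventTrace and the helper trace() are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of any library.
public class SaxEventTrace {

    // Parses the XML text and returns the SAX events in the order they fired.
    static List<String> trace(String xml) throws Exception {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                events.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver long text in several chunks; short text usually arrives in one.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text:" + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<servlet><servlet-name>Example</servlet-name></servlet>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

The handler never sees the document as a whole. It only sees one event at a time, in document order, which is why memory use stays small regardless of file size.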

2. DOM Parser:

A DOM parser loads the entire XML file into memory and builds a tree structure representing the document. DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded and constructed as a tree of objects before access is allowed. If sufficient memory is available, DOM is often the better choice, because once the tree is built any node can be accessed directly and repeatedly without re-parsing. For small and medium-sized XML documents that need this kind of random access, DOM is therefore usually more convenient, and often faster overall, than SAX.

DOM is a cross-platform and language-independent convention for representing and interacting with objects in HTML, XHTML and XML documents. Objects under the DOM (also sometimes called “elements”) may be specified and addressed according to the syntax and rules of the programming language used to manipulate them. The rules for programming and interacting with the DOM are specified in the DOM API.
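
As an illustration of working with the in-memory tree, the following sketch parses a small document and then looks up child elements of each servlet node. It assumes the JAXP DocumentBuilder bundled with the JDK; the class name DomTreeWalk and the helper servletClasses() are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
public class DomTreeWalk {

    // Builds the whole tree in memory, then reads servlet-name -> servlet-class pairs from it.
    static Map<String, String> servletClasses(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        Map<String, String> result = new LinkedHashMap<>();
        NodeList servlets = doc.getElementsByTagName("servlet");
        for (int i = 0; i < servlets.getLength(); i++) {
            Element servlet = (Element) servlets.item(i);
            // Any part of the tree can be queried at any time, in any order.
            String name = servlet.getElementsByTagName("servlet-name").item(0).getTextContent();
            String cls = servlet.getElementsByTagName("servlet-class").item(0).getTextContent();
            result.put(name.trim(), cls.trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<web-app>"
                + "<servlet><servlet-name>Example</servlet-name>"
                + "<servlet-class>LoginServlet</servlet-class></servlet>"
                + "<servlet><servlet-name>Another</servlet-name>"
                + "<servlet-class>AnotherServlet</servlet-class></servlet>"
                + "</web-app>";
        System.out.println(servletClasses(xml)); // prints {Example=LoginServlet, Another=AnotherServlet}
    }
}
```

Because the tree stays in memory, the code can return to any node as often as it likes, at the cost of holding the full document.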

Differences between SAX and DOM

Both SAX and DOM are used to parse XML documents. Each has advantages and disadvantages, and the choice between them depends on the situation:

TOPIC                 SAX Parser                            DOM Parser
1) Abbreviation       SAX stands for Simple API for XML     DOM stands for Document Object Model
2) Type               Event-based, serial-access parser     Loads the entire document into memory and keeps it as a tree
3) Size of document   Good choice for larger files          Good for small and medium-sized files
4) Load               Does not load the entire document     Loads the entire document into memory
5) Suitable           Suitable for larger XML documents     Better suited to smaller documents when memory is sufficient

Practical Examples using SAX and DOM Parser

Let us consider the following sample web.xml file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app>
    <servlet>
        <servlet-name>Example</servlet-name>
        <servlet-class>LoginServlet</servlet-class>
    </servlet>
    <servlet-mapping>
        <servlet-name>Example</servlet-name>
        <url-pattern>/ExampleTest</url-pattern>
    </servlet-mapping>
    <servlet>
        <servlet-name>Another</servlet-name>
        <servlet-class>AnotherServlet</servlet-class>
    </servlet>
    <servlet-mapping>
        <servlet-name>Another</servlet-name>
        <url-pattern>/AnotherTest</url-pattern>
    </servlet-mapping>
</web-app>

Now we can parse the above XML file with both the SAX and DOM parsers to count the number of <servlet> tags, using the Apache parser. The Apache parser version 1.2.3 (commonly known as Xerces) is an open-source effort based on IBM’s XML4J parser. Xerces has full support for the W3C Document Object Model (DOM) Level 1 and the Simple API for XML (SAX) 1.0 and 2.0; however, this version has only limited support for XML Schemas and DOM Level 2 (version 1). We have to add the xerces.jar file to our CLASSPATH to use the parser.

The following example shows a minimal program that counts the number of <servlet> tags in an XML file using the DOM. The second import line specifically refers to the Xerces parser. The main method creates a new DOMParser instance and then invokes its parse() method. If the parse operation succeeds, we can retrieve a Document object through which we can access and manipulate the DOM tree using standard DOM API calls. This simple example retrieves the “servlet” nodes and prints out the number of nodes retrieved.

The example code is placed in the xml directory under C:\ (that is, C:\xml).

import org.w3c.dom.*;
import org.apache.xerces.parsers.DOMParser;

public class DOM {
    public static void main(String[] args) {
        try {
            // Xerces-specific parser; parse() reads the file named on the command line
            DOMParser parser = new DOMParser();
            parser.parse(args[0]);
            // After a successful parse, the document is available as a standard DOM tree
            Document doc = parser.getDocument();
            NodeList nodes = doc.getElementsByTagName("servlet");
            System.out.println("There are " + nodes.getLength() + " <servlet> elements.");
        }
        catch (Exception ex) {
            System.out.println(ex);
        }
    }
}

We can use SAX to accomplish the same task. SAX is event-oriented. In the following example, the SAX class inherits from DefaultHandler, which has default implementations for all the SAX event handlers, and overrides two methods: startElement() and endDocument(). The parser calls the startElement() method each time it encounters a new element in the XML file. In the overridden startElement() method, the code checks for the “servlet” tag and increments the tagCount counter variable.

When the parser reaches the end of the XML file, it calls the endDocument() method. The code prints out the counter variable at that point. In the main() method, we set the ContentHandler and the ErrorHandler properties of the SAXParser instance, and then use the parse() method to start the actual parsing.

import org.xml.sax.*;
import org.xml.sax.helpers.DefaultHandler;
import org.apache.xerces.parsers.SAXParser;

public class SAX extends DefaultHandler {
    int tagCount = 0;

    // Called by the parser for every opening tag
    public void startElement(String uri, String localName, String rawName, Attributes attributes) {
        if (rawName.equals("servlet")) {
            tagCount++;
        }
    }

    // Called once, when the parser reaches the end of the document
    public void endDocument() {
        System.out.println("There are " + tagCount + " <servlet> elements.");
    }

    public static void main(String[] args) {
        try {
            SAX handler = new SAX();
            SAXParser parser = new SAXParser();
            // The same object receives both content events and error callbacks
            parser.setContentHandler(handler);
            parser.setErrorHandler(handler);
            parser.parse(args[0]);
        }
        catch (Exception ex) {
            System.out.println(ex);
        }
    }
}

First of all, we have to place the file xerces.jar into our working directory (C:\xml). Then we have to set our CLASSPATH variable so that all the classes and interfaces are available to our Java programs. Now we can compile and run the above programs as follows:

Output

C:\xml>set classpath=C:\xml\xerces.jar;

C:\xml>javac *.java

C:\xml>java DOM web.xml
There are 2 <servlet> elements.

C:\xml>java SAX web.xml
There are 2 <servlet> elements.
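
The two programs above depend on Xerces-specific classes. The JDK also bundles a JAXP implementation, so the same count can be obtained without adding xerces.jar to the CLASSPATH. The following is a minimal sketch under that assumption, not a replacement for the Xerces examples; the class name JaxpCount and the helpers countWithDom() and countWithSax() are hypothetical names chosen for illustration.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class: counts elements with the parsers that ship with the JDK.
public class JaxpCount {

    // DOM approach: build the full tree, then ask it for matching elements.
    static int countWithDom(File file, String tag) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
        return doc.getElementsByTagName(tag).getLength();
    }

    // SAX approach: count matching start-tag events while the file streams past.
    static int countWithSax(File file, String tag) throws Exception {
        int[] count = {0}; // one-element array so the anonymous handler can update it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals(tag)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(file, handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        File file = new File(args[0]); // for example, web.xml
        System.out.println("DOM: " + countWithDom(file, "servlet") + " <servlet> elements.");
        System.out.println("SAX: " + countWithSax(file, "servlet") + " <servlet> elements.");
    }
}
```

Both methods should report the same number for a well-formed file; the difference lies only in whether the whole document is held in memory while counting.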