Correctness Levels of XML

An XML document has two correctness levels:

1. Well-formed. A well-formed document conforms to the XML syntax rules; for example, a document in which a start-tag (<name>) appears without a corresponding end-tag (</name>) is not well-formed. A document that is not well-formed is not XML, and a conforming parser is disallowed from processing it.

2. Valid. A valid document is well-formed and additionally conforms to a set of semantic rules, either user-defined or expressed in an XML schema language, most commonly a DTD; for example, a document that contains an undefined element is not valid. A validating parser reports such a document as invalid instead of accepting it.

1. Well-formed:

XML with correct syntax is “Well-Formed” XML.

If only well-formedness is required, XML is a generic framework for storing any amount of text or any data whose structure can be represented as a tree. The most basic syntactical requirement is that the document has exactly one root element (also known as the document element); that is, the text must be enclosed between a root start-tag and a corresponding end-tag. The following is a minimal well-formed XML document:

<book>This is a book… </book>

The root element can be preceded by an XML declaration stating which XML version is in use (normally 1.0); the declaration may also contain character encoding and external dependency information. The declaration is optional in XML 1.0 but mandatory from XML 1.1 onward, because a document without an XML declaration is assumed to be a version 1.0 document.

<?xml version="1.0" encoding="UTF-8"?>

The specification requires that XML processors support the Unicode character encodings UTF-8 and UTF-16 (support for UTF-32 is not mandatory).
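As a minimal sketch of the encoding requirement, the following Python snippet stores the same document as UTF-8 and as UTF-16 bytes and parses both with the standard library's xml.etree.ElementTree. The sample text and the helper name parse_with_encoding are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document containing one non-ASCII character (U+00E9).
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><book>Caf\u00e9</book>'


def parse_with_encoding(enc: str) -> str:
    """Encode the sample document as `enc` and return the root element's text."""
    raw = TEMPLATE.format(enc=enc).encode(enc)  # bytes, as they would be stored on disk
    return ET.fromstring(raw).text


utf8_text = parse_with_encoding("UTF-8")
utf16_text = parse_with_encoding("UTF-16")  # Python prepends a byte-order mark here

# Both byte sequences decode to the same character data.
print(utf8_text == utf16_text)
```

The declared encoding has to match the bytes actually stored; the sketch keeps the two in step by formatting the declaration and encoding the bytes from the same name.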

Comments can be placed anywhere in the tree, including in the text if the content of the element is text or #PCDATA. XML comments start with <!-- and end with -->. Two consecutive dashes (--) may not appear anywhere in the text of the comment.

<!-- This is a comment. -->
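The rule about consecutive dashes can be checked with a parser. This is a small sketch using Python's standard xml.etree.ElementTree; the helper name parses and both sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A comment placed inside the text content of the root element.
ok_comment = "<book><!-- This is a comment. -->This is a book</book>"

# The same document with two consecutive dashes inside the comment text.
bad_comment = "<book><!-- two dashes -- inside -->This is a book</book>"

print(parses(ok_comment), parses(bad_comment))
```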

In any meaningful application, additional markup is used to structure the contents of the XML document. The text enclosed by the root tags may contain an arbitrary number of XML elements. The basic syntax for one element is:

<element_name attribute_name="attribute_value">Element Content</element_name>
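To show how the three parts of this syntax surface in a parser, here is a short sketch that reads the element above with Python's xml.etree.ElementTree. The names are taken directly from the syntax template; nothing else is assumed.

```python
import xml.etree.ElementTree as ET

# The generic element from the syntax template, written with straight quotes.
element = ET.fromstring(
    '<element_name attribute_name="attribute_value">Element Content</element_name>'
)

print(element.tag)     # the element name
print(element.attrib)  # a dict mapping attribute names to attribute values
print(element.text)    # the character data between the start-tag and end-tag
```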

Therefore a “well-formed” XML document has correct XML syntax. The syntax rules are summarized below:

  • XML documents must have a root element.
  • XML elements must have a closing tag.
  • XML tags are case sensitive.
  • XML elements must be properly nested.
  • XML attribute values must be quoted.
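The rules above can be exercised with any conforming parser. The sketch below uses Python's xml.etree.ElementTree, which raises ParseError for input that is not well-formed; the helper name is_well_formed and the sample strings are assumptions chosen to break one rule each.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# One sample per rule; only the first one obeys all of them.
SAMPLES = {
    "single root": "<book>This is a book</book>",
    "two roots": "<book/><book/>",
    "missing end-tag": "<book><title>XML</book>",
    "case mismatch": "<book></Book>",
    "bad nesting": "<book><b><i>text</b></i></book>",
    "unquoted attribute": "<book id=1/>",
}

results = {name: is_well_formed(text) for name, text in SAMPLES.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'not well-formed'}")
```

Only the "single root" sample is accepted; each of the others breaks exactly one of the listed rules.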

 

2. Valid:

XML validated against a DTD is “valid” XML.

An XML document that complies with a particular schema or DTD, in addition to being well-formed, is said to be valid. Before the advent of generalized data description languages such as SGML and XML, software designers had to define special file formats or small languages to share data between programs. This required writing detailed specifications and special-purpose parsers and writers.

XML’s regular structure and strict parsing rules allow software designers to leave parsing to standard tools. Because XML provides a general, data model-oriented framework for developing application-specific languages, designers need only concentrate on the rules for their own data, at a relatively high level of abstraction.

Well-tested tools exist to validate an XML document “against” a schema: the tool automatically verifies whether the document conforms to the constraints expressed in the schema. Some of these validation tools are included in XML parsers, and some are packaged separately. Schemas have other uses as well: XML editors, for instance, can use them to support the editing process (by suggesting valid element and attribute names, etc.).
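Python's standard library parsers are non-validating, so the following is only a toy stand-in for one kind of validity constraint, the "undefined element" case mentioned earlier. It is not DTD or schema validation: it merely compares element names against a hypothetical set of declared names (DECLARED), and the function name undeclared_elements is likewise an assumption. A real validator also checks content models, attributes, and more.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary standing in for the elements a DTD would declare.
DECLARED = {"book", "title", "author"}


def undeclared_elements(xml_text: str, declared: set) -> list:
    """Return the sorted names of elements that are not in `declared`.

    The text must already be well-formed; otherwise ParseError is raised,
    which mirrors the fact that validity presupposes well-formedness.
    """
    root = ET.fromstring(xml_text)
    used = {element.tag for element in root.iter()}  # includes the root itself
    return sorted(used - set(declared))


conforming = "<book><title>XML</title><author>A. Writer</author></book>"
nonconforming = "<book><title>XML</title><publisher>P</publisher></book>"

print(undeclared_elements(conforming, DECLARED))
print(undeclared_elements(nonconforming, DECLARED))
```

Both sample documents are well-formed; only the second uses an element outside the declared vocabulary, which is the distinction between the two correctness levels.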

Displaying on the web

Generic XML documents generally do not carry information about how to display the data. Without CSS or XSLT (Extensible Stylesheet Language Transformations), most web browsers render a generic XML document as raw XML text. Some display it with ‘handles’ (e.g. + and – signs in the margin) that allow parts of the structure to be expanded or collapsed with mouse-clicks.

In order to style the rendering in a browser with CSS, the XML document must include a reference to the stylesheet:

<?xml-stylesheet type="text/css" href="myStyleSheet.css"?>

Note that this is different from specifying such a stylesheet in HTML, which uses the <link> element.
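The stylesheet reference is a processing instruction, not an element, so it sits outside the root element. As a small sketch, the snippet below uses Python's xml.dom.minidom to list such instructions; the sample document and the helper name stylesheet_instructions are assumptions for illustration.

```python
from xml.dom import minidom

# Hypothetical document that references a CSS stylesheet.
DOC = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<?xml-stylesheet type="text/css" href="myStyleSheet.css"?>'
    '<book>This is a book</book>'
)


def stylesheet_instructions(xml_text: str) -> list:
    """Return the data of every top-level xml-stylesheet processing instruction."""
    document = minidom.parseString(xml_text)
    return [
        node.data
        for node in document.childNodes  # siblings of the root element
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        and node.target == "xml-stylesheet"
    ]


print(stylesheet_instructions(DOC))
```

The type and href values look like attributes, but the parser hands them over as a single data string; interpreting them is left to the application, here the browser.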

XSLT (XSL Transformations) can be used to alter the format of XML data, either into HTML or other formats that are suitable for a browser to display.

To specify client-side XSLT, the following processing instruction is required in the XML:

<?xml-stylesheet type="text/xml" href="myTransform.xslt"?>

Client-side XSLT is supported by many web browsers. Alternatively, XSLT can be applied on the server to convert the XML into a displayable format, which removes the dependence on the end-user’s browser capabilities. The end-user is not aware of what has gone on ‘behind the scenes’; all they see is well-formatted, displayable data.