This article assumes basic knowledge of XML, of XML parsing with SAX and DOM (experience with a Java implementation such as Apache Xerces is helpful), and of the OpenDocument Format, which plays an important role throughout.

Download the Source

150k JAR file, source files (useful in Eclipse/NetBeans)

Implementing an XML to ODT Converter


Converting XML documents to another file format is most often done with XSLT, a declarative transformation language. This article shows an alternative way to accomplish such a task.

The underlying XML format is the so-called "M_doc" XML schema, which is essentially a mixture of some well-known HTML tags and LaTeX constructs expressed in XML syntax. The target format is the OpenDocument Format, which OpenOffice.org uses to save documents. We will use a SAX parser to read the input document and ODFDOM, an open-source API written in Java, to write the output text document (file extension: odt).
The source code of the resulting program, mdoc2odf, can be downloaded as an attachment here.

The first thing to do when implementing a converter is to study both the source file format and the target file format. In this case the input format, M_doc, is described syntactically by a slim document type definition (DTD). The semantics can be derived from the names of the elements and attributes, as they are inspired by HTML and LaTeX.
The OpenDocument Format is defined by a huge technical specification (can be found here). Given its size, reading some tutorials and wiki entries is the more pragmatic approach. Writing sample files with OOWriter and examining their contents also helps a lot.

The next step is to think about the general program design. XML files can be parsed with either DOM or SAX. We choose SAX because it does not build the whole document tree in memory, so it uses less memory than DOM and is usually faster, although it comes with a few inconveniences.
Using a SAX parser in a Java program is quite simple. The Java API provides an interface named org.xml.sax.ContentHandler that needs to be implemented and handed to the parser instance. SAX is event-driven: the parser reads the XML file sequentially and fires an event each time it encounters a piece of markup or text. For instance, whenever a start tag is read, the element's name and its attributes are passed to the ContentHandler by invoking the corresponding method you implemented. For more information, please refer to the numerous SAX parsing tutorials.
Our ContentHandler, which we call ODFConverter, is the core class of our program. Besides implementing the SAX notification methods, it also has the methods that begin the conversion process (i.e. that convert a concrete M_doc XML document). It is important to note that the SAX parser plays the active role while our ContentHandler plays a passive one. This means we cannot directly influence the program flow. The flow is dictated by the parser, according to the Hollywood Principle (don't call us, we'll call you).
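
To make the event-driven flow concrete, here is a minimal sketch of a handler that only records the callbacks it receives. The class name SaxEventDemo and the helper record are made up for this illustration and are not part of mdoc2odf; the parser and handler classes are the standard JDK ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of mdoc2odf.
class SaxEventDemo {

    /** Records every SAX callback as a short string. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Ignore whitespace-only text to keep the log short.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) {
                events.add("text:" + text);
            }
        }
    }

    /** Parses the given XML string and returns the recorded events. */
    static List<String> record(String xml) {
        RecordingHandler handler = new RecordingHandler();
        try {
            // The parser drives the handler; we only hand it over and wait.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        System.out.println(record("<ul><li>one</li><li>two</li></ul>"));
    }
}
```

Note that the handler never asks the parser for the next element. It is called once per event, in document order, which is exactly the passive role described above.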

The main task is to decide which kind of ODF element to add to the output document, based on the name of the current XML element in the input document. We get this name in the method that handles the start-element event. A naive approach would be a very long if-else cascade of string comparisons, which is hard to maintain and not very efficient. It makes more sense to set up a repository of all recognized elements. This could be done with some kind of lookup table, but we will use Java's reflection API instead. ODFConverter delegates the creation of ODF content to other modules, so it also serves as a dispatcher. For each M_doc XML element we implement a method whose name is composed of a prefix ("start" or "end") and the element name. For example, parsing the start tag <table> results in a call to the method startTable. The called method then dispatches to an underlying module, in this example the module responsible for creating ODF-compliant tables.

ODFConverter (extends org.xml.sax.helpers.DefaultHandler)
@Override
public void startElement(String uri, String localName,
			 String qName, Attributes attributes) throws SAXException {
	xPath += "/" + qName;

	Class<? extends ODFConverter> clazz = this.getClass();

	String methodName = "start" + nameToMethod(qName);
	// "element" -> "startElement"

	try {
		Method method = clazz.getMethod(methodName, Attributes.class);
		method.invoke(this, attributes);
	} catch (NoSuchMethodException e) {
		// not all elements need to be recognized.
	} ...
}

ODFConverter has reference fields to several delegation modules. The method invoked via reflection delegates to such a module.
Let's say the SAX-parser finds the following element: <link url="http://www.someurl.org"> ...
It will then call the start-element notification method (illustrated above) which leads to invocation of the method startLink:

ODFConverter.java
	public void startLink(Attributes attributes) {
		getContentWriter().newLink(attributes.getValue("url"), attributes.getValue("target"));
	}

ContentWriter is one of the delegation modules mentioned above. Among other things, it adds ODF-compliant hyperlinks to the odt document. This document is an object obtained from the ODFDOM API, held by ODFConverter and passed to the delegation modules on initialization.
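
The dispatch mechanism can be tried in isolation. The following is only a sketch: the excerpts above do not show nameToMethod, so the version here (capitalize the first letter) is an assumption, and the class ReflectiveDispatchDemo with its calls list is invented for the example. It records the call instead of delegating to a real module.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class, not part of mdoc2odf.
class ReflectiveDispatchDemo {

    /** Log of handler calls, standing in for real delegation modules. */
    final List<String> calls = new ArrayList<>();

    /** Assumed behaviour of nameToMethod: "link" -> "Link". */
    static String nameToMethod(String elementName) {
        return Character.toUpperCase(elementName.charAt(0)) + elementName.substring(1);
    }

    // Handler methods must be public: Class.getMethod only finds public methods.
    public void startLink(Attributes attributes) {
        calls.add("startLink(" + attributes.getValue("url") + ")");
    }

    /** Returns true if a handler method exists for the element and was invoked. */
    boolean dispatch(String prefix, String elementName, Attributes attributes) {
        String methodName = prefix + nameToMethod(elementName);
        try {
            Method method = getClass().getMethod(methodName, Attributes.class);
            method.invoke(this, attributes);
            return true;
        } catch (NoSuchMethodException e) {
            return false; // not all elements need to be recognized
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException("handler " + methodName + " failed", e);
        }
    }

    public static void main(String[] args) {
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "url", "url", "CDATA", "http://www.someurl.org");

        ReflectiveDispatchDemo demo = new ReflectiveDispatchDemo();
        System.out.println(demo.dispatch("start", "link", attrs));    // handler exists
        System.out.println(demo.dispatch("start", "unknown", attrs)); // silently skipped
        System.out.println(demo.calls);
    }
}
```

Adding support for a new element then means adding one public method; the dispatcher itself stays untouched. If the repeated lookups ever became a concern, the Method objects could be cached in a map keyed by method name.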

A disadvantage of SAX parsing is that the parser hands us no context. At any point during parsing we only know which element or which text we are currently reading. We do not know the parent or child elements; in other words, we have no random access. To compensate for this a little, we keep track of the context in a string variable that stores the current path (xPath in the code above) and is updated whenever an XML element is opened or closed.
We also need to remember where we last inserted ODF content in the output document: a cursor. This is a pointer to a node in the output DOM object, i.e. an org.w3c.dom.Node. When a delegation module is instructed to add a particular ODF element to the output document, it still has to decide how exactly to insert it, which depends on the kind of element the cursor points at. The ODFDOM API contains a class for every XML element defined in the ODF specification, so we can use the instanceof operator. We have to make this distinction because, for example, the tags required for an image need something like a paragraph as their parent element and thus cannot be added blindly at the current position. If the last element is a list item, we have to insert a paragraph first and then the image. Similarly, when adding plain text, we need to check whether the element at the cursor is allowed to have a text node as a child. A typical insert operation looks like this:

Insert operation (delegation module: ImageHandler)
public void appendImage(OdfDrawFrame imageFrame) {
	OdfElement node = (OdfElement) getCursor();

	if (canHaveDrawFrame(node)) {
		node.appendChild(imageFrame);
	} else {
		appendAsParagraphChild(node, imageFrame);
	}
}

The boolean method canHaveDrawFrame(Node) uses the instanceof operator to check whether the cursor is an ODF element that is allowed to have a <draw:frame> as a child. If it is not, the frame is attached to a paragraph instead.
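
The context tracking mentioned earlier (a path string that grows on every start tag and shrinks on every end tag) can be sketched as follows. The class PathTrackingDemo and its seen list are invented for this illustration; only the idea of appending "/" plus the element name is taken from the startElement excerpt above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of mdoc2odf.
class PathTrackingDemo extends DefaultHandler {

    /** Current position in the input document, e.g. "/doc/ul/li". */
    private final StringBuilder path = new StringBuilder();

    /** Every path that was entered, in document order. */
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        path.append('/').append(qName);   // descend one level
        seen.add(path.toString());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        path.setLength(path.lastIndexOf("/")); // cut off the last segment
    }

    /** Parses the XML string and returns the list of paths entered. */
    static List<String> pathsOf(String xml) {
        PathTrackingDemo handler = new PathTrackingDemo();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return handler.seen;
    }

    public static void main(String[] args) {
        System.out.println(pathsOf("<doc><ul><li/></ul><p/></doc>"));
    }
}
```

With such a path at hand, a handler method can ask questions like "am I inside a list item?" with a simple string test, even though SAX itself offers no access to the parent element.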

Some implementation details

The following section shows some details of this specific implementation and serves only as an illustration.

Tables & Lists

Using the cursor to append content is suitable in most cases. For structures like tables and lists, however, good old stacks are a better and more reliable technique. A list consists of a root list element (e.g. <ol>, <ul>) and several list-item elements (<li>), and lists can be nested (in principle to any depth, but limited to a depth of 10 by the ODF specification). Every time a new list is started we push it onto a stack, and we pop it when the list is closed. We do the same with the list items, so we have two stacks: one for the root list elements and one for the items. The well-formedness of the XML input guarantees that we never pop from an empty stack. The size of the list stack always tells us the current list level. When creating a new list, we check the list-item stack. If it is empty, we use the cursor to append the list; otherwise we append the list to the top element of the item stack. New items are appended to the top element of the list stack (which cannot be empty, since a list item can only occur inside a list).
The same procedure can be applied to tables as well; we just need one additional stack, giving one each for the <table>, <tr> and <td> elements.
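
The two-stack bookkeeping for lists could look roughly like this. It is a sketch under simplifying assumptions: the tiny Node class stands in for the ODFDOM element classes, and all names (ListStackDemo, startList, startItem and so on) are made up for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical demo class, not part of mdoc2odf.
class ListStackDemo {

    /** Minimal stand-in for an output element (ODFDOM is not used here). */
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name) { this.name = name; }

        Node append(Node child) {
            children.add(child);
            return child;
        }
    }

    private final Node cursor;                         // where top-level lists go
    private final Deque<Node> lists = new ArrayDeque<>(); // open list elements
    private final Deque<Node> items = new ArrayDeque<>(); // open list items

    ListStackDemo(Node cursor) { this.cursor = cursor; }

    void startList() {
        // Top-level lists go to the cursor, nested lists into the open item.
        Node parent = items.isEmpty() ? cursor : items.peek();
        lists.push(parent.append(new Node("text:list")));
    }

    void endList() { lists.pop(); }

    void startItem() {
        // An item always belongs to the innermost open list.
        items.push(lists.peek().append(new Node("text:list-item")));
    }

    void endItem() { items.pop(); }

    /** Current nesting depth; 0 means we are outside any list. */
    int level() { return lists.size(); }

    public static void main(String[] args) {
        Node root = new Node("office:text");
        ListStackDemo demo = new ListStackDemo(root);
        demo.startList();   // <ul>
        demo.startItem();   //   <li>
        demo.startList();   //     <ul>
        System.out.println("level inside nested list: " + demo.level());
        demo.endList();
        demo.endItem();
        demo.endList();
        System.out.println("level after closing: " + demo.level());
    }
}
```

The methods map one-to-one onto the start and end events for list and list-item elements, so the stacks grow and shrink in step with the input document.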

Document links

If you are familiar with LaTeX, you certainly know labels (\label{abc}) and references to these labels (\ref{abc}). Setting a label somewhere in the tex document allows you to reference the chapter number from somewhere else, and including the hyperref package automatically makes the reference clickable, so the reader can jump to the referenced chapter. The ODF specification has an equivalent feature (bookmarks and bookmark references) that we want to use.
LaTeX processors write auxiliary files to resolve references. A reference to a label that occurs later in the document is therefore undefined on the first run, and you need to run the processor once more. The second run finds the label positions written down during the first run.
However, our conversion process differs from the LaTeX compilation process. The source file is read exactly once, and we do not want to change that. We therefore need to remember every unresolved reference until its label is defined. Associative arrays (java.util.Map with String as key type) are well suited for this job. Each reference is put into the map, keyed by the name of the referenced label. As soon as the corresponding label is parsed, the pending bookmark references are fetched from the map (map.get(labelstring)) and updated with the actual chapter number.
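
A single-pass resolver along these lines might look like the sketch below. All names are hypothetical (ReferenceResolver, BookmarkRef, the "??" placeholder), and two details go beyond the description above and are assumptions: a second map remembers labels that are already defined, so that backward references resolve immediately, and pending entries are removed from the map once resolved rather than only read with get.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of mdoc2odf.
class ReferenceResolver {

    /** Stand-in for a bookmark reference already placed in the output document. */
    static class BookmarkRef {
        final String label;
        String text = "??"; // placeholder until the label is known

        BookmarkRef(String label) { this.label = label; }
    }

    /** Labels seen so far, mapped to their chapter number (an added assumption). */
    private final Map<String, String> defined = new HashMap<>();

    /** References waiting for a label that has not been parsed yet. */
    private final Map<String, List<BookmarkRef>> pending = new HashMap<>();

    /** Called when a reference to a label is parsed. */
    BookmarkRef reference(String label) {
        BookmarkRef ref = new BookmarkRef(label);
        String number = defined.get(label);
        if (number != null) {
            ref.text = number; // backward reference: resolve right away
        } else {
            pending.computeIfAbsent(label, k -> new ArrayList<>()).add(ref);
        }
        return ref;
    }

    /** Called when a label definition is parsed. */
    void define(String label, String chapterNumber) {
        defined.put(label, chapterNumber);
        List<BookmarkRef> waiting = pending.remove(label);
        if (waiting != null) {
            for (BookmarkRef ref : waiting) {
                ref.text = chapterNumber; // forward reference: patch it now
            }
        }
    }

    /** Labels that were referenced but never defined. */
    int unresolvedLabels() { return pending.size(); }

    public static void main(String[] args) {
        ReferenceResolver resolver = new ReferenceResolver();
        BookmarkRef forward = resolver.reference("abc"); // label not yet seen
        resolver.define("abc", "3.2");
        BookmarkRef backward = resolver.reference("abc");
        System.out.println(forward.text + " " + backward.text);
    }
}
```

Because the reference objects are already part of the output document, patching them in place is enough; the input never has to be read a second time. Whatever is still pending at the end of the document points to a label that does not exist.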

Compatibility issues

Converting between two document formats that are not fully compatible is likely to cause problems and force the developer to build workarounds. A good example: the ODF specification does not support tables inside lists, whereas both HTML and LaTeX allow them. Consider the following M_doc example:

Example: table inside list
<ul>
  <li>
    list item
  </li>
  <li>
    <table>
      <tr>
        <td>...</td>
      </tr>
    </table>
  </li>
  ...
</ul>

Before I discovered this peculiarity, the program produced the XML code one would expect:

produced output ODT (content.xml)
<text:list>
  <text:list-item>
    <text:p>list item</text:p>
  </text:list-item>
  <text:list-item>
    <table:table>
      <table:table-column/>
      <table:table-row>
        <table:table-cell>...</table:table-cell>
      </table:table-row>
    </table:table>
  </text:list-item>
  ...
</text:list>

However, opening this file reveals that the table is missing. OpenOffice.org ignores it, since the ODF specification does not allow such a structure, for whatever reason. So, what can we do? Let's see what OOWriter does when we write a list with some items and then try to place a table inside it: OOWriter interrupts the list and resumes it after the table. Fortunately, all we need to do is close the list, insert the table, create a new list, and set the XML attribute continue-numbering to true on the second list.
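
As a rough sketch of this workaround, the following code operates on a plain org.w3c.dom document rather than on the ODFDOM classes the real program uses. The class and method names are invented, the qualified names are created without namespace handling for brevity, and only a non-nested list is considered.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of mdoc2odf.
class ListSplitDemo {

    /** Creates an empty DOM document; a stand-in for the document obtained from ODFDOM. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * Places the table directly after the open list and returns a fresh list
     * that continues the numbering. Later list items go into the returned list.
     */
    static Element interruptListWithTable(Element openList, Element table) {
        Node parent = openList.getParentNode();

        // 1. "Close" the list: the table becomes its next sibling, not its child.
        //    insertBefore with a null reference node appends at the end.
        parent.insertBefore(table, openList.getNextSibling());

        // 2. Open a second list that resumes the first one.
        Element continued = openList.getOwnerDocument().createElement("text:list");
        continued.setAttribute("text:continue-numbering", "true");
        parent.insertBefore(continued, table.getNextSibling());
        return continued;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element body = doc.createElement("office:text");
        doc.appendChild(body);
        Element list = (Element) body.appendChild(doc.createElement("text:list"));

        Element second = interruptListWithTable(list, doc.createElement("table:table"));

        // The body now holds: first list, table, continued list.
        for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
            System.out.println(n.getNodeName());
        }
        System.out.println(second.getAttribute("text:continue-numbering"));
    }
}
```

In the stack-based design described earlier, the returned element would also have to replace the old list on top of the list stack so that subsequent items land in the continued list. A table inside a nested list would need every open level closed and reopened, which this sketch does not attempt.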

