Chapter 1. The OpenOffice.org File Format

In this chapter, we will discuss not only the “what” of the OpenOffice.org file format, but also the “why.” Thus, this chapter is as much evangelism as explanation.

Before we can talk about OpenOffice.org, we have to look at the current state of proprietary office suites and applications. In this world, all your documents are stored in a proprietary (often binary) format. As long as you stay within the office suite, this is not a problem. You can transfer data from one part of the suite to another; you can transfer text from the word processor to a presentation, or you can grab a set of numbers from the spreadsheet and convert it to a table in your word processing document.

The problems begin when you want to do a transfer that wasn’t intended by the authors of the office suite. Because the internal structure of the data is unknown to you, you can’t write a program that creates a new word processing document consisting of all the headings from a different document. If you need to do something that wasn’t provided by the software vendor, or if you must process the data with an application external to the office suite, you will have to convert that data to some neutral or “universal” format such as Rich Text Format (RTF) or comma-separated values (CSV) for import into the other applications. You have to rely on the kindness of strangers to include these conversions in the first place. Furthermore, some conversions can result in loss of formatting information that was stored with your data.

Note also that your data can become inaccessible when the software vendor moves to a new internal format and stops supporting your current version. (Some people actually suggest that this is not cause for complaint since, by putting your data into the vendor’s proprietary format, the vendor has now become a co-owner of your data. This is, and I mean this in the nicest possible way, a dangerously idiotic idea.)

Although the XML file format is human-readable, it is fairly verbose. To save space, OpenOffice.org files are stored in JAR (Java Archive) format. A JAR file is a compressed ZIP file that has an additional “manifest” file that lists the contents of the archive. Since all JAR files are also ZIP files, you may use any ZIP file tool to unpack the OpenOffice.org document and read the XML directly. Figure 1.1, “Text Document” shows a short word processing document, which we have saved with the name firstdoc.sxw.
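Because the document is an ordinary ZIP archive, any language with a ZIP library can look inside it. The following is a minimal Python sketch using the standard zipfile module. The helper name list_entries is our own, and the archive it inspects is a tiny stand-in built in memory with placeholder contents, not the real firstdoc.sxw.

```python
import io
import zipfile


def list_entries(document):
    """Return (name, uncompressed size) for each entry in a ZIP-based document.

    `document` may be a file name such as "firstdoc.sxw" or a file-like object.
    """
    with zipfile.ZipFile(document) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Stand-in archive with placeholder contents; a real document holds more files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("mimetype", "application/vnd.sun.xml.writer")
    archive.writestr("content.xml", "<placeholder/>")

for name, size in list_entries(buffer):
    print(f"{size:8d}  {name}")
```

To inspect a real document, pass its file name instead of the in-memory buffer.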

Example 1.1, “Listing of Unzipped Text Document” shows the results of unzipping this file in Linux; the date, time, and CRC columns have been edited out to save horizontal space.

These files are, in order:

mimetype

This file has a single line of text which gives the MIME type for the document. The various MIME types are summarized in Table 1.1, “MIME Types for OpenOffice.org Documents”.

content.xml

The actual content of the document.

styles.xml

This file contains information about the styles used in the content. The content and style information are in different files on purpose; separating content from presentation provides more flexibility.

meta.xml

Meta-information about the content of the document (author, last revision date, and so on). This file is not the same thing as the META-INF directory.

settings.xml

This file contains information that is specific to the application. Some of this information, such as window size/position and printer settings, is common to most documents. A text document would have information such as zoom factor, whether headers and footers are visible, etc. A spreadsheet would contain information about whether column headers are visible, whether cells with a value of zero should show the zero or be empty, etc.

META-INF/manifest.xml

This file gives a list of all the other files in the JAR. This is meta-information about the entire JAR file.

We will discuss the meta.xml, settings.xml, and styles.xml files in greater detail in the next chapter, and the remainder of the book will cover the various flavors of the content.xml file.

First, let’s look at the contents of manifest.xml, most of which is self-explanatory.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE manifest:manifest PUBLIC "-//OpenOffice.org//DTD Manifest 1.0//EN"
    "Manifest.dtd">
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">

    <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer"
                         manifest:full-path="/" />
    <manifest:file-entry manifest:media-type=""
        manifest:full-path="Pictures/" />
    <manifest:file-entry manifest:media-type="text/xml"
        manifest:full-path="content.xml" />
    <manifest:file-entry manifest:media-type="text/xml"
        manifest:full-path="styles.xml" />
    <manifest:file-entry manifest:media-type="text/xml"
        manifest:full-path="meta.xml" />
    <manifest:file-entry manifest:media-type="text/xml"
        manifest:full-path="settings.xml" />
</manifest:manifest>

The manifest:media-type for the root directory tells what kind of file this is. Its content is the same as the content of the mimetype file, as shown in Table 1.1, “MIME Types for OpenOffice.org Documents”.
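As an illustration, the root entry can be read programmatically. This is a minimal Python sketch using the standard xml.etree.ElementTree module; the function name root_media_type is our own, and the manifest it parses is a shortened copy of the listing above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the manifest listing above.
MANIFEST_NS = "http://openoffice.org/2001/manifest"

# Shortened copy of the manifest, kept as bytes so the encoding
# declaration is honored by the parser.
MANIFEST = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE manifest:manifest PUBLIC "-//OpenOffice.org//DTD Manifest 1.0//EN"
    "Manifest.dtd">
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
    <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer"
        manifest:full-path="/" />
    <manifest:file-entry manifest:media-type="text/xml"
        manifest:full-path="content.xml" />
</manifest:manifest>
"""


def root_media_type(manifest_xml):
    """Return the media type recorded for the "/" entry, or None if absent."""
    root = ET.fromstring(manifest_xml)
    for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry"):
        if entry.get(f"{{{MANIFEST_NS}}}full-path") == "/":
            return entry.get(f"{{{MANIFEST_NS}}}media-type")
    return None


print(root_media_type(MANIFEST))
```

Comparing this value with the text of the mimetype file is a quick consistency check on a document you have built yourself.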

There is an entry for a Pictures directory, even though there are no images in the file. If there were an image, the unzipped file would contain a Pictures directory, and the relevant portion of the manifest would now look like this:


    <manifest:file-entry manifest:media-type="image/png"
        manifest:full-path="Pictures/100002000000002000000020DF8717E9.png" />
    <manifest:file-entry manifest:media-type=""
        manifest:full-path="Pictures/" />

If you have included OpenOffice.org BASIC scripts, the document's JAR file will include a Basic directory, and the manifest will describe it and its contents.

If you are building your own document with embedded objects (charts, pictures, etc.), you must keep track of them in the manifest file, or OpenOffice.org will not be able to find them.
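One way to catch a forgotten entry before OpenOffice.org does is to compare the names in the archive with the paths in the manifest. The sketch below is a hypothetical helper, not part of OpenOffice.org. It assumes that the mimetype file and the manifest itself need no entries, because the sample manifest in this chapter does not list them. The picture name logo.png is invented for the example.

```python
import xml.etree.ElementTree as ET

MANIFEST_NS = "http://openoffice.org/2001/manifest"


def unlisted_files(archive_names, manifest_xml):
    """Return archive entries that have no manifest:file-entry of their own."""
    root = ET.fromstring(manifest_xml)
    listed = {
        entry.get(f"{{{MANIFEST_NS}}}full-path")
        for entry in root.iter(f"{{{MANIFEST_NS}}}file-entry")
    }
    # Assumption: these two files are not expected to appear in the manifest,
    # since the sample manifest in this chapter omits them.
    exempt = {"META-INF/manifest.xml", "mimetype"}
    return sorted(set(archive_names) - listed - exempt)


# Hypothetical archive contents; in practice use zipfile.ZipFile(...).namelist().
names = ["mimetype", "content.xml", "Pictures/logo.png", "META-INF/manifest.xml"]

manifest = b"""<manifest:manifest
    xmlns:manifest="http://openoffice.org/2001/manifest">
    <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer"
        manifest:full-path="/" />
    <manifest:file-entry manifest:media-type="text/xml"
        manifest:full-path="content.xml" />
</manifest:manifest>
"""

print(unlisted_files(names, manifest))
```

Here the picture is reported because the manifest has no entry for it.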

The manifest.xml file uses the manifest namespace for all of its element and attribute names. OpenOffice.org uses a large number of namespace declarations in the root element of the content.xml, styles.xml, and settings.xml files. Table 1.2, “Namespaces for OpenOffice.org Documents”, which is adapted from the OpenOffice.org XML File Format reference manual, shows the most important of these.

Table 1.2. Namespaces for OpenOffice.org Documents

Namespace Prefix
    Description
    Namespace URI

office
    Elements and attributes for common information not found in other namespaces.
    http://openoffice.org/2000/office

style
    Information about presentation styles, as well as common formatting attributes.
    http://openoffice.org/2000/style

text
    Elements and attributes used in text documents, as well as other areas where text can be displayed, such as spreadsheet cells and presentations.
    http://openoffice.org/2000/text

table
    Elements and attributes used in spreadsheets or in tables within text documents.
    http://openoffice.org/2000/table

draw
    Elements and attributes that describe graphic content; used in drawing and presentation documents.
    http://openoffice.org/2000/drawing

number
    Elements and attributes that describe number formatting (currency symbol, decimal symbol, etc.).
    http://openoffice.org/2000/datastyle

chart
    Elements and attributes for chart content.
    http://openoffice.org/2000/chart

dr3d
    Elements and attributes for 3-d drawing.
    http://openoffice.org/2000/dr3d

form
    Elements and attributes that describe interactive forms in text and HTML documents.
    http://openoffice.org/2000/form

script
    Elements and attributes used in macros, as well as in interactive scripting for forms.
    http://openoffice.org/2000/script

xlink
    The W3C XLink namespace; used for hypertext links, reference to embedded pictures, etc.
    http://www.w3.org/1999/xlink

fo
    The W3C XSL Formatting Objects namespace; used for specifying font and page layout properties.
    http://www.w3.org/1999/XSL/Format

svg
    The W3C Scalable Vector Graphics namespace; used for specifying properties and attributes of drawings.
    http://www.w3.org/2000/svg

math
    The W3C Math Markup Language namespace; used in formula documents.
    http://www.w3.org/1998/Math/MathML

dc
    The Dublin Core namespace; used in the meta.xml file to describe metadata about the document.
    http://purl.org/dc/elements/1.1/

Whenever possible, OpenOffice.org uses existing standards for namespaces. The text namespace adds elements and attributes that describe the aspects of word processing that the fo namespace lacks; similarly draw and dr3d add functionality that is not already found in svg.
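These namespaces matter as soon as you process a document with a program, because every element must be looked up by its full namespace URI rather than by its prefix. The following Python sketch pulls the headings out of a content file, the kind of task described at the start of this chapter. The two URIs come from Table 1.2. The element and attribute names (office:body, text:h, text:p, text:level) have not been introduced yet and are used here as assumptions, and the XML is a hand-written stand-in far smaller than a real content.xml.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; both URIs come from Table 1.2.
NS = {
    "office": "http://openoffice.org/2000/office",
    "text": "http://openoffice.org/2000/text",
}

# Hand-written stand-in for content.xml (assumed element names).
CONTENT = b"""<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:level="1">Introduction</text:h>
    <text:p>Some body text.</text:p>
    <text:h text:level="2">Background</text:h>
  </office:body>
</office:document-content>
"""


def headings(content_xml):
    """Return (level, text) for every text:h element, in document order."""
    root = ET.fromstring(content_xml)
    # Attributes are addressed as {namespace-uri}local-name.
    level_attr = "{%s}level" % NS["text"]
    return [
        (int(h.get(level_attr, "1")), "".join(h.itertext()))
        for h in root.iterfind(".//text:h", NS)
    ]


for level, title in headings(CONTENT):
    print(level, title)
```

The prefixes in the NS map are local to the program; only the URIs have to match those in the document.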

If you unzip an OpenOffice.org document, it will unzip into the current directory. If you unpack a second document, your unzip program will either overwrite the old files or prompt you at each file. This is inconvenient, so we have written a Perl program, shown in Example 1.2, “Program to Unpack an OpenOffice.org Document”, which will unpack an OpenOffice.org document whose name has the form filename.extension. It will unzip the files into a directory named filename_extension.

The system() calls in this program are designed for Linux; you will have to modify them to run on Windows or Mac OS X.
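If you would rather avoid platform-specific system() calls altogether, the same idea can be sketched with a ZIP library. The Python function below is not the Perl program of Example 1.2; it is a minimal stand-in that follows the same naming rule, and the names unpack_document and parent_dir are our own.

```python
import os
import zipfile


def unpack_document(path, parent_dir="."):
    """Unzip `path` (filename.extension) into parent_dir/filename_extension."""
    base = os.path.basename(path)
    stem, extension = os.path.splitext(base)
    # firstdoc.sxw becomes firstdoc_sxw
    target = os.path.join(parent_dir, f"{stem}_{extension.lstrip('.')}")
    with zipfile.ZipFile(path) as archive:
        # extractall() creates the target directory and any subdirectories.
        archive.extractall(target)
    return target
```

For example, unpack_document("firstdoc.sxw") would place the files in a firstdoc_sxw directory under the current directory.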

When you look at the unpacked files in a text editor, you will notice that most of them consist of only two lines: a <!DOCTYPE> declaration followed by a single line containing the rest of the document. Ordinarily this is no problem, as the documents are meant to be read by a program rather than a human. In order to analyze the XML files for this book, we had to put the files in a more readable format. This was easily done by turning off the “Size optimization for XML format (no pretty printing)” checkbox in the Options—Load/Save—General dialog box. All the files we created from that point onward were nicely formatted. If you are receiving files from someone else, and you do not wish to go to the trouble of opening and re-saving each of them, you may use XSLT to do the indenting, as explained in the section called “Using XSLT to Indent OpenOffice.org Documents”.
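For a quick look at a single file, a few lines of script can also do the job. The sketch below uses Python's standard xml.dom.minidom module rather than XSLT, so it is a different technique from the one this book describes. The function name indent_xml and the sample input are our own. Treat the result as a reading aid only: the indentation adds whitespace, which can change the text of elements that mix text and child elements.

```python
from xml.dom import minidom


def indent_xml(xml_bytes):
    """Return an indented copy of the XML as a string, for reading only."""
    return minidom.parseString(xml_bytes).toprettyxml(indent="  ")


# A one-line document, similar in shape to a size-optimized file.
one_line = b'<a xmlns="urn:example"><b><c>text</c></b><b/></a>'
print(indent_xml(one_line))
```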

If you need to pack (or repack) files to produce an OpenOffice.org document, Example 1.3, “Program to Pack Files to Create an OpenOffice.org Document” does exactly that. It takes the files in a directory of the form filename_extension and creates a document named filename.extension (or any other name you wish to give as a second argument on the command line).
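The packing step can be sketched the same way. The Python function below is not the program of Example 1.3; it is a minimal stand-in under the same naming rule, with pack_document and its parameters as our own names. It only zips the files it finds, so the manifest must already describe them, and we have not verified whether OpenOffice.org places any requirements on the order of entries in the archive.

```python
import os
import zipfile


def pack_document(directory, output=None):
    """Zip the files under `directory` (filename_extension) into a document."""
    directory = os.path.normpath(directory)
    if output is None:
        # Turn the last underscore back into a dot: firstdoc_sxw -> firstdoc.sxw
        stem, underscore, extension = os.path.basename(directory).rpartition("_")
        if not underscore:
            raise ValueError("directory name must look like filename_extension")
        output = os.path.join(os.path.dirname(directory), f"{stem}.{extension}")
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(directory):
            for name in sorted(files):
                full_path = os.path.join(folder, name)
                # Store paths relative to the directory, with "/" separators.
                relative = os.path.relpath(full_path, directory)
                archive.write(full_path, relative.replace(os.sep, "/"))
    return output
```

Calling pack_document("firstdoc_sxw") would write firstdoc.sxw next to the directory; pass a second argument to choose another name.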


Content licensed under a Creative Commons License.
All content is copyright O’Reilly & Associates, Inc.
During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation’s GNU Free Documentation License.