In this chapter, we will discuss not only the “what” of the OpenOffice.org file format, but also the “why.” Thus, this chapter is as much evangelism as explanation.
Before we can talk about OpenOffice.org, we have to look at the current state of proprietary office suites and applications. In this world, all your documents are stored in a proprietary (often binary) format. As long as you stay within the office suite, this is not a problem. You can transfer data from one part of the suite to another; you can transfer text from the word processor to a presentation, or you can grab a set of numbers from the spreadsheet and convert it to a table in your word processing document.
The problems begin when you want to do a transfer that wasn’t intended by the authors of the office suite. Because the internal structure of the data is unknown to you, you can’t write a program that creates a new word processing document consisting of all the headings from a different document. If you need to do something that wasn’t provided by the software vendor, or if you must process the data with an application external to the office suite, you will have to convert that data to some neutral or “universal” format such as Rich Text Format (RTF) or comma-separated values (CSV) for import into the other applications. You have to rely on the kindness of strangers to include these conversions in the first place. Furthermore, some conversions can result in loss of formatting information that was stored with your data.
Note also that your data can become inaccessible when the software vendor moves to a new internal format and stops supporting your current version. (Some people actually suggest that this is not cause for complaint since, by putting your data into the vendor’s proprietary format, the vendor has now become a co-owner of your data. This is, and I mean this in the nicest possible way, a dangerously idiotic idea.)
OpenOffice.org has as its mission “[t]o create, as a community, the leading international office suite that will run on all major platforms and provide access to all functionality and data through open-component based APIs and an XML-based file format.”
The OpenOffice.org file format is not simply an XML wrapper for a binary format, nor a one-to-one correspondence between the XML tags and the internal data structures. Instead, it is an idealized representation of the structure. This allows future versions of OpenOffice.org to implement new features or completely alter internal data structures without requiring major changes to the file format. You can see the full details of this design decision at http://xml.openoffice.org/xml_advocacy.html
Although the XML file format is human-readable, it is fairly verbose. To save space, OpenOffice.org files are stored in JAR (Java Archive) format. A JAR file is a compressed ZIP file that has an additional “manifest” file that lists the contents of the archive. Since all JAR files are also ZIP files, you may use any ZIP file tool to unpack the OpenOffice.org document and read the XML directly. Figure 1.1, “Text Document” shows a short word processing document, which we have saved with the name firstdoc.sxw.
Example 1.1, “Listing of Unzipped Text Document” shows the results of unzipping this file in Linux; the date, time, and CRC columns have been edited out to save horizontal space.
Example 1.1. Listing of Unzipped Text Document
[david@penguin ch01]$ unzip -v firstdoc.sxw
Archive:  firstdoc.sxw
  Length  Method     Size  Ratio  Name
--------  ------  -------  -----  ----
      30  Stored       30     0%  mimetype
    2642  Defl:N      738    72%  content.xml
    4797  Defl:N     1217    75%  styles.xml
    1128  Stored     1128     0%  meta.xml
    6486  Defl:N     1391    79%  settings.xml
     752  Defl:N      254    66%  META-INF/manifest.xml
--------          -------    ---  -------
   15835             4758    70%  6 files
These files are, in order:
mimetype: This file has a single line of text that gives the MIME type for the document. The various MIME types are summarized in Table 1.1, “MIME Types for OpenOffice.org Documents”.
content.xml: The actual content of the document.
styles.xml: This file contains information about the styles used in the content. The content and style information are in different files on purpose; separating content from presentation provides more flexibility.
meta.xml: Meta-information about the content of the document (such things as author, last revision date, etc.). This is different from the META-INF directory.
settings.xml: This file contains information that is specific to the application. Some of this information, such as window size/position and printer settings, is common to most documents. A text document would have information such as zoom factor, whether headers and footers are visible, etc. A spreadsheet would contain information about whether column headers are visible, whether cells with a value of zero should show the zero or be empty, etc.
META-INF/manifest.xml: This file gives a list of all the other files in the JAR. This is meta-information about the entire JAR file.
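Because the document is an ordinary ZIP archive, the same listing can be produced from a program. The following is a minimal Python sketch, not part of the original toolset; the helper names list_members and build_sample are hypothetical, and the sample archive is a tiny stand-in built in memory rather than the real firstdoc.sxw.

```python
import io
import zipfile

# Human-readable names for the two compression methods seen in the listing.
METHOD_NAMES = {zipfile.ZIP_STORED: "Stored", zipfile.ZIP_DEFLATED: "Deflated"}


def list_members(document):
    """Return (name, size, compressed size, method) for each archive member.

    `document` may be a path or a binary file-like object.
    """
    rows = []
    with zipfile.ZipFile(document) as archive:
        for info in archive.infolist():
            method = METHOD_NAMES.get(info.compress_type, str(info.compress_type))
            rows.append((info.filename, info.file_size, info.compress_size, method))
    return rows


def build_sample():
    """Build a small stand-in archive in memory (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.sun.xml.writer",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("content.xml", "<doc>" + "x" * 500 + "</doc>",
                         compress_type=zipfile.ZIP_DEFLATED)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, size, packed, method in list_members(build_sample()):
        print(f"{size:8d}  {method:8s}  {packed:7d}  {name}")
```

To inspect a real document, pass its path instead of the in-memory sample, for example list_members("firstdoc.sxw").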
Table 1.1. MIME Types for OpenOffice.org Documents
| Document Type | MIME Type |
|---|---|
| Text | application/vnd.sun.xml.writer |
| Spreadsheet | application/vnd.sun.xml.calc |
| Drawing | application/vnd.sun.xml.draw |
| Presentation | application/vnd.sun.xml.impress |
| Chart | application/vnd.sun.xml.chart |
| Formula | application/vnd.sun.xml.math |
| Master Document | application/vnd.sun.xml.writer.global |
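Since the mimetype file holds a single line of text, a program can identify the kind of document without parsing any XML. Here is a small Python sketch of that idea, using Table 1.1 as a lookup; the names DOCUMENT_TYPES, document_type, and make_document are hypothetical, and the sample archive contains nothing but a mimetype file.

```python
import io
import zipfile

# Table 1.1 expressed as a lookup from MIME type to document type.
DOCUMENT_TYPES = {
    "application/vnd.sun.xml.writer": "Text",
    "application/vnd.sun.xml.calc": "Spreadsheet",
    "application/vnd.sun.xml.draw": "Drawing",
    "application/vnd.sun.xml.impress": "Presentation",
    "application/vnd.sun.xml.chart": "Chart",
    "application/vnd.sun.xml.math": "Formula",
    "application/vnd.sun.xml.writer.global": "Master Document",
}


def document_type(document):
    """Return (MIME type, document type) read from the archive's mimetype file."""
    with zipfile.ZipFile(document) as archive:
        mime_type = archive.read("mimetype").decode("ascii").strip()
    return mime_type, DOCUMENT_TYPES.get(mime_type, "Unknown")


def make_document(mime_type):
    """Build an in-memory archive holding only a mimetype file (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", mime_type)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(document_type(make_document("application/vnd.sun.xml.calc")))
```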
We will discuss the meta.xml, settings.xml, and styles.xml files in greater detail in the next chapter, and the remainder of the book will cover the various flavors of the content.xml file.
First, let’s look at the contents of manifest.xml, most of which is self-explanatory.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE manifest:manifest PUBLIC "-//OpenOffice.org//DTD Manifest
1.0//EN" "Manifest.dtd">
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
<manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer"
manifest:full-path="/" />
<manifest:file-entry manifest:media-type=""
manifest:full-path="Pictures/" />
<manifest:file-entry manifest:media-type="text/xml"
manifest:full-path="content.xml" />
<manifest:file-entry manifest:media-type="text/xml"
manifest:full-path="styles.xml" />
<manifest:file-entry manifest:media-type="text/xml"
manifest:full-path="meta.xml" />
<manifest:file-entry manifest:media-type="text/xml"
manifest:full-path="settings.xml" />
</manifest:manifest>
The manifest:media-type for the root directory tells what kind of file this is. Its content is the same as the content of the mimetype file, as shown in Table 1.1, “MIME Types for OpenOffice.org Documents”.
There is an entry for a Pictures directory, even though there are no images in the file. If there were an image, the unzipped file would contain a Pictures directory, and the relevant portion of the manifest would now look like this:
<manifest:file-entry manifest:media-type="image/png"
manifest:full-path="Pictures/100002000000002000000020DF8717E9.png" />
<manifest:file-entry manifest:media-type=""
manifest:full-path="Pictures/" />
If you have included OpenOffice.org BASIC scripts, your .jar file will include a Basic directory, and the manifest will describe it and its contents.
If you are building your own document with embedded objects (charts, pictures, etc.), you must keep track of them in the manifest file, or OpenOffice.org will not be able to find them.
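When you assemble a document yourself, it can help to check that nothing in the archive has been left out of the manifest. The following Python sketch shows one way to do that under some assumptions: the function names are hypothetical, the sample manifest is a shortened hand-written one, and mimetype and the manifest itself are skipped because the manifest shown above does not list them.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "http://openoffice.org/2001/manifest"
MANIFEST_PATH = "META-INF/manifest.xml"


def manifest_entries(manifest_xml):
    """Return {full-path: media-type} for every file-entry in a manifest."""
    root = ET.fromstring(manifest_xml)
    entries = {}
    for entry in root.iter("{%s}file-entry" % MANIFEST_NS):
        path = entry.get("{%s}full-path" % MANIFEST_NS)
        entries[path] = entry.get("{%s}media-type" % MANIFEST_NS, "")
    return entries


def unlisted_members(document):
    """Return archive members that the manifest does not mention."""
    with zipfile.ZipFile(document) as archive:
        listed = manifest_entries(archive.read(MANIFEST_PATH))
        names = archive.namelist()
    # Assumption: mimetype and the manifest itself need no entry,
    # matching the sample manifest in this chapter.
    ignored = {"mimetype", MANIFEST_PATH}
    return [name for name in names
            if name not in listed and name not in ignored
            and not name.endswith("/")]


# Shortened, hand-written manifest used only for this illustration.
SAMPLE_MANIFEST = b"""<?xml version="1.0" encoding="utf-8"?>
<manifest:manifest xmlns:manifest="http://openoffice.org/2001/manifest">
 <manifest:file-entry manifest:media-type="application/vnd.sun.xml.writer"
  manifest:full-path="/" />
 <manifest:file-entry manifest:media-type="text/xml"
  manifest:full-path="content.xml" />
</manifest:manifest>
"""


def build_sample():
    """Build an in-memory archive with one picture missing from the manifest."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.sun.xml.writer")
        archive.writestr("content.xml", "<placeholder/>")
        archive.writestr("Pictures/forgotten.png", b"not really a PNG")
        archive.writestr(MANIFEST_PATH, SAMPLE_MANIFEST)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(unlisted_members(build_sample()))
```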
The manifest.xml used the manifest namespace for all of its element and attribute names. OpenOffice.org uses a large number of namespace declarations in the root element of the content.xml, styles.xml, and settings.xml files. Table 1.2, “Namespaces for OpenOffice.org Documents”, which is adapted from the OpenOffice.org XML File Format reference manual, shows the most important of these.
Table 1.2. Namespaces for OpenOffice.org Documents
| Namespace Prefix | Description | Namespace URI |
|---|---|---|
| office | Elements and attributes for common information not found in other namespaces. | http://openoffice.org/2000/office |
| style | Information about presentation styles, as well as common formatting attributes. | http://openoffice.org/2000/style |
| text | Elements and attributes used in text documents, as well as other areas where text can be displayed, such as spreadsheet cells and presentations. | http://openoffice.org/2000/text |
| table | Elements and attributes used in spreadsheets or in tables within text documents. | http://openoffice.org/2000/table |
| draw | Elements and attributes that describe graphic content; used in drawing and presentation documents. | http://openoffice.org/2000/drawing |
| number | Elements and attributes that describe number formatting (currency symbol, decimal symbol, etc.). | http://openoffice.org/2000/datastyle |
| chart | Elements and attributes for chart content. | http://openoffice.org/2000/chart |
| dr3d | Elements and attributes for 3-d drawing. | http://openoffice.org/2000/dr3d |
| form | Elements and attributes that describe interactive forms in text and HTML documents. | http://openoffice.org/2000/form |
| script | Elements and attributes used in macros, as well as in interactive scripting for forms. | http://openoffice.org/2000/script |
| xlink | The W3C XLink namespace; used for hypertext links, reference to embedded pictures, etc. | http://www.w3.org/1999/xlink |
| fo | The W3C XSL Formatting Objects namespace; used for specifying font and page layout properties. | http://www.w3.org/1999/XSL/Format |
| svg | The W3C Scalable Vector Graphics namespace; used for specifying properties and attributes of drawings. | http://www.w3.org/2000/svg |
| math | The W3C Math Markup Language namespace; used in formula documents. | http://www.w3.org/1998/Math/MathML |
| dc | The Dublin Core namespace; used in the meta.xml file to describe metadata about the document. | http://purl.org/dc/elements/1.1/ |
Whenever possible, OpenOffice.org uses existing standards for namespaces. The text namespace adds elements and attributes that describe the aspects of word processing that the fo namespace lacks; similarly draw and dr3d add functionality that is not already found in svg.
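Any namespace-aware XML parser can use these URIs to pick out elements, which makes it possible to do things like collect all the headings from a document. The Python sketch below illustrates the idea. It assumes that headings are stored as text:h elements with a text:level attribute, a detail that is not covered in this chapter, and it runs against a hand-written fragment rather than a real content.xml; the function name headings is hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map taken from Table 1.2 (only the prefixes used here).
NS = {
    "office": "http://openoffice.org/2000/office",
    "text": "http://openoffice.org/2000/text",
}

# Hand-written fragment for illustration; real content.xml files are larger.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:level="1">Introduction</text:h>
    <text:p>Some body text.</text:p>
    <text:h text:level="2">Details</text:h>
  </office:body>
</office:document-content>
"""


def headings(content_xml):
    """Return (level, text) for each text:h element, in document order."""
    root = ET.fromstring(content_xml)
    level_attr = "{%s}level" % NS["text"]
    return [(int(h.get(level_attr, "1")), "".join(h.itertext()))
            for h in root.iterfind(".//text:h", NS)]


if __name__ == "__main__":
    for level, title in headings(SAMPLE):
        print("  " * (level - 1) + title)
```

The prefixes in the NS dictionary are local to the program; only the namespace URIs have to match those in the document.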
If you unzip an OpenOffice.org document, it will unzip into the current directory. If you unpack a second document, your unzip program will either overwrite the old files or prompt you at each file. This is inconvenient, so we have written a Perl program, shown in Example 1.2, “Program to Unpack an OpenOffice.org Document”, which will unpack an OpenOffice.org document whose name has the form filename.extension. It will unzip the files into a directory named filename_extension.
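Example 1.2 uses the Archive::Zip and File::Path modules rather than external commands, so it should run wherever those modules are installed. The system() call in Example 1.3, “Program to Pack Files to Create an OpenOffice.org Document”, is designed for Linux; you will have to modify it to run on Windows or Macintosh OS X.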
Example 1.2. Program to Unpack an OpenOffice.org Document
#!/usr/bin/perl
#
# Unpack an OpenOffice.org file to a
# directory.
#
# Archive::Zip is used to unzip files.
# File::Path is used to create and remove directories.
#
use strict;
use Archive::Zip;
use File::Path;

my $file_name;
my $dir_name;
my $suffix;
my $zip;
my $member_name;
my @member_list;

if (scalar @ARGV != 1)
{
    print "Usage: $0 filename\n";
    exit;
}

$file_name = $ARGV[0];

#
# Only allow filenames ending with:
#   .sxw text
#   .sxc spreadsheet
#   .sxi presentation
#   .sxd drawing
#   .sxg master
#   .sxm formula
# or the corresponding template suffixes
# (.stw, .stc, .sti, .std, .stm).
# The trailing $ anchors the match to the end of the name.
#
if ($file_name !~ m/\.s(([xt][wcidm])|(xg))$/)
{
    print "This does not appear to be an OpenOffice.org file.\n";
    print "Legal suffixes are .sxw, .sxc, .sxi, .sxd, .sxm, .sxg,\n";
    print ".stw, .stc, .sti, .std, and .stm\n";
    exit;
}
$suffix = $1;

#
# Create directory name based on filename
#
($dir_name = $file_name) =~ s/\.s$suffix$//;
$dir_name .= "_s$suffix";

#
# Forcibly remove old directory, re-create it,
# and unzip the OpenOffice.org file into that directory
#
rmtree($dir_name, 0, 0);
mkpath($dir_name, 0, 0755);

# new() returns undef if the file cannot be read as a ZIP archive
$zip = Archive::Zip->new( $file_name )
    or die "Cannot read $file_name as a ZIP archive.\n";
@member_list = $zip->memberNames( );
foreach $member_name (@member_list)
{
    $zip->extractMember( $member_name, "$dir_name/$member_name" );
}
print "$file_name unpacked.\n";
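For readers who do not have Perl and Archive::Zip at hand, the same steps can be sketched with Python's standard library. This is an illustrative equivalent under the same naming convention (filename.extension unpacks into filename_extension), not a replacement for the program above; the names SUFFIX_PATTERN, unpack_dir_name, and unpack are hypothetical.

```python
import os
import re
import shutil
import zipfile

# Same suffixes as the Perl program: .sx? documents and .st? templates.
SUFFIX_PATTERN = re.compile(r"\.(s(?:[xt][wcidm]|xg))$")


def unpack_dir_name(file_name):
    """Turn 'name.sxw' into 'name_sxw'; reject other suffixes."""
    match = SUFFIX_PATTERN.search(file_name)
    if match is None:
        raise ValueError("This does not appear to be an OpenOffice.org file.")
    return file_name[:match.start()] + "_" + match.group(1)


def unpack(file_name):
    """Remove any old target directory, then extract the archive into it."""
    target = unpack_dir_name(file_name)
    shutil.rmtree(target, ignore_errors=True)
    with zipfile.ZipFile(file_name) as archive:
        archive.extractall(target)
    return target


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} filename")
    else:
        print(unpack(sys.argv[1]), "created.")
```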
When you look at the unpacked files in a text editor, you will notice that most of them consist of only two lines: a <!DOCTYPE> declaration followed by a single line containing the rest of the document. Ordinarily this is no problem, as the documents are meant to be read by a program rather than a human. In order to analyze the XML files for this book, we had to put the files in a more readable format. This was easily done by turning off the “Size optimization for XML format (no pretty printing)” checkbox in the Options—Load/Save—General dialog box. All the files we created from that point onward were nicely formatted. If you are receiving files from someone else, and you do not wish to go to the trouble of opening and re-saving each of them, you may use XSLT to do the indenting, as explained in the section called “Using XSLT to Indent OpenOffice.org Documents”.
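If you only want a quick look at an unindented file, a general-purpose XML library can also add the line breaks. The Python sketch below uses xml.dom.minidom for this; it is a different technique from the XSLT approach mentioned above, and the function name indent_xml is hypothetical. Because indenting inserts whitespace between elements, treat the result as something to read, not something to repack into a document.

```python
from xml.dom import minidom


def indent_xml(xml_bytes):
    """Return an indented copy of an XML document, for reading only."""
    return minidom.parseString(xml_bytes).toprettyxml(indent="  ")


if __name__ == "__main__":
    # A one-line fragment standing in for an unindented file.
    flat = b"<a><b>text</b><c><d/></c></a>"
    print(indent_xml(flat))
```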
If you need to pack (or repack) files to produce an OpenOffice.org document, Example 1.3, “Program to Pack Files to Create an OpenOffice.org Document” does exactly that. It takes the files in a directory of the form filename_extension and creates a document named filename.extension (or any other name you wish to give as a second argument on the command line).
Example 1.3. Program to Pack Files to Create an OpenOffice.org Document
#!/usr/bin/perl
#
# Repack a directory to an OpenOffice.org file
#
# Directory xyz_sxw will be packed into xyz.sxw, etc.
#
# Cwd is used to remember the starting directory.
#
use strict;
use Cwd;

my $dir_name;
my $file_name;
my $suffix;
my $current_dir;

if (scalar @ARGV < 1 || scalar @ARGV > 2)
{
    print "Usage: $0 directoryname [newfilename]\n";
    exit;
}

$dir_name = $ARGV[0];

#
# If no new filename is given, create a filename
# based on directory name
#
if ($ARGV[1])
{
    $file_name = $ARGV[1];
}
else
{
    if ($dir_name !~ m/_s(([xt][wcidm])|(xg))$/)
    {
        print "This does not appear to be an unpacked OpenOffice.org file.\n";
        print "Legal suffixes are _sxw, _sxc, _sxi, _sxd, _sxm, _sxg,\n";
        print "_stw, _stc, _sti, _std, and _stm.\n";
        exit;
    }
    $suffix = $1;
    ($file_name = $dir_name) =~ s/_s$suffix$//;
    $file_name .= ".s$suffix";
}

#
# getcwd() returns the path without the trailing newline
# that `pwd` would add, so the chdir() back will succeed
#
$current_dir = getcwd();
if (!chdir($dir_name))
{
    print "Cannot change to directory $dir_name\n";
    exit;
}
system("zip -r ../$file_name *");
chdir($current_dir);
print "$dir_name packed to $file_name.\n";
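Packing can also be done without an external zip command. The following Python sketch is one possible approach, with a hypothetical function name, pack. It writes mimetype first and uncompressed, which simply mirrors the listing in Example 1.1; whether OpenOffice.org depends on that ordering is not stated in this chapter. It also writes only files, so an empty directory such as Pictures/ gets no entry of its own.

```python
import os
import zipfile


def pack(dir_name, file_name):
    """Pack the files under dir_name into file_name; return the member names."""
    members = []
    for folder, _subdirs, files in os.walk(dir_name):
        for name in files:
            full_path = os.path.join(folder, name)
            # Archive names always use forward slashes.
            arc_name = os.path.relpath(full_path, dir_name).replace(os.sep, "/")
            members.append((arc_name, full_path))
    # Put mimetype first, then everything else in sorted order.
    members.sort(key=lambda member: (member[0] != "mimetype", member[0]))
    with zipfile.ZipFile(file_name, "w", zipfile.ZIP_DEFLATED) as archive:
        for arc_name, full_path in members:
            if arc_name == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            archive.write(full_path, arc_name, compress_type=method)
    return [arc_name for arc_name, _full_path in members]


if __name__ == "__main__":
    import sys
    if len(sys.argv) != 3:
        print(f"Usage: {sys.argv[0]} directoryname newfilename")
    else:
        pack(sys.argv[1], sys.argv[2])
        print(f"{sys.argv[1]} packed to {sys.argv[2]}.")
```

Give an output path that lies outside the directory being packed; otherwise the new archive would be picked up as one of its own members.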
If you would rather not go to the trouble of unpacking and packing a document to view its XML, you may use a program called SAXEcho, which communicates with OpenOffice.org to load the XML from the currently open document. You may then edit the XML in either tree view or text view; then you may save it to a new document.
In brief, you must run SAXEcho with your classpath containing all the jar files in the program/classes directory in the OpenOffice.org installation directory. You must then start OpenOffice.org from the command line with the following switch, quoted so that the shell does not treat the semicolons as command separators:
soffice "-accept=socket,host=localhost,port=2002;urp;"
The full details of SAXEcho may be found at http://xml.openoffice.org/saxecho/.
As you begin to work with OpenOffice.org’s XML files, you may want to write a program that constructs a document with some feature that isn’t explained in this book; this is, after all, an “essentials” book. Just start OpenOffice.org, create a document that has the feature you want, unpack the file, and look for the XML that implements it. To get a better understanding of how things work, change the XML, repack the document, and reload it. Once you know how a feature works, don’t hesitate to copy and paste the XML from the OpenOffice.org file into your program. In other words, cheat. It worked for me when I was writing this book, and it can work for you too!
Content licensed under a Creative Commons License.