Appendix C. Utilities for OpenOffice.org Documents

As we were writing this book, we developed some utilities to make it easier to manipulate OpenOffice.org documents. We hope they are equally useful to you.

OpenOffice.org documents are stored in a JAR format. Rather than having to unjar each document before running an XSLT transformation on it, we wrote this program, which lets you perform a transformation on a member of a JAR file without having to expand it. It also lets you create a JAR file (without a manifest) as output, if your output is intended to be used as an OpenOffice.org document.
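
The idea of reading a member in place can be tried without an OpenOffice.org file at hand. The following is a minimal sketch and not part of OOoTransform: the class name JarMemberSketch and its two methods are hypothetical, and the archive lives in memory rather than on disk. It writes a JAR with no manifest, then reads a single member back by scanning the entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

class JarMemberSketch
{
    /* Build an in-memory JAR (no manifest) holding a single member. */
    static byte[] writeMember( String memberName, String text )
    throws IOException
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream( bytes ))
        {
            jar.putNextEntry( new JarEntry( memberName ) );
            jar.write( text.getBytes( StandardCharsets.UTF_8 ) );
            jar.closeEntry();
        }
        return bytes.toByteArray();
    }

    /* Read one member straight from the archive; returns null if absent. */
    static String readMember( byte[] jarBytes, String memberName )
    throws IOException
    {
        try (JarInputStream jar = new JarInputStream(
                new ByteArrayInputStream( jarBytes ), false ))
        {
            JarEntry entry;
            while ((entry = jar.getNextJarEntry()) != null)
            {
                if (memberName.equals( entry.getName() ))
                {
                    /* The stream is now positioned at this member's data */
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[4096];
                    int n;
                    while ((n = jar.read( buffer )) != -1)
                    {
                        out.write( buffer, 0, n );
                    }
                    return new String( out.toByteArray(),
                        StandardCharsets.UTF_8 );
                }
            }
        }
        return null;
    }
}
```

Calling JarMemberSketch.readMember() with a member name that is not in the archive returns null, which is the case a caller has to check before handing the stream to a parser.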

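Now that we have overcome the problem of the phantom DTD, we can write the main transformation program, OOoTransform.java. It takes the following command line arguments, whose names are case-insensitive:

-in inputFilename
The file to be transformed. If -inOOo is also given, this is the name of a member within that OpenOffice.org document.
-xsl transformFilename
The XSLT file, which is always a regular file.
-out outputFilename
The file to receive the output. If -outOOo is also given, this is the name of the member to create within that OpenOffice.org document.
-inOOo inputOOoFileName
The OpenOffice.org document that contains the input member (optional).
-outOOo outputOOoFileName
The OpenOffice.org document to create as output (optional).
-param name value
A parameter name and value to pass to the transformation (optional; may be repeated).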

Thus, if you are transforming a plain file to another plain file, you might have a command line like this:

To transform the content.xml file inside a document named myfile.sxw, producing a non-compressed output file, you might have a command line like this:

And, to transform content.xml inside a document named myfile.sxw to produce a new content.xml inside a result document named newfile.sxw, your command line would be:

Example C.2, “XSLT Transformation for OpenOffice.org documents”, shows the code.

Example C.2. XSLT Transformation for OpenOffice.org documents

/*
 * OOoTransform.java
 * (c) 2003-2004 J. David Eisenberg
 * Licensed under LGPL
 *
 * Program purpose: to perform an XSLT transformation
 * on a member of an OpenOffice.org document, either
 * after unzipping or while still in its zipped state.
 * Output may go to a normal file or a zipped file.
 */

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerConfigurationException;

import org.xml.sax.XMLReader;
import org.xml.sax.InputSource;
import org.xml.sax.ContentHandler;
import org.xml.sax.ext.LexicalHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLReaderFactory;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.JarEntry;
import java.util.Vector;
import java.util.zip.ZipException;

public class OOoTransform
{
    String  inputFileName = null;   // input file name, or member name...
    String  inputOOoName = null;    // ...if an OOo input file is given
    String  outputFileName = null;  // output file name, or member name...
    String  outputOOoName = null;   // ...if an OOo output file is given
    String  xsltFileName = null;    // XSLT file is always a regular file

    Vector  params = new Vector();  // parameters to be passed to transform

    public OOoTransform( )
    {
        /* I thought I needed a constructor here */
    }
    
    public void doTransform( )
    throws TransformerException, TransformerConfigurationException, 
         SAXException, ZipException, IOException       
    {
        /* Set up the XSLT transformation based on the XSLT file */
        File xsltFile = new File( xsltFileName );
        StreamSource streamSource = new StreamSource( xsltFile );
        TransformerFactory tFactory = TransformerFactory.newInstance(); 
        Transformer transformer = tFactory.newTransformer( streamSource );

        /* Set up parameters for transform */
        for (int i=0; i < params.size(); i += 2)
        {
            transformer.setParameter((String) params.elementAt(i),
                (String) params.elementAt(i + 1));
        }

        /* Create an XML reader which will ignore the <!DOCTYPE office.dtd> */
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setEntityResolver( new ResolveOfficeDTD() );
        
        InputSource inputSource;

        if (inputOOoName == null)
        {
            /* This is an unpacked file. */
            inputSource =
                new InputSource( new FileInputStream( inputFileName ) );
        }
        else
        {
            /* The input file should be a member of an OOo file.
               Scan the JAR entries until we reach that member;
               the stream is then positioned at its contents. */
            JarInputStream jarStream =
                new JarInputStream( new FileInputStream( inputOOoName ),
                    false );
            JarEntry jarEntry;
            while ( (jarEntry = jarStream.getNextJarEntry() ) != null &&
                !(inputFileName.equals(jarEntry.getName()) ) )
                // do nothing
                ;
            if (jarEntry == null)
            {
                /* We ran off the end without finding the member */
                throw new IOException( inputFileName +
                    " not found in " + inputOOoName );
            }
            inputSource = new InputSource( jarStream );
        }
        
        SAXSource saxSource = new SAXSource( reader, inputSource );
        saxSource.setSystemId( inputFileName );

        if (outputOOoName == null)
        {
            /* We want a regular file as output */
            FileOutputStream outputStream =
                new FileOutputStream( outputFileName );
            transformer.transform( saxSource, 
                new StreamResult( outputStream ) );
            outputStream.close();   // flush and release the output file
        }
        else
        {
            /* The output file name is the name of a member of
               a JAR file (which we will build without a manifest) */
            JarOutputStream jarStream =
                new JarOutputStream( new FileOutputStream( outputOOoName ) );
            JarEntry jarEntry = new JarEntry( outputFileName );
            jarStream.putNextEntry( jarEntry );
            transformer.transform( saxSource, 
                new StreamResult( jarStream ) );
                
            /* You must close the member file and the JAR file
               to complete the file */
            jarStream.closeEntry();
            jarStream.close();
        }
    }

    /* Check to see if the command line arguments make sense */
    public void checkArgs( String[] args )
    {
        int     i;
        
        if (args.length == 0)
        {
            showUsage( );
            System.exit( 1 );
        }
        i = 0;
        while ( i < args.length )
        {
            if (args[i].equalsIgnoreCase("-in"))
            {
                if ( i+1 >= args.length)
                {
                    badParam("-in");
                }
                inputFileName = args[i+1];
                i += 2;
            }
            else if (args[i].equalsIgnoreCase("-out"))
            {
                if ( i+1 >= args.length)
                {
                    badParam("-out");
                }
                outputFileName = args[i+1];
                i += 2;
            }
            else if (args[i].equalsIgnoreCase("-xsl"))
            {
                if ( i+1 >= args.length)
                {
                    badParam("-xsl");
                }
                xsltFileName = args[i+1];
                i += 2;
            }
            else if (args[i].equalsIgnoreCase("-inooo"))
            {
                if ( i+1 >= args.length)
                {
                    badParam("-inOOo");
                }
                inputOOoName = args[i+1];
                i += 2;
            }
            else if (args[i].equalsIgnoreCase("-outooo"))
            {
                if ( i+1 >= args.length)
                {
                    badParam("-outOOo");
                }
                outputOOoName = args[i+1];
                i += 2;
            }
            else if (args[i].equalsIgnoreCase("-param"))
            {
                if ( i+2 >= args.length)
                {
                    badParam("-param");
                }
                params.addElement( args[i+1] );
                params.addElement( args[i+2] );
                i += 3;
            }
            else
            {
                System.out.println( "Unknown argument " + args[i] );
                System.exit( 1 );
            }
        }
        
        if (inputFileName == null)
        {
            System.out.println("No input file name specified.");
            System.exit( 1 );
        }
        if (outputFileName == null)
        {
            System.out.println("No output file name specified.");
            System.exit( 1 );
        }
        if (xsltFileName == null)
        {
            System.out.println("No XSLT file name specified.");
            System.exit( 1 );
        }
    }

    /* If not enough arguments for a parameter, show error and exit */
    public void badParam( String paramName )
    {
        System.out.println("Not enough parameters to " + paramName);
        System.exit(1);
    }
    
    /* If no arguments are provided, show this brief help section */
    public void showUsage( )
    {
        System.out.println("Usage: OOoTransform options");
        System.out.println("Options:");
        System.out.println("   -in inputFilename");
        System.out.println("   -xsl transformFilename");
        System.out.println("   -out outputFilename");
        System.out.println("If the input filename is within an OOo file, then:");
        System.out.println("   -inOOo inputOOoFileName");
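        System.out.println("If you wish to output an OOo file, then:");
        System.out.println("   -outOOo outputOOoFileName");
        System.out.println("To pass a parameter to the transformation:");
        System.out.println("   -param name value");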
        System.out.println( );
        System.out.println("Argument names are case-insensitive.");
    }

    public static void main(String[] args)
    {
        OOoTransform transformApp = new OOoTransform( );
        transformApp.checkArgs( args );
        try {
            transformApp.doTransform( );
        }
        catch (Exception e)
        {
            System.out.println("Unable to transform");
            System.out.println(e.getMessage( ));
        }
    }
}
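
The listing above relies on a ResolveOfficeDTD class that is not shown in this section. As a rough idea of what such an entity resolver can look like, here is a minimal sketch under the assumption that it is enough to hand the parser an empty DTD whenever office.dtd is requested. The class name EmptyDtdResolverSketch and the parses() helper are hypothetical; this is not the ResolveOfficeDTD class itself.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class EmptyDtdResolverSketch implements EntityResolver
{
    /* Supply an empty DTD in place of office.dtd (an assumption about
       what is sufficient); let the parser resolve anything else itself. */
    public InputSource resolveEntity( String publicId, String systemId )
    {
        if (systemId != null && systemId.endsWith( "office.dtd" ))
        {
            return new InputSource( new StringReader( "" ) );
        }
        return null;
    }

    /* Parse a document with the resolver installed; returns true if the
       parse finishes without throwing an exception. */
    static boolean parses( String xml )
    throws ParserConfigurationException, SAXException, IOException
    {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware( true );
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver( new EmptyDtdResolverSketch() );
        reader.setContentHandler( new DefaultHandler() );
        reader.parse( new InputSource( new StringReader( xml ) ) );
        return true;
    }
}
```

Without a resolver of this kind, the parser would typically try to open a file named office.dtd next to the input and fail when it is not there.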

As an application of the preceding program, we present an alternative method of indenting the unpacked files via a simple XSLT transformation. Example C.4, “XSLT Transformation for Indenting” shows this transformation, which simply copies the entire document tree while setting indent to yes in the <xsl:output> element.
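
To experiment with the idea from Java alone, the following minimal sketch embeds an identity stylesheet with indent="yes" in a string and applies it to a small document. The class name IndentSketch and the embedded stylesheet are illustrative assumptions, and the exact indentation produced depends on the XSLT processor in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class IndentSketch
{
    /* An identity transformation: copy every node and attribute,
       asking the serializer to indent the result. */
    static final String IDENTITY_XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' indent='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /* Run the identity transformation over an XML string. */
    static String indent( String xml ) throws TransformerException
    {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer(
                new StreamSource( new StringReader( IDENTITY_XSLT ) ) );
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource( new StringReader( xml ) ),
            new StreamResult( out ) );
        return out.toString();
    }
}
```

A call such as IndentSketch.indent( "<a><b>text</b></a>" ) returns the same elements with line breaks inserted between them.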

We now present a Perl program to invoke this transformation on all the XML files in an unpacked OpenOffice.org document. We will need to set two paths: one to the transformation script, and one to the location of the preceding XSLT transformation.

This process may insert newlines in text as well as between elements. In cases where elements contain other elements, this is not a problem, as OpenOffice.org ignores whitespace between elements. When expanding text elements, though, the extra newlines could cause extra spaces to appear when repacking the document. Thus, you should use this method to indent the XML document only when you do not intend to repack the resulting files.

When using XSLT with OpenOffice.org documents, you will want to make sure you have declared all the appropriate namespaces. Rather than selecting exactly the namespaces that your document uses, we provide all of the namespaces for OpenOffice.org in Example C.6, “XSLT Framework for Transforming OpenOffice.org Documents”, which you may use as a framework for your transformations.

Example C.6. XSLT Framework for Transforming OpenOffice.org Documents

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:accel="http://openoffice.org/2001/accel" 
    xmlns:chart="http://openoffice.org/2000/chart" 
    xmlns:config="http://openoffice.org/2001/config" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:dlg="http://openoffice.org/2000/dialog" 
    xmlns:dr3d="http://openoffice.org/2000/dr3d" 
    xmlns:draw="http://openoffice.org/2000/drawing" 
    xmlns:event="http://openoffice.org/2001/event" 
    xmlns:fo="http://www.w3.org/1999/XSL/Format" 
    xmlns:form="http://openoffice.org/2000/form" 
    xmlns:image="http://openoffice.org/2001/image" 
    xmlns:library="http://openoffice.org/2000/library" 
    xmlns:manifest="http://openoffice.org/2001/manifest" 
    xmlns:math="http://www.w3.org/1998/Math/MathML"
    xmlns:menu="http://openoffice.org/2001/menu" 
    xmlns:meta="http://openoffice.org/2000/meta" 
    xmlns:number="http://openoffice.org/2000/datastyle" 
    xmlns:office="http://openoffice.org/2000/office" 
    xmlns:script="http://openoffice.org/2000/script" 
    xmlns:statusbar="http://openoffice.org/2001/statusbar" 
    xmlns:style="http://openoffice.org/2000/style" 
    xmlns:svg="http://www.w3.org/2000/svg" 
    xmlns:table="http://openoffice.org/2000/table" 
    xmlns:text="http://openoffice.org/2000/text" 
    xmlns:toolbar="http://openoffice.org/2001/toolbar" 
    xmlns:xlink="http://www.w3.org/1999/xlink" 
>

<xsl:output method="xml"
    doctype-public="-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
    doctype-system="office.dtd"/>

<xsl:template match="/office:document-content">
    <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>
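
To see why the namespace declarations matter, the following minimal sketch applies a cut-down version of the framework from Java. It declares only the office and text namespaces, and the input document deliberately uses different prefixes (o and t) bound to the same URIs; the templates still match, because matching is done on the namespace URI and not on the prefix. The class name FrameworkSketch, the sample document, and the "|" separator are illustrative assumptions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class FrameworkSketch
{
    /* A reduced framework: two namespaces, text output, and a
       template that emits each paragraph followed by "|". */
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:office='http://openoffice.org/2000/office'"
        + " xmlns:text='http://openoffice.org/2000/text'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/office:document-content'>"
        + "<xsl:apply-templates select='office:body/text:p'/>"
        + "</xsl:template>"
        + "<xsl:template match='text:p'>"
        + "<xsl:value-of select='.'/><xsl:text>|</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /* A stand-in for content.xml, using prefixes o and t on purpose. */
    static final String SAMPLE =
        "<o:document-content"
        + " xmlns:o='http://openoffice.org/2000/office'"
        + " xmlns:t='http://openoffice.org/2000/text'>"
        + "<o:body><t:p>Hello</t:p><t:p>World</t:p></o:body>"
        + "</o:document-content>";

    /* Apply the stylesheet to an XML string and return the text output. */
    static String paragraphs( String xml ) throws TransformerException
    {
        Transformer transformer =
            TransformerFactory.newInstance().newTransformer(
                new StreamSource( new StringReader( STYLESHEET ) ) );
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource( new StringReader( xml ) ),
            new StreamResult( out ) );
        return out.toString();
    }
}
```

If the text namespace URI in the stylesheet is misspelled, the text:p template never matches and the output is empty, which is the usual symptom of a missing or wrong namespace declaration.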

If you are creating an OpenOffice.org document from a file where white space has been preserved, you will have to convert runs of spaces into <text:s> elements, and convert tabs and line feeds into <text:tab-stop> and <text:line-break> elements. This task is not easily done in native XSLT; Example C.7, “Transforming Whitespace to OpenOffice.org XML” is a Java extension for Xalan which will do what you need. You will note that we create elements and attributes complete with namespace prefix. This is certainly not a recommended practice, but createElementNS() and setAttributeNS() create xmlns attributes rather than a prefixed name.

Example C.7. Transforming Whitespace to OpenOffice.org XML

import java.util.*;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.apache.xpath.NodeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class OOoWhiteSpace {

    public OOoWhiteSpace () 
    {}

    public static NodeList compressString( String str )
    {
        OOoWhiteSpace whiteSpace = new OOoWhiteSpace();
        return whiteSpace.doCompress( str );
    }

    private Document tempDoc;       // necessary for creating elements
    private StringBuffer strBuf;    // where non-whitespace accumulates
    private NodeSet resultSet;      // the value to be returned
    private int pos;                // current position in string
    private int startPos;           // where blanks begin accumulating
    private int nSpaces;            // number of consecutive spaces
    private boolean inSpaces;       // handling spaces?
    private char ch;                // current character in buffer
    private char prevChar;          // previous character in buffer
    private Element element;        // element to be added to node list

    public NodeList doCompress( String str )
    {  
        if (str.length() == 0)
        {
            return null;
        }

        tempDoc = null;
        strBuf = new StringBuffer( str.length() );
    
        try
        {
            tempDoc = DocumentBuilderFactory.newInstance().
                newDocumentBuilder().newDocument();
        }
        catch(ParserConfigurationException pce)
        {
            return null;
        }
 
        resultSet = new NodeSet();
        resultSet.setShouldCacheNodes(true);
        
        pos = 0;
        startPos = 0;
        nSpaces = 0;
        inSpaces = false;
        ch = '\u0000';

        while (pos < str.length())
        {
            prevChar = ch;
            ch = str.charAt( pos );
            if (ch == ' ')
            {
                if (inSpaces)
                {
                    nSpaces++;
                }
                else
                {
                    emitText( );
                    nSpaces = 1;
                    inSpaces = true;
                    startPos = pos;
                }
            }
            else if (ch == 0x000a || ch == 0x000d)
            {
                if (!(ch == 0x000a && prevChar == 0x000d)) // CR LF is one break
                {
                    emitPending( );
                    element = tempDoc.createElement("text:line-break");
                    resultSet.addNode(element);
                }      
            }
            else if (ch == 0x09)
            {
                emitPending( );
                element = tempDoc.createElement("text:tab-stop");
                resultSet.addNode(element);
            }
            else
            {
                if (inSpaces){ emitSpaces( ); }
                strBuf.append( ch );
            }
            pos++;
        }
        
        emitPending( );     // empty out anything that's accumulated
        
        return resultSet;
    }
    
    /**
     * Emit accumulated spaces or text
     */
    private void emitPending( )
    {
        if (inSpaces)
        {
            emitSpaces( );
        }
        else
        {
            emitText( );
        }
    }

    /*
     * Emit accumulated text.
     * Creates a text node with currently accumulated text.
     * Side effect: empties accumulated text buffer
     */
    private void emitText( )
    {
        if (strBuf.length() != 0)
        {
            Text textNode = tempDoc.createTextNode( strBuf.toString( ) );
            resultSet.addNode( textNode );
            strBuf = new StringBuffer( );
        }
    }
    
    /*
     * Emit accumulated spaces.
     * If these are leading blanks, emit only a
     * <text:s> element; otherwise a blank plus
     * a <text:s> element (if necessary)
     * Side effect: sets accumulated number of spaces to zero.
     * Side effect: sets "inSpaces" flag to false
     */
    private void emitSpaces( )
    {
        if (nSpaces != 0)
        {
            if (startPos != 0)
            {
                Text textNode = tempDoc.createTextNode( " " );
                resultSet.addNode( textNode );
                nSpaces--;
            }

            if (nSpaces >= 1 || startPos == 0)
            {
                element = tempDoc.createElement( "text:s" );
                element.setAttribute( "text:c", 
                    Integer.toString( nSpaces ) );
                resultSet.addNode( element );
            }

            inSpaces = false;
            nSpaces = 0;
        }
    }
}
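
To check what markup a given string should turn into without setting up Xalan, the same rules can be sketched as a standalone method that returns the result as a string instead of a node list. This is a simplified illustration, not the extension above: the class name WhiteSpaceSketch and the method toOOoMarkup() are hypothetical, and the output is plain text that merely looks like the elements the extension creates.

```java
class WhiteSpaceSketch
{
    /* Convert whitespace in str to OpenOffice.org-style markup. */
    static String toOOoMarkup( String str )
    {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < str.length())
        {
            char ch = str.charAt( i );
            if (ch == ' ')
            {
                /* Measure the whole run of spaces */
                int start = i;
                while (i < str.length() && str.charAt( i ) == ' ')
                {
                    i++;
                }
                int count = i - start;
                if (start != 0)
                {
                    /* Inside the string, the first space stays literal */
                    out.append( ' ' );
                    count--;
                }
                if (count > 0)
                {
                    out.append( "<text:s text:c=\"" ).append( count )
                        .append( "\"/>" );
                }
            }
            else if (ch == '\t')
            {
                out.append( "<text:tab-stop/>" );
                i++;
            }
            else if (ch == '\r' || ch == '\n')
            {
                out.append( "<text:line-break/>" );
                /* Treat CR LF as a single line break */
                if (ch == '\r' && i + 1 < str.length()
                    && str.charAt( i + 1 ) == '\n')
                {
                    i++;
                }
                i++;
            }
            else
            {
                /* Ordinary character; escape the XML specials */
                if (ch == '&') { out.append( "&amp;" ); }
                else if (ch == '<') { out.append( "&lt;" ); }
                else { out.append( ch ); }
                i++;
            }
        }
        return out.toString();
    }
}
```

For example, toOOoMarkup( "a   b" ) keeps one literal space and represents the remaining two with a single <text:s> element whose text:c attribute is 2, while two leading spaces become a <text:s> element alone.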

This is the same program as Example 2.3, “Program show_meta.pl”, except that it uses the XML::SAX module instead of XML::Simple. XML::SAX is a Perl module for the Simple API for XML, which interfaces to an event-driven parser. The parser issues many kinds of events as it parses a document; the ones we are interested in are the events that occur when an element starts, when it ends, and when we encounter the element’s text content. To use XML::SAX, you must specify a handler object, which is a Perl package that contains subroutines that are called when the parser detects events. The handler subroutines receive two parameters: a reference to the handler object itself, and a data hash with information about the event. Here are the subroutines that we will implement, the keys from the data hash that we are interested in, and how we will use their values.

start_element

This subroutine is called whenever the parser detects an opening tag for an element. The relevant keys are:

Name
The name of the element (with namespace prefix)
Attributes
The value of this key is another hash, whose keys are the attribute names, preceded by their namespace URIs. The value for each of these keys is yet another hash, with keys Name and Value, whose values are the attribute name and value.

The program will store the element name in a scalar $element and the attributes in a global array @attributes. It sets a global scalar $text to the null string; this variable will be used to collect all the element’s text content.

characters

This subroutine is called whenever the parser detects a series of characters within an element. The relevant key is:

Data
The characters that have been parsed.

The text is concatenated to the end of the $text variable. This is necessary because a single sequence of text may generate multiple calls to the character handler.

end_element

This subroutine is called whenever the parser detects a closing tag for an element. The relevant key is:

Name
The name of the element (with namespace prefix)

Upon encountering the end of an element, the program will add the element name as a key in a hash named %element_info. The hash value will be an anonymous array consisting of the $text content followed by the @attributes array.

Here is the rewritten program.

Example C.8. Program sax_show_meta.pl

#!/usr/bin/perl

#
#   Show meta-information in an OpenOffice.org document.
#
use XML::SAX;
use IO::File;
use Text::Wrap;
use Carp;
use strict 'vars';

my $suffix;     # file suffix

my $parser;     # instance of XML::SAX parser
my $handler;    # module that handles elements, etc.
my $filehandle; # file handle for piped input

my $info;       # reference to the hash built by the handler
my @attributes; # attributes from a returned element
my %attr_hash;  # hash of attribute names and values
#
#   Check for one argument: the name of the OpenOffice.org document
#
if (scalar @ARGV != 1)
{
    croak("Usage: $0 document");
}

#
#   Get file suffix for later reference
#
($suffix) = $ARGV[0] =~ m/\.(\w\w\w)$/;

#
#   Create an object containing handlers for relevant events.
#
$handler = MetaElementHandler->new();


#
#   Create a parser and tell it where to find the handlers.
#
$parser =
    XML::SAX::ParserFactory->parser( Handler => $handler);

#
#   Input to the parser comes from the output of member_read.pl
# 
$ARGV[0] =~ s/[;|'"]//g;  #eliminate dangerous shell metacharacters     
$filehandle = IO::File->new( "perl member_read.pl $ARGV[0] meta.xml |" ); # (1)

#
#   Parse and collect information.
#
$parser->parse_file( $filehandle );

#
#   Retrieve the information collected by the parser
#
$info = $handler->get_info();  # (2)

#
#   Output phase
#
print "Title:       $info->{'dc:title'}[0]\n"
    if ($info->{'dc:title'}[0]);
print "Subject:     $info->{'dc:subject'}[0]\n"
    if ($info->{'dc:subject'}[0]);

if ($info->{'dc:description'}[0])
{
    print "Description:\n";
    $Text::Wrap::columns = 60;
    print wrap("\t", "\t", $info->{'dc:description'}[0]), "\n";
}

print "Created:     ";
print format_date($info->{'meta:creation-date'}[0]);
print " by $info->{'meta:initial-creator'}[0]"
    if ($info->{'meta:initial-creator'}[0]);
print "\n";

print "Last edit:   ";
print format_date($info->{"dc:date"}[0]);
print " by $info->{'dc:creator'}[0]"
    if ($info->{'dc:creator'}[0]);
print "\n";

#
#   Take attributes from the meta:document-statistic element
#   (if any) and put them into %attr_hash
#
@attributes = @{$info->{'meta:document-statistic'}};

if (scalar(@attributes) > 1)
{
    shift @attributes;
    %attr_hash = @attributes;

    if ($suffix eq "sxw")
    {
        print "Pages:       $attr_hash{'meta:page-count'}\n";
        print "Words:       $attr_hash{'meta:word-count'}\n";
        print "Tables:      $attr_hash{'meta:table-count'}\n";
        print "Images:      $attr_hash{'meta:image-count'}\n";
    }
    elsif ($suffix eq "sxc")
    {
        print "Sheets:      $attr_hash{'meta:table-count'}\n";
        print "Cells:       $attr_hash{'meta:cell-count'}\n"
            if ($attr_hash{'meta:cell-count'});
    }
}

#
#   A convenience subroutine to make dates look
#   prettier than ISO-8601 format.
#
sub format_date
{
    my $date = shift;
    my ($year, $month, $day, $hr, $min, $sec);
    my @monthlist = qw (Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
    
    ($year, $month, $day, $hr, $min, $sec) =
        $date =~ m/(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})/;
    return "$hr:$min on $day $monthlist[$month-1] $year";
}


package MetaElementHandler; # (3)

my %element_info;   # the data structure that we are creating
my $element;        # name of element being processed
my @attributes;     # attributes for this element
my $text;           # text content of the element


sub new { # (4)
    my $class = shift;
    my %opts = @_;
    bless \%opts, $class;
}

sub reset {
    my $self = shift;
    %$self = ();
}

#
#   Store current element and its attributes.
#
sub start_element
{
    my ($self, $parser_data) = @_;
    
    my $hashref;    # (5)
    my $item;       # loop control variable

    $element = $parser_data->{"Name"};
    @attributes = ();   # discard attributes left over from earlier elements

    foreach $item (keys %{$parser_data->{"Attributes"}})
    {
        $hashref =  $parser_data->{"Attributes"}{$item};
        push @attributes, $hashref->{"Name"},  $hashref->{"Value"};
    }
    
    $text = ""; # no text content yet.
}

#
#   Create an entry into a hash for the element that is ending
#
sub end_element
{
    my ($self, $parser_data) = @_;

    $element = $parser_data->{"Name"};
    $element_info{$element} = [$text, @attributes];
}

#
#   Accumulate element's text content.
#
sub characters
{
    my ($self, $parser_data) = @_;
    $text .= $parser_data->{"Data"}; # (6)
}

#
#   Return a reference to the %element_info hash.
#
sub get_info    # (7)
{
    my $self = shift;
    return \%element_info;
}

(1) XML::SAX doesn’t read from file handles opened with the standard Perl open function; you have to use IO::File to create the file handle.

(2) The handler object has accumulated all the information from the meta.xml file into a hash. We ask the handler to return a reference to that hash.

(3) XML::SAX wants its handler subroutines to be in a Perl object. The package statement serves to “encapsulate” the variables and subroutines; as good citizens, we don’t directly access any of the variables from the main program.

(4) The new subroutine completes the work of making this package into a Perl object. The reset subroutine is for XML::SAX’s internal use.

(5) The $hashref variable is here for convenience; if we didn’t use it, the push statement would be even less readable than it already is.

(6) Note the .= operation; since the text inside an element can come from many calls to characters, we have to concatenate them all.

(7) This is not an XML::SAX routine; we are providing it so that we can hand a reference to our accumulated data back to the main program.
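
For readers who work mainly in Java, the same three-event pattern can be sketched with the SAX API that ships with the JDK. This is a minimal illustration under stated assumptions, not a port of sax_show_meta.pl: the class name MetaHandlerSketch and the collect() helper are hypothetical, and it parses an XML string rather than a member of a document. Like the Perl handler, it records each element's text followed by its attribute names and values, so it is only meaningful for elements that have no child elements.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class MetaHandlerSketch extends DefaultHandler
{
    /* element name -> [ text, attrName, attrValue, ... ] */
    private final Map<String, List<String>> info = new LinkedHashMap<>();
    private List<String> attributes = new ArrayList<>();
    private StringBuilder text = new StringBuilder();

    /* Remember this element's attributes and start with empty text */
    @Override
    public void startElement( String uri, String localName,
        String qName, Attributes atts )
    {
        attributes = new ArrayList<>();
        for (int i = 0; i < atts.getLength(); i++)
        {
            attributes.add( atts.getQName( i ) );
            attributes.add( atts.getValue( i ) );
        }
        text = new StringBuilder();
    }

    /* Text may arrive in several pieces, so append each one */
    @Override
    public void characters( char[] ch, int start, int length )
    {
        text.append( ch, start, length );
    }

    /* Store the text and attributes under the element's name */
    @Override
    public void endElement( String uri, String localName, String qName )
    {
        List<String> entry = new ArrayList<>();
        entry.add( text.toString() );
        entry.addAll( attributes );
        info.put( qName, entry );
    }

    /* Parse an XML string and return the collected information */
    static Map<String, List<String>> collect( String xml )
    throws ParserConfigurationException, SAXException, IOException
    {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware( true );
        MetaHandlerSketch handler = new MetaHandlerSketch();
        factory.newSAXParser().parse(
            new InputSource( new StringReader( xml ) ), handler );
        return handler.info;
    }
}
```

Given a fragment containing <dc:title>Report</dc:title>, the map returned by collect() holds the key dc:title with a list whose first item is the text Report.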


Content licensed under a Creative Commons License.
All content is copyright O’Reilly & Associates, Inc.
During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation’s GNU Free Documentation License.