Chapter 3. Text Document Basics

At this point we are ready to look at the specifics of the content.xml file for word processing documents. We will build up from the most basic elements, characters and paragraphs, to sections and pages. This chapter also covers the topic of lists and outlines in OpenOffice.org word processing documents.

All OpenOffice.org documents are based on Unicode, and are encoded in the UTF-8 encoding scheme, which is discussed in the section called “Unicode Encoding Schemes”. This means that you may freely mix characters from a variety of languages in an OpenOffice.org document, as shown in Figure 3.1, “Document with Mixed Languages”. It also means that those characters will not be easily viewable in a normal ASCII text editor.

In XML, whitespace in element content is typically not preserved unless specially designated. OpenOffice.org collapses consecutive whitespace characters, which are defined as space (0x0020), tab (0x0009), carriage return (0x000D), and line feed (0x000A), to a single space. How, then, does OpenOffice.org represent a document where whitespace is significant?

To handle extra spaces, OpenOffice.org uses the <text:s> element. This empty element has an optional attribute, text:c, which tells how many spaces occur in the document. If this attribute is absent, then the element inserts one space. Between words, the <text:s> element is used to describe spaces after the first one; thus, for a single space, you don’t need this element. At the beginning of a line, you do need the <text:s>, since OpenOffice.org eliminates leading whitespace immediately after a starting tag.

Tab stops are represented by the empty <text:tab-stop> element, and a line break, which is entered in OpenOffice.org by pressing Shift-Enter, is represented by the empty <text:line-break> element. Example 3.1, “Representation of Whitespace” shows these elements in action.

If you are using XSLT to extract the contents of an OpenOffice.org document to a plain text file, you may want to expand these elements into their original whitespace. Example 3.2, “XSLT Templates for Expanding Whitespace” shows the XSLT templates required to do this. The templates for <text:tab-stop> and <text:line-break> are easy; we just emit the proper Unicode value. The code becomes slightly complex when we get to <text:s>, because we need to be able to handle an arbitrary number of spaces. Here’s the pseudocode:

  • Create a variable named spaces, which contains 30 spaces. Remember to use the xml:space="preserve" attribute to prevent Xalan from “helpfully” collapsing this whitespace.
  • If the <text:s> doesn’t have a text:c attribute, simply emit one blank.
  • If there is a text:c attribute, call a template named insert-spaces and pass the number of spaces in as a parameter named n.
  • insert-spaces tests to see if $n is less than or equal to 30. If so, then the template emits that many spaces as a substring from the $spaces variable.
  • If there are more than 30 spaces required, insert-spaces emits the entire $spaces variable, and then calls itself with $n minus 30 as the new number of spaces to emit.
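The pseudocode above is written for XSLT, but the same chunking idea can be sketched in Java to make the recursion easier to follow. This is only an illustration: the class name SpaceExpander and its method names are hypothetical and are not part of the XSLT in Example 3.2.

```java
class SpaceExpander {
    // A reusable run of 30 spaces, like the $spaces variable in the pseudocode.
    static final String SPACES = new String(new char[30]).replace('\0', ' ');

    /** Returns n spaces, peeling off 30 at a time as insert-spaces does. */
    static String insertSpaces(int n) {
        if (n <= 0) {
            return "";                        // guard added for this sketch
        }
        if (n <= 30) {
            return SPACES.substring(0, n);    // the substring case
        }
        return SPACES + insertSpaces(n - 30); // emit all 30, recurse on the rest
    }

    /** countAttr is the text:c value, or null when the attribute is absent. */
    static String expand(String countAttr) {
        if (countAttr == null) {
            return " ";                       // no text:c means one space
        }
        return insertSpaces(Integer.parseInt(countAttr));
    }
}
```

For example, SpaceExpander.expand("65") takes the recursive branch twice (30 + 30) and finishes with a five-space substring.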

If you are creating an OpenOffice.org document from a source where whitespace has been preserved, you must reverse this process, creating the appropriate <text:s>, <text:tab-stop>, and <text:line-break> elements. While there may be a simple and clever way of doing this conversion in XSLT, it eludes this author entirely. The straightforward approach of looking at the input string character-by-character is totally unsuited to the XSLT processing model, so we have created a Java extension function, which you may find in the section called “OpenOffice.org White Space Representation”. Example 3.3, “Test XML file for Whitespace Conversion” shows a section of a test XML file, and Example 3.4, “Test XML file for Whitespace Conversion” shows part of the XSLT that calls the extension function.

1 We are using the abbreviated format for Xalan extensions written in Java. The xmlns:java describes the path name to the extension. We have placed the OOoWhiteSpace.class file in the same directory as the transformation program, so the fully qualified class name is just the class name; the exclude-result-prefixes ensures that the java prefix doesn’t appear in the transformation output.
2 This provides the bare bones for the document we are creating; we are declaring only the namespaces that are required in the resulting document.
3 When you call the extension function, it returns a set of nodes that represent your input string in OpenOffice.org format. If you use <xsl:copy-of>, the entire set will be copied to the output document. Don’t use <xsl:value-of>; that converts the node set to a string, and the string value of a node set is the string value of only its first node.

To run this transformation with the program shown in the section called “An XSLT Transformation”, we use the following command line, which invokes the shell script from the section called “Transformation Script”:

oootransform.sh -in whitespace_test.xml \
  -xsl whitespace_test.xslt \
  -outOOo whitespace_test.sxw -out content.xml

The product is a file named whitespace_test.sxw. Figure 3.2, “Document Created with Whitespace XSLT Extension” shows the result; the font has been made larger, and non-printing characters are displayed so that we can check that the file has the correct content.

Before proceeding, let’s note two things about the preceding example. First, congratulations! We’ve just created our first OpenOffice.org document without using the OpenOffice.org application. Second, we sneaked the <text:p> element into the example.
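Returning to the whitespace conversion itself: the character-by-character approach mentioned earlier is easy to express in Java. The following is a minimal sketch of that idea and is not the OOoWhiteSpace extension class referred to above. It returns a markup string rather than a node set, the class name WhitespaceEncoder is hypothetical, and it assumes that only the start of the text and the position right after a line break count as the beginning of a line.

```java
class WhitespaceEncoder {
    /** Converts preserved whitespace into OpenOffice.org text markup. */
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        boolean lineStart = true;  // true at start of text and after a line break
        int i = 0;
        while (i < input.length()) {
            char ch = input.charAt(i);
            if (ch == ' ') {
                int run = 0;
                while (i < input.length() && input.charAt(i) == ' ') {
                    run++;
                    i++;
                }
                if (lineStart) {
                    appendSpaces(out, run);      // leading spaces all need <text:s>
                } else {
                    out.append(' ');             // first space stays literal
                    appendSpaces(out, run - 1);  // the rest go into <text:s>
                }
                lineStart = false;
                continue;
            }
            if (ch == '\r') {                    // assumption: ignore carriage returns
                i++;
                continue;
            }
            switch (ch) {
                case '\t': out.append("<text:tab-stop/>"); break;
                case '\n': out.append("<text:line-break/>"); break;
                case '&':  out.append("&amp;"); break;
                case '<':  out.append("&lt;"); break;
                case '>':  out.append("&gt;"); break;
                default:   out.append(ch);
            }
            lineStart = (ch == '\n');
            i++;
        }
        return out.toString();
    }

    /** Emits nothing for 0, <text:s/> for 1, and a text:c count otherwise. */
    private static void appendSpaces(StringBuilder out, int count) {
        if (count == 1) {
            out.append("<text:s/>");
        } else if (count > 1) {
            out.append("<text:s text:c=\"").append(count).append("\"/>");
        }
    }
}
```

The result would be placed inside a <text:p> element; how spaces directly after a tab stop should be treated is left as an open question in this sketch.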

If you’re creating an OpenOffice.org document with an XSLT transformation, you don’t want it to say “I am a Fugitive from a Chain Printer.” You will need to add styles to your document, and this will require three steps:

Font declarations are written as described in the section called “Font Declarations”. Rather than writing them yourself, you may wish to use an existing document, or you may create a document in OpenOffice.org which contains one letter from each font you will want. You can then unpack the files and copy the declarations. Your third option is to write them by hand.

No matter which method you use, we recommend that you put the resulting declarations in a separate file and include them with an <xsl:include>. This makes your primary stylesheet shorter, and it also allows re-use of the declarations in other transformations. Example 3.5, “Font Declarations Include File” shows a sample stylesheet for inclusion into your primary transformation; Example 3.6, “Using an Included Font Declaration File” shows how you would use it.

OpenOffice.org lets you change the format of individual characters, paragraphs, or pages, as you see in Figure 3.3, “OpenOffice.org Format Menu”. The following attributes of the <style:properties> element affect character styles. Most of these attributes come from the XSL-FO namespace. (The <style:properties> element will be contained in a <style:style> element.)

style:font-name

The name of a <style:font-decl>. If you do not have any font declarations, you may use a fo:font-family attribute with a font name as its value. This is cheating. Don’t do it. (We said we would warn you!)

fo:font-size

The text size, expressed either as a length or a percentage. For fonts, a length is expressed as a positive integer followed by pt (points). Other units of measurement will be converted to points when you view them in OpenOffice.org.

fo:font-weight

Values are bold and normal.

fo:font-style

Values are italic and normal.

style:text-underline, style:text-underline-color

Oy, you wouldn’t believe how many underlining styles you have available to you! none, single, double, dotted, dash, long-dash, dot-dash, dot-dot-dash, wave, bold, bold-dotted, bold-dash, bold-long-dash, bold-dot-dash, bold-dot-dot-dash, bold-wave, double-wave, and small-wave. The style:text-underline-color is specified as in fo:color and has the additional value of font-color, which makes the underline color the same as the current text color.

fo:color, style:text-background-color

Text color and background color in the form of a six-digit hex value. Example: #cc32f5

fo:font-variant

This can have a value of normal or small-caps.

fo:text-transform

Possible values are none, lowercase, uppercase, and capitalize. capitalize corresponds to the "Title" choice in OpenOffice.org’s Character Font Effects dialog, which capitalizes the first letter of every word. uppercase corresponds to the "Capitalize" choice, which displays all the words in uppercase.

style:text-position

This attribute is used to create superscripts and subscripts. It can have two values; the first value is either sub or super, or a number which is the percentage of the line height (positive for superscripts, negative for subscripts). An optional second value gives the text height as a percentage of the current font height. Examples: style:text-position="super" produces normal superscripts, and style:text-position="-30 50" produces a subscript at 30% of the font height below the baseline, with letters 50% of the current font height.

style:text-rotation-angle

Number of degrees to rotate text counterclockwise; the value can be 0, 90, or 270.

Before we go further, let’s put these to work. Figure 3.4, “Styled Headings” shows two headings. The first one is a level five heading which we have made red and italic. (If you are reading this in a printed book, use your imagination to see the color.) The second heading is a level five heading with the red italics applied to only some of the words. In order to apply styles to only part of a paragraph or heading, we need to enclose it in a <text:span> element, which delineates an inline area of text.

In any case, we do not apply the style attributes directly to the heading, paragraph, or span. Instead, we declare the style in the <office:automatic-styles> area and then use a text:style-name attribute in the <text:p>, <text:h>, or <text:span>.

Example 3.7, “Markup for <text:h>” shows the relevant excerpts of the XML for the two headings.

1 OpenOffice.org’s level 5 heading style is found in the styles.xml file; its style:family attribute shows that it applies to block elements such as paragraphs and headings.
2 This definition in content.xml creates a style based on Heading 5, so it also has style:family="paragraph" and its style:name begins with P.
3 The inline style has a style:family="text" and its style:name begins with T.
4 Styles are always applied by referring to the appropriate text:style-name.

Paragraph styles affect the location, indentation, and look of paragraphs (and headings). Here are some of the styles you will most commonly use.

fo:line-height

This specifies a fixed line height; specifying none lets OpenOffice.org do its normal line height calculation. Specifying a length (24pt) or a percentage (125%) may lead to overlapping or cut-off text if some characters are larger than the line height.

style:line-height-at-least

The value is a length which specifies the minimum line height.

style:line-spacing

The value is a length that specifies a fixed distance between lines in a paragraph.[3]

fo:text-align

Values for this attribute are start, end, center, and justify. OpenOffice.org maps “left” to start and “right” to end, no matter the directionality of the text.

fo:margin-left, fo:margin-right, fo:text-indent

The values for the margins are a positive length telling how far to indent from the given side; the value for the first-line indent can be either positive or negative. If you specify fo:text-indent, you must also specify margins.

fo:margin-top, fo:margin-bottom

The value for these attributes is a length, or a percentage relative to the parent style.

fo:break-before, fo:break-after

The value of column or page tells whether OpenOffice.org should put a column or page break before or after the paragraph. You may use only one of these in a style specification. The default value of auto lets OpenOffice.org make the decision as to whether a break is necessary before the text.

fo:background-color

This is the background color for the paragraph, expressed as a six-digit hex value.

style:background-transparency

A number ranging from 0 (opaque) to 100 (completely transparent).

You may draw borders on all four sides of a paragraph by specifying fo:border. You may set an individual side with fo:border-left, fo:border-right, fo:border-top, and fo:border-bottom. Each of these has a value of the form:

width style color

Where:

If you have a double border, you may completely control the spacing of the lines by specifying a style:border-line-width (or style:border-line-side) attribute which has three length specifiers as its value:

Example 3.8, “Border Specification” shows the markup required for a green double border on all four sides, with an inner line width of 0.5 millimeter, a distance of 0.25 millimeters between the lines, and an outer line width of 1 millimeter. The total width of the border is the sum of the individual widths and distances of the style:border-line-width attribute.

To set the padding between the border and the paragraph content, use the fo:padding attribute (for all four sides), or fo:padding-left, fo:padding-right, fo:padding-top, and fo:padding-bottom to set padding on sides individually. The value of these attributes is a length specifier.

So, how does the <text:tab-stop> element know where the tabs are? You tell it by adding a <style:tab-stops> element, which contains a list of <style:tab-stop> elements.

The <style:tab-stop> element has a required style:position attribute, whose value is a length specification. By default, tab stops are left-aligned, but you may change this with the style:type attribute, whose value may be one of left, center, right, or char. This last value lets you align on a specific character, such as a decimal point. The alignment character is specified as the value of the style:char attribute. Space between tab stops is normally filled with blanks. You may specify a different filler character (also called a "leader") as the value of the style:leader-char attribute. The leader character fills the space before the tab stop.

Example 3.9, “Various Tab Stops” shows the XML for a paragraph with a left-aligned tab stop at one centimeter, a right-aligned stop at two centimeters, a centered stop at three centimeters with a dash as a leader character, and a tab stop on comma at four centimeters. Figure 3.5, “Tab Stops in OpenOffice.org” shows a paragraph using this formatting.

At this point, we know enough to write an XSLT document that will extract all the headings from an OpenOffice.org text document and create a new “outline” document. A heading at level x is preceded by x-1 tabs. Thus, a level one heading starts in the left margin of the document, and a level three heading has two tab stops followed by the heading text.

Example 3.10. Extracting Headings from an OpenOffice.org Document

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:text="http://openoffice.org/2000/text"
> <!-- 1 -->

<xsl:output method="xml"
    doctype-public="-//OpenOffice.org//DTD OfficeDocument 1.0//EN"
    doctype-system="office.dtd"/>

<xsl:template match="/office:document-content">
    <office:document-content xmlns:office="http://openoffice.org/2000/office"
         xmlns:style="http://openoffice.org/2000/style"
         xmlns:text="http://openoffice.org/2000/text"
         xmlns:fo="http://www.w3.org/1999/XSL/Format"
         office:class="text">
    <office:script />
    <office:font-decls>
        <style:font-decl style:name="Lucidasans1"
        fo:font-family="Lucidasans" />
    </office:font-decls>
    <office:automatic-styles>
        <style:style style:name="P1" style:family="paragraph">
            <style:properties style:font-name="Lucidasans1"
            fo:font-size="12pt" style:font-size-asian="12pt"
            style:font-size-complex="12pt">
                <style:tab-stops> <!-- 2 -->
                    <style:tab-stop style:position="1cm" />
                    <style:tab-stop style:position="2cm" />
                    <style:tab-stop style:position="3cm" />
                    <style:tab-stop style:position="4cm" />
                    <style:tab-stop style:position="5cm" />
                    <style:tab-stop style:position="6cm" />
                    <style:tab-stop style:position="7cm" />                  
                </style:tab-stops>
            </style:properties>
        </style:style>
    </office:automatic-styles>
    <office:body>
        <xsl:apply-templates select="//text:h"/> <!-- 3 -->
    </office:body>
    </office:document-content>
</xsl:template>

<xsl:template match="text:h">
    <text:p text:style-name="P1">
    <xsl:call-template name="emit-tabs">
        <xsl:with-param name="n" select="@text:level - 1"/>
    </xsl:call-template>
    <xsl:value-of select="."/>
    </text:p>
</xsl:template>

<xsl:template name="emit-tabs"> <!-- 4 -->
    <xsl:param name="n" select="0"/>
    <xsl:if test="$n != 0">
        <text:tab-stop/>
        <xsl:call-template name="emit-tabs">
            <xsl:with-param name="n" select="$n - 1"/>
        </xsl:call-template>
    </xsl:if>
</xsl:template>
</xsl:stylesheet>
1 To make this document shorter, we’ve included only the namespaces that are necessary both here and in the root element of the output document.
2 There are two ways to implement this stylesheet. You can create a separate style for each level of heading, or have a single style with all the tab stops in it. The second method seems more to the point, though it does make the processing a bit more difficult.
3 This is the place where laziness triumphs; the XPath expression finds all <text:h> elements at any level in the document. It’s inefficient in terms of machine time, but easy for us.
4 And this is where we pay for being lazy earlier. XSLT doesn’t have a "for loop" such as those in procedural programming languages, so we need to produce the required number of <text:tab-stop> elements by recursion. This template gets a parameter, n; if the value is non-zero, the template emits a <text:tab-stop> into the output, then calls itself to emit n-1 more tab stops.

You start a new section of a document by enclosing your content in a <text:section> element. This element has a required text:name attribute, whose value is an internal name for the section. The <text:section> element also has an optional text:style-name attribute, which has the same name as an existing <style:style> element.

The <style:style> that the section refers to will have a style:family="section". As with all other <style:style>s, it will contain a <style:properties> element. If the contents of the columns are to be evenly distributed to all the columns, then the <style:properties> element will have its text:dont-balance-text-columns attribute set to false. A value of true indicates that column contents are not to be distributed equally.

The <style:properties> element will in turn contain a <style:columns> element. The <style:columns> element has a required fo:column-count attribute, whose value is the number of columns. If the columns are equally spaced, the fo:column-gap attribute gives the spacing between the columns.

The <style:columns> element contains one <style:column> element for each column. Each column has these attributes:

style:rel-width

The proportional width of the column expressed in twips[4] followed by an asterisk instead of a length unit. Thus, a one-inch wide column is specified as style:rel-width="1440*".

fo:margin-left, fo:margin-right

These specify the inter-column spacing in absolute units.

The style:rel-width includes inter-column spacing. Given the specifications shown in Figure 3.6, “Column Spacing”, the total width of the first column is 1.125 inches (one inch plus half of the quarter-inch spacing). The total width of the second column is 2.125 inches (1.75 inches plus half of the quarter-inch spacing plus half of the half-inch spacing). The total width of the third column is 2.25 inches (two inches plus half of the half-inch spacing).
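The arithmetic above can be checked with a short calculation. The following sketch assumes the column widths (1, 1.75, and 2 inches) and gaps (0.25 and 0.5 inches) just described and uses the 1440 twips per inch mentioned earlier; the class name ColumnWidths and its method are hypothetical helpers for illustration.

```java
class ColumnWidths {
    static final int TWIPS_PER_INCH = 1440;

    /**
     * widths[i] is the content width of column i in inches;
     * gaps[i] is the spacing between column i and column i+1 in inches.
     * Each column absorbs half of every gap that touches it.
     */
    static String[] relWidths(double[] widths, double[] gaps) {
        String[] result = new String[widths.length];
        for (int i = 0; i < widths.length; i++) {
            double total = widths[i];
            if (i > 0) {
                total += gaps[i - 1] / 2;   // half of the gap on the left
            }
            if (i < widths.length - 1) {
                total += gaps[i] / 2;       // half of the gap on the right
            }
            result[i] = Math.round(total * TWIPS_PER_INCH) + "*";
        }
        return result;
    }
}
```

With those inputs the totals are 1.125, 2.125, and 2.25 inches, which correspond to style:rel-width values of 1620*, 3060*, and 3240*.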

If you place separators between columns, then you must place a <style:column-sep> element before the first <style:column> element. The <style:column-sep> element has these attributes:

Example 3.11, “OpenOffice.org Representation of Sections” shows the XML for the three-column section with column widths and spacing as shown in Figure 3.6, “Column Spacing”. The columns have a five-point vertically centered separator line, and the section is indented one half inch from each margin. Content is not distributed evenly among the columns.

If you do not distribute content equally, then you may need to insert a manual section break within the text. This is done by applying a style with fo:break-before="column" in its <style:properties>. Example 3.12, “Using a Section Style” shows the relevant style and content that uses the preceding section definition.

Inserting a page break works in the same way as a section break: a paragraph or heading references a style which has a <style:properties> element with a fo:break-before="page" attribute. Unlike sections, the specifications for a page’s characteristics go not only into the content.xml file, but into the styles.xml file as well. Figure 3.7, “Relationship Among Files When Specifying Pages” shows how the file contents are related.

The styles.xml file contains a <style:page-master> element for every different type of page your document uses. This element has a required style:name attribute and an optional style:page-usage attribute. The page usage can have a value of all (the default), left, right, or mirrored. If you use mirrored, then margins are mirrored as you move from page to page. The <style:page-master> elements are placed within the <office:automatic-styles> element.

The <style:page-master>’s content starts with a <style:properties> element that has these attributes:

fo:page-width, fo:page-height

the value is a length, such as 21.9cm.

fo:margin-top, fo:margin-bottom, fo:margin-left, fo:margin-right

the value is a length.

style:print-orientation

value is either portrait or landscape.

style:writing-mode

one of: lr-tb (left to right; top to bottom), rl-tb, tb-rl, tb-lr, lr, rl, tb, and page. We have not been able to determine what page does.

fo:background-color

a six-digit hex value, such as #ffff99; if omitted, the background is unfilled. If you are using a background image, then set fo:background-color to transparent, and use a <style:background-image> element as described in the section called “Background Images”.

style:num-format

the page number format; possible values are 1, the default of arabic numerals, a and A for lowercase and uppercase lettering, and i and I for lowercase and uppercase roman numerals.

style:footnote-max-height

the value is a length giving the maximum footnote height. If the value is zero, then the footnote area cannot be larger than the page area. Note: although a value of zero does not require a unit specifier, OpenOffice.org does add one, so you may see a value such as 0inch.

If your document contains footnotes, then the <style:properties> element, in turn, contains an optional <style:footnote-sep> element that describes how footnotes are separated from the main text. Its attributes are:

style:width

value is a length specifier giving the thickness of the separator line.

style:distance-before-sep, style:distance-after-sep

these values are length specifiers that give the distance before and after the footnote separator (as their names so aptly indicate).

style:adjustment

the alignment of the separator line; values are left (default), center, or right.

style:rel-width

how far the separator extends across the page, expressed as a percentage. Thus, a separator that takes up one fourth of the page width has a value of 25%.

But wait, that’s not all. If you have headers and footers on your page, you must add a <style:header-style> and/or <style:footer-style> as appropriate. Each of these tags contains a <style:properties> element that specifies any attributes you wish to apply to the header or footer. If you don’t have a header or footer, make these elements empty.

None of this is complicated; there’s just so much of it. The following is an outline of what a page master style looks like:

<style:page-master style:name="pm4">
    <style:properties page width, height, margins, writing mode>
        <style:footnote-sep line thickness, width, and distances />
    </style:properties>
    
    <style:header-style>
        <style:properties header specifications/>
    </style:header-style>
    
    <style:footer-style>
        <style:properties footer specifications/>
    </style:footer-style>
</style:page-master>

Example 3.13, “Full Page Master Specification” shows a complete specification for a landscape-oriented page that has both a header and footer. It has been reformatted for ease of reading. You may also wonder what is the absolute minimum that you can get away with; if you are not using headers or footers, you can make a workable portrait-oriented page master with the specifications shown in Example 3.14, “Minimal Page Master Specification”.

In addition to the <office:automatic-styles> in the styles.xml file, you must have an <office:master-styles> element. This element contains header and footer content for each type of page, and also tells how pages follow one another. (For example, you might have a document where the first page is an envelope and the subsequent pages are letter-sized.)

The <office:master-styles> element contains one <style:master-page> for each <style:page-master> element that you have defined. The <style:master-page> element has a required style:name attribute which gives the name that appears in the OpenOffice.org style catalog. The other required attribute is style:page-master-name, whose value is the name of the <style:page-master> defined earlier in the automatic styles.

If this page master has a specific page style that follows it (for example, a “title page” might be followed by a “contents page”), you add a style:next-style-name attribute.

If your page has a header or footer, this is where its content goes; not in the content.xml file where you might expect it. Example 3.15, “Master Styles” shows the master styles section for a document with two page styles, one of which is a landscape oriented page with a header and footer.

The specifications for bulleted, numbered, and outline lists are contained entirely within the content.xml document, and are related to one another as shown in Figure 3.8, “Relationship Among Elements When Specifying Lists”.

The essential information is contained in the <text:list-style> element, which contains ten <text:list-level-style-bullet> elements if the list is all bulleted, <text:list-level-style-number> elements if the list is all numbered, or a mixture if the list is outlined. There are ten of these elements because OpenOffice.org allows a maximum of ten list levels.

The <text:list-level-style-bullet> and <text:list-level-style-number> elements have the following attributes in common:

The following attributes apply only to numbered lists.

Thus, if you had numbered items of the form (a), (b), etc., the appropriate attributes would be style:num-prefix="(", style:num-format="a", and style:num-suffix=")".
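To see how the three attributes combine into a label, here is a small sketch in Java. It assumes that style:num-format for lists accepts the same values described for page numbering (1, a, A, i, and I), limits letters to the range 1 through 26, and uses a hypothetical class named ListLabel; it is an illustration, not a description of how OpenOffice.org itself builds labels.

```java
class ListLabel {
    /** Builds a label such as "(b)" from the prefix, format, and suffix. */
    static String format(int n, String prefix, String numFormat, String suffix) {
        return prefix + number(n, numFormat) + suffix;
    }

    static String number(int n, String numFormat) {
        switch (numFormat) {
            case "1": return Integer.toString(n);
            case "a": return letter(n);
            case "A": return letter(n).toUpperCase();
            case "i": return roman(n).toLowerCase();
            case "I": return roman(n);
            default:
                throw new IllegalArgumentException("unsupported format: " + numFormat);
        }
    }

    /** Lowercase letter for 1..26; larger values are outside this sketch. */
    static String letter(int n) {
        if (n < 1 || n > 26) {
            throw new IllegalArgumentException("letter out of range: " + n);
        }
        return String.valueOf((char) ('a' + n - 1));
    }

    /** Uppercase roman numeral using the usual subtractive pairs. */
    static String roman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"M", "CM", "D", "CD", "C", "XC", "L", "XL",
                            "X", "IX", "V", "IV", "I"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }
}
```

Calling ListLabel.format(2, "(", "a", ")") yields (b), matching the example in the text.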

Each <text:list-level-style-number> or <text:list-level-style-bullet> element contains a <style:properties> which specifies:

Once you have established the list styles, you use them by creating a <text:unordered-list> for a bulleted list, or a <text:ordered-list> for a numbered or outline list. This element will have a text:style-name attribute that refers to the list style you want. Each item in the list will be contained within a <text:list-item> element.

Example 3.17, “XML for an Outline List” shows the relevant part of the XML that produces the outline shown in Figure 3.9, “Screenshot of an Outline List”.

Example 3.17. XML for an Outline List

<office:automatic-styles>

<style:style style:name="P1"
    style:family="paragraph"
    style:parent-style-name="Standard"
    style:list-style-name="L1" />

<text:list-style style:name="L1">
    <text:list-level-style-number
        text:level="1"
        text:style-name="Numbering Symbols"
        style:num-prefix="" style:num-suffix="." style:num-format="1">
        <style:properties
            text:min-label-width="0.1965inch" />
    </text:list-level-style-number>

    <text:list-level-style-number
        text:level="2"
        text:style-name="Numbering Symbols"
        style:num-prefix="" style:num-suffix=")" style:num-format="a">
        <style:properties
            text:space-before="0.1972inch"
            text:min-label-width="0.1965inch" />
    </text:list-level-style-number>
    
    <text:list-level-style-bullet
        text:level="3"
        text:style-name="Bullet Symbols"
        style:num-prefix="" style:num-suffix=""
        text:bullet-char="•">
        <style:properties
            text:space-before="0.3937inch"
            text:min-label-width="0.1965inch"
            style:font-name="StarSymbol" />
    </text:list-level-style-bullet>
    
    <!-- the bullet is repeated for levels 4 through 10 -->
</text:list-style>
</office:automatic-styles>

<office:body>
    <text:ordered-list text:style-name="L1">
        <text:list-item>
            <text:p text:style-name="P1">Cats</text:p>
            <text:ordered-list>
                <text:list-item>
                    <text:p text:style-name="P1">Shorthair</text:p>
                </text:list-item>
                <text:list-item>
                    <text:p text:style-name="P1">Longhair</text:p>
                </text:list-item>
            </text:ordered-list>
        </text:list-item>

        <text:list-item>
            <text:p text:style-name="P1">Dogs</text:p>
        </text:list-item>

        <text:list-item>
            <text:p text:style-name="P1">Fish</text:p>
        </text:list-item>
    </text:ordered-list>
</office:body>

In Example 3.10, “Extracting Headings from an OpenOffice.org Document”, we extracted the headings from an OpenOffice.org document and placed them into a new document. In this case study, we will add the headings to the current document. They will be represented as a bulleted list in a section at the beginning of the document. This is definitely the most ambitious example so far. In fact, at several points we almost abandoned the idea in favor of a simpler example. However, we realized that we would have to handle the tricky details at some point, and there was no time like the present.

This program is written in Java, and is run from the command line. It takes two arguments: the name of the original file, and the name of the new file. We had considered simply modifying the original file in place, but if you ran the program twice you’d end up with two sets of bullet items. Here’s the plan:

  • Copy all the JAR file entries other than content.xml directly to the new file.
  • Parse the content.xml JAR entry and build a document tree.
  • Add a new paragraph style, list style, and section style to the document tree. This avoids conflicts with existing styles. It also requires us to find the largest paragraph, list, and section style numbers so that we can assign a unique identifier to our new styles.[5]
  • Add a new section at the beginning of the document body.
  • Add a bulleted list to the section. The bullet levels correspond to the heading levels. This is an interesting process in itself, and we’ll talk about it when we get there.
  • Write the updated document tree to the output file as the new content.xml JAR entry.
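The first step of the plan, copying every entry except content.xml, can be sketched in isolation. The code below is a minimal illustration under stated assumptions and is not the program developed in this case study: the class JarCopier and its methods are hypothetical, and demoEntryNames() builds a tiny in-memory archive only so that the copy can be tried without a real document.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

class JarCopier {
    /** Copies all entries except skipName; neither stream is closed here. */
    static void copyOtherEntries(JarInputStream inJar, JarOutputStream outJar,
                                 String skipName) throws IOException {
        byte[] buffer = new byte[4096];
        JarEntry entry;
        while ((entry = inJar.getNextJarEntry()) != null) {
            if (entry.getName().equals(skipName)) {
                continue;                       // handled separately later
            }
            // A fresh entry avoids reusing the source's compressed size.
            outJar.putNextEntry(new JarEntry(entry.getName()));
            int n;
            while ((n = inJar.read(buffer)) != -1) {
                outJar.write(buffer, 0, n);
            }
            outJar.closeEntry();
        }
    }

    /** Round trip on an in-memory archive; returns entry names in the copy. */
    static List<String> demoEntryNames() {
        try {
            ByteArrayOutputStream source = new ByteArrayOutputStream();
            try (JarOutputStream jar = new JarOutputStream(source)) {
                for (String name : new String[] {"meta.xml", "content.xml", "styles.xml"}) {
                    jar.putNextEntry(new JarEntry(name));
                    jar.write(("<!-- " + name + " -->").getBytes(StandardCharsets.UTF_8));
                    jar.closeEntry();
                }
            }
            ByteArrayOutputStream copy = new ByteArrayOutputStream();
            try (JarInputStream in = new JarInputStream(
                     new ByteArrayInputStream(source.toByteArray()));
                 JarOutputStream out = new JarOutputStream(copy)) {
                copyOtherEntries(in, out, "content.xml");
            }
            List<String> names = new ArrayList<>();
            try (JarInputStream check = new JarInputStream(
                     new ByteArrayInputStream(copy.toByteArray()))) {
                JarEntry e;
                while ((e = check.getNextJarEntry()) != null) {
                    names.add(e.getName());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because copyOtherEntries leaves the output stream open, the caller can still add the rewritten content.xml entry afterwards, which is what the last step of the plan requires.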

We start with the declarations of imported classes:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;

import java.util.Date;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

import org.apache.xerces.parsers.DOMParser;
import org.apache.xml.serialize.XMLSerializer;
import org.apache.xml.serialize.OutputFormat;

import org.xml.sax.InputSource;

We continue with the class declaration and class variables. In addition to finding the last list, paragraph, and section style numbers, we also need to keep track of their location in the document tree so that we can add our new styles immediately after the existing ones.


public class AddOutline
{
    /* The parsed document */
    protected Document document = null;
    
    /* Permanent pointer to root element of output document */
    protected Element documentRoot;

    /* List of heading elements */
    protected NodeList headingElements;

    /* Last numbers for paragraph, list, and section styles */
    int         lastParaNumber = 0;
    int         lastListNumber = 0;
    int         lastSectionNumber = 0;

    /* Node locations of last paragraph, list, and section styles */
    Node        lastParaNode = null;
    Node        lastListNode = null;
    Node        lastSectionNode = null;

    /* File descriptors */
    File        inFile;
    File        outFile;
    
    /* Streams for reading and writing JAR files */
    JarInputStream  inJar;
    JarOutputStream outJar;

The main program is simplicity itself; it checks for the proper number of arguments, creates an instance of the class, and hands the arguments to the modifyDocument method.

public static void main(String argv[]) {

    // check for proper number of arguments
    if (argv.length != 2) {
        System.err.println("usage: java AddOutline filename newfilename");
        System.exit(1);
    }

    AddOutline adder = null;

    adder = new AddOutline();
    adder.modifyDocument(argv);  

} // main(String[])

Here is the modifyDocument method:

protected void modifyDocument(String argv[])
{
    JarEntry inEntry;
    JarEntry outEntry;

    /* Create file descriptors */
    inFile = new File(argv[0]); // (1)
    outFile = new File(argv[1]);

    openInputFile();

    /* Get the manifest from the input file */
    Manifest manifest = inJar.getManifest( );

    /* Open output file, copying manifest if it existed */
    try
    {
        if (manifest == null)
        {
            outJar = new JarOutputStream(new FileOutputStream(outFile));
        }
        else
        {
            outJar = new JarOutputStream(new FileOutputStream(outFile),
                manifest);
        }
    }
    catch (IOException e)
    {
        System.err.println("Unable to open output file.");
        System.exit(1);
    }

    try // (2)
    {
        byte    buffer[] = new byte[16384];
        int     nRead;

        while ((inEntry = inJar.getNextJarEntry()) != null)
        {
            if (!inEntry.getName().equals("content.xml"))
            {
                /*
                 * Create output entry based on information in
                 * corresponding input entry
                 */
                outEntry = new JarEntry(inEntry);
                outJar.putNextEntry(outEntry);
                
                /* Copy data */
                while ((nRead = inJar.read(buffer, 0, 16384)) != -1)
                {
                    outJar.write(buffer, 0, nRead);
                }
            }
        }

        inJar.close(); // (3)
        openInputFile();

        while ((inEntry = inJar.getNextJarEntry()) != null &&
            !(inEntry.getName().equals("content.xml")))
        {
            /* do nothing */
        }

        /*
         * Create output entry based on information in
         * corresponding input entry, but update its
         * timestamp.
         */
        outEntry = new JarEntry(inEntry);
        outEntry.setTime(new Date().getTime());
        outJar.putNextEntry(outEntry);

        /* (4) */
        document = readContent();   /* parse content.xml */
        processContent();   /* add styles and bulleted list */
        writeContent();     /* write it to output JAR file */

        outJar.close();
    }
    catch (IOException e)
    {
        System.err.println("Error while creating new file");
        e.printStackTrace();
    }
}
1 Start by opening the input file, and copying its manifest file (if any) to the output file. We use a method to open the input file, because we’ll need to do it twice.
2 The next stage is to copy all the JAR entries other than the content.xml file. The try block that starts here extends to nearly the end of the method; any error gives a generic error message and a stack trace.
3 We must then close the input file and re-open it in order to find the content.xml JAR entry and process it. You may be wondering why we didn’t just do this as one loop, copying all the entries except content.xml and processing it specially. We tried that, and it doesn’t work; the XML parser closes its input stream when it finishes, so the loop would fail with an “Input stream closed” error when it got to the entry after content.xml.

In this second loop, we use a while loop with no body to get to the desired entry in the JAR file.

4 Here’s where the main work of creating the outline occurs; we will look at it in detail shortly.
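A common workaround for the problem described in callout 3 is to hand the parser a wrapper stream whose close method does nothing, so that the underlying archive stream stays open. The following is only a sketch of that idea, not what AddOutline does: the class names NonClosingInputStream and SinglePassDemo are hypothetical, the demonstration uses a plain ZIP archive built in memory rather than a real document, and it assumes the parser does nothing more drastic than call close on the stream it is given. We have not tested it against the Xerces version used in this chapter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/* Hypothetical helper: passes reads through but ignores close() */
class NonClosingInputStream extends FilterInputStream
{
    NonClosingInputStream(InputStream in)
    {
        super(in);
    }

    @Override
    public void close()
    {
        /* deliberately empty; the caller closes the real stream later */
    }
}

class SinglePassDemo
{
    /* Build a small two-entry archive in memory for the demonstration */
    static byte[] makeArchive() throws IOException
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes))
        {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("styles.xml"));
            zip.write("<styles/>".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    /*
     * Read the first entry through the wrapper, close the wrapper the
     * way a parser would, and return the name of the entry that follows.
     */
    static String nextEntryAfterWrapperClose()
    {
        try (ZipInputStream zip =
            new ZipInputStream(new ByteArrayInputStream(makeArchive())))
        {
            zip.getNextEntry();     /* positions us at content.xml */
            InputStream shielded = new NonClosingInputStream(zip);
            while (shielded.read() != -1)
            {
                /* consume the entry, as a parser would */
            }
            shielded.close();       /* has no effect on zip itself */
            return zip.getNextEntry().getName();
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args)
    {
        System.out.println(nextEntryAfterWrapperClose());
    }
}
```

With a wrapper like this, a single loop over the entries might be possible, at the cost of one more class to maintain; the two-pass approach used in modifyDocument avoids that dependency on parser behavior.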

Here’s the code that opens the input file; nothing special to see here—keep moving along.

public void openInputFile()
{
    try
    {
        inJar = new JarInputStream(new FileInputStream(inFile));
    }
    catch (IOException e)
    {
        System.err.println("Unable to open input file.");
        System.exit(1);
    }
}

It doesn’t take much code to parse the XML file either. We need to create a parser, set its input source to the entry from the JAR file, and get the result when everything finishes. We must also have Xerces ignore the (non-existent) office.dtd in the <!DOCTYPE>; the relevant line is the call to setEntityResolver, which uses the ResolveOfficeDTD class described in the section called “Getting Rid of the DTD”.

public Document readContent( )
{
    try
    {
        DOMParser parser = new org.apache.xerces.parsers.DOMParser();
        parser.setEntityResolver(new ResolveOfficeDTD());
        parser.parse(new InputSource(inJar));
        return parser.getDocument();
    }
    catch (Exception e)
    {
        e.printStackTrace(System.err);
        return null;
    }
}

Now we must add the styles and text to the document tree; this is done in method processContent. The calls to addSectionProperties, addBullets, and addHeadings invoke methods that do much of the heavy lifting; we will look at each of those methods individually.

public void processContent()
{
    Node        autoStyles; /* the <office:automatic-styles> element */
    Element     bodyStart;  /* the <office:body> element */
    Element     textStart;  /* place to insert new text */
    Element     element;    /* used for any element we create */

    if (document == null)
    {
        return;
    }

    documentRoot = (Element) document.getDocumentElement();

    headingElements = document.getElementsByTagName("text:h"); // (1)
    if (headingElements.getLength() == 0)
    {
        return;
    }

    autoStyles = findFirstChild(documentRoot, "office:automatic-styles"); // (2)
    findLastItems(autoStyles);

    /*
     * Prepare to add the new styles by going to the next
     * available number. We will insert the new style before
     * the next sibling of the last node. (3)
     */
    lastParaNumber++;
    lastListNumber++;
    lastSectionNumber++;
    if (lastParaNode != null)
    {
        lastParaNode = lastParaNode.getNextSibling();
    }
    if (lastListNode != null)
    {
        lastListNode = lastListNode.getNextSibling();
    }
    if (lastSectionNode != null)
    {
        lastSectionNode = lastSectionNode.getNextSibling();
    }

    /*
     * Create a <style:style> element for the new paragraph,
     * set its attributes and insert it after the last paragraph
     * style.
     */
    element = document.createElement("style:style");
    element.setAttribute("style:name", "P" + lastParaNumber);
    element.setAttribute("style:family", "paragraph");
    element.setAttribute("style:list-style-name", "L" + lastListNumber);
    element.setAttribute("style:parent-style-name", "Standard");        
    autoStyles.insertBefore(element, lastParaNode);

    /*
     * Create a <style:style> element for the new section,
     * set its attributes and insert it after the last section
     * style.
     */
    element = document.createElement("style:style");
    element.setAttribute("style:name", "Sect" + lastSectionNumber);
    element.setAttribute("style:family", "section");
    addSectionProperties(element);
    autoStyles.insertBefore(element, lastSectionNode);

    /*
     * Create a <text:list-style> element for the new list,
     * set its attributes and insert it after the last list
     * style.
     */
    element = document.createElement("text:list-style");
    element.setAttribute("style:name", "L" + lastListNumber);
    addBullets(element);
    autoStyles.insertBefore(element, lastListNode);

    /*
     * Now proceed to where we will add text;
     * it's just after the first <text:sequence-decls>
     * in the <office:body> 
     */
    bodyStart = findFirstChild(documentRoot, "office:body");
    textStart = findFirstChild(bodyStart, "text:sequence-decls");
    textStart = getNextElementSibling( textStart ); // (4)

    /*
     * Add a section
     */
    element = document.createElement("text:section");
    element.setAttribute("text:style-name", "Sect" + lastSectionNumber);
    element.setAttribute("text:name", "Section" + lastSectionNumber);
    addHeadings(element);
    
    bodyStart.insertBefore(element, textStart);
}
1 Gather up all the <text:h> elements in the document; if there are none, then our job here is done.
2 This line uses a utility method (findFirstChild) to find the first child of the document root whose element name is office:automatic-styles.
3 Expanding on the comment: we want to place the new paragraph style after the last paragraph style in the current document, the new section style after the last existing section style, etc. However, there’s no insertAfter method, so we have to insertBefore the next sibling of the desired node.
4 We haven’t discussed the <text:sequence-decls> element yet; it’s used for numbering items in OpenOffice.org documents. The main text in your document normally immediately follows this element. However, if there are any text nodes (such as newlines) between elements, getNextSibling will return one of those text nodes rather than the element we want; thus, we use our own utility method getNextElementSibling.
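The insertBefore idiom from callout 3 can be wrapped in a small helper. This is a sketch only; insertAfter and the InsertAfterDemo class are hypothetical names, and the element names P1, P2, and P3 are stand-ins for real style elements. It relies on the DOM rule that passing null as the reference node to insertBefore appends the new node at the end of the parent's children, which is also why processContent still works when no style of a given kind exists yet.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class InsertAfterDemo
{
    /* Hypothetical helper: insert newNode immediately after refNode */
    static void insertAfter(Node newNode, Node refNode)
    {
        /* A null next sibling makes insertBefore append at the end */
        refNode.getParentNode().insertBefore(newNode,
            refNode.getNextSibling());
    }

    /*
     * Build <styles><P1/><P2/></styles>, insert <P3/> after P2 or
     * after P1, and return the resulting order of the children.
     */
    static String demo(boolean afterLast)
    {
        Document doc;
        try
        {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        }
        catch (ParserConfigurationException e)
        {
            throw new IllegalStateException(e);
        }

        Element root = doc.createElement("styles");
        Element p1 = doc.createElement("P1");
        Element p2 = doc.createElement("P2");
        doc.appendChild(root);
        root.appendChild(p1);
        root.appendChild(p2);

        insertAfter(doc.createElement("P3"), afterLast ? p2 : p1);

        StringBuilder order = new StringBuilder();
        for (Node n = root.getFirstChild(); n != null;
            n = n.getNextSibling())
        {
            order.append(n.getNodeName()).append(' ');
        }
        return order.toString().trim();
    }

    public static void main(String[] args)
    {
        System.out.println(demo(false));    /* prints "P1 P3 P2" */
        System.out.println(demo(true));     /* prints "P1 P2 P3" */
    }
}
```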

As long as we’re talking about the utility routines, they’re fairly short, so we may as well present them here:

/*
 * Find first element with a given tag name
 * among the children of the given node.
 */
public Element findFirstChild(Node startNode, String tagName)
{
    startNode = startNode.getFirstChild();
    /* Stop at the first matching element, or at null if there is none */
    while (startNode != null &&
        !(startNode.getNodeType() == Node.ELEMENT_NODE &&
        ((Element)startNode).getTagName().equals(tagName)))
    {
        startNode = startNode.getNextSibling();
    }
    return (Element) startNode;
}

/*
 *  Find next sibling that is an element
 */
public Element getNextElementSibling( Node node )
{
    node = node.getNextSibling();
    while (node != null &&
        node.getNodeType() != Node.ELEMENT_NODE)
    {
        node = node.getNextSibling();
    }
    return (Element) node;
}
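The findLastItems method called from processContent is not listed in this section. The following is a guess at how such a method could work, not the program's actual code: it assumes that style names follow the P, L, and Sect pattern of footnote [5], that the "last node" of each kind is the last one in document order, and that matching on the name alone is good enough. The LastStyleFinder class, its scan method, and the sample markup with its placeholder namespace URIs are all hypothetical.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class LastStyleFinder
{
    /* Names such as P12, L3, or Sect7; see footnote [5] */
    private static final Pattern NAME =
        Pattern.compile("(P|L|Sect)(\\d{1,9})");

    /* Sample input; the namespace URIs are placeholders */
    static final String SAMPLE =
        "<office:automatic-styles xmlns:office='urn:example:office'"
        + " xmlns:style='urn:example:style'"
        + " xmlns:text='urn:example:text'>\n"
        + "  <style:style style:name='P1' style:family='paragraph'/>\n"
        + "  <style:style style:name='P2' style:family='paragraph'/>\n"
        + "  <style:style style:name='Sect1' style:family='section'/>\n"
        + "  <text:list-style style:name='L4'/>\n"
        + "</office:automatic-styles>";

    int  lastParaNumber = 0;
    int  lastListNumber = 0;
    int  lastSectionNumber = 0;
    Node lastParaNode = null;
    Node lastListNode = null;
    Node lastSectionNode = null;

    /* Record the largest number and the last node for each kind of style */
    void findLastItems(Node autoStyles)
    {
        for (Node n = autoStyles.getFirstChild(); n != null;
            n = n.getNextSibling())
        {
            if (n.getNodeType() != Node.ELEMENT_NODE)
            {
                continue;   /* skip whitespace text nodes and comments */
            }
            Matcher m = NAME.matcher(
                ((Element) n).getAttribute("style:name"));
            if (!m.matches())
            {
                continue;   /* not one of the automatic names */
            }
            int number = Integer.parseInt(m.group(2));
            switch (m.group(1))
            {
                case "P":
                    lastParaNumber = Math.max(lastParaNumber, number);
                    lastParaNode = n;
                    break;
                case "L":
                    lastListNumber = Math.max(lastListNumber, number);
                    lastListNode = n;
                    break;
                default:    /* "Sect" */
                    lastSectionNumber = Math.max(lastSectionNumber, number);
                    lastSectionNode = n;
                    break;
            }
        }
    }

    /* Parse the markup (not namespace-aware) and scan its root element */
    static LastStyleFinder scan(String xml)
    {
        try
        {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            LastStyleFinder finder = new LastStyleFinder();
            finder.findLastItems(doc.getDocumentElement());
            return finder;
        }
        catch (Exception e)
        {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        LastStyleFinder f = scan(SAMPLE);
        System.out.println(f.lastParaNumber + " " + f.lastListNumber
            + " " + f.lastSectionNumber);   /* prints "2 4 1" */
    }
}
```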

The next method, addSectionProperties, is more of a convenience method than anything else:

/*
 * Add the appropriate properties to make a single-column
 * section
 */
public void addSectionProperties(Element sectionStyle)
{
    Element properties;
    Element columns;

    properties = document.createElement("style:properties");
    properties.setAttribute("text:dont-balance-text-columns",
        "false");

    columns = document.createElement("style:columns");
    columns.setAttribute("fo:column-count", "0");
    columns.setAttribute("fo:column-gap", "0cm");

    properties.appendChild(columns);
    sectionStyle.appendChild(properties);
}

The addBullets method is also quite straightforward; it’s a simple loop to create the ten levels of bullet styles. All levels except the first have text:space-before. Ordinarily OpenOffice.org creates its bullet styles with the StarSymbol font, and references a style named Bullet Symbols in the styles.xml file. We are dispensing with that, so our new document will use the bullet symbol from the default font.

/*
 * Add the ten bullet styles to the <text:list-style> element. 
 */
public void addBullets(Element listLevelStyle)
{
    int     level;
    Element bullet;
    Element properties;
    for (level = 1; level <= 10; level++)
    {
        bullet = document.createElement("text:list-level-style-bullet");
        bullet.setAttribute("text:level", Integer.toString(level));
        bullet.setAttribute("text:bullet-char", "\u2022");

        properties = document.createElement("style:properties");
        if (level != 1)
        {
            properties.setAttribute("text:space-before",
                Double.toString((level-1) * 0.5) + "cm");
        }
        properties.setAttribute("text:min-label-width", "0.5cm");
        bullet.appendChild(properties);
        listLevelStyle.appendChild(bullet);
    }
}

Adding the headings is, conceptually, a recursive process, since each new level of heading opens a nested list. However, there is no guarantee that heading levels will increase and decrease sequentially; a level three heading can be followed by a level seven heading, followed by a level one heading. (This is not good document design, but it is certainly possible.) Thus, rather than write this method recursively, we decided to use an array to simulate a stack. Here’s the code:

/* Add headings to a section */
public void addHeadings(Element startElement)
{
    int currentLevel = 0;
    int headingLevel;
    int i;
    int level;
    Element ulist[] = new Element[10];
    Element listItem;
    Element paragraph;
    Text    textNode;

    for (i=0; i < headingElements.getLength(); i++)
    {
        headingLevel = Integer.parseInt(
            ((Element)headingElements.item(i)).getAttribute("text:level"));
        if (headingLevel > currentLevel)
        {
            for (level = currentLevel; level < headingLevel; level++)
            {
                ulist[level] = document.createElement(  // (1)
                    "text:unordered-list");
                if (level == 0)
                {
                    ulist[level].setAttribute("text:style-name",
                        "L" + lastListNumber);
                }
            }
            currentLevel = headingLevel;
        }
        else if (headingLevel < currentLevel)
        {
            closeLists(ulist, currentLevel, headingLevel); // (2)
            currentLevel = headingLevel;
        }

        /* Now append this heading as an item to current level */

        listItem = document.createElement("text:list-item");  // (3)
        paragraph = document.createElement("text:p");
        textNode = document.createTextNode("");
        textNode = accumulateText(
            (Element) headingElements.item(i),
            textNode);
        paragraph.appendChild(textNode);
        listItem.appendChild(paragraph);
        ulist[currentLevel-1].appendChild(listItem);

    }
    if (currentLevel != 1) // (4)
    {
        closeLists(ulist, currentLevel, 1);
    }
    startElement.appendChild(ulist[0]);
}   
1 We add levels by creating <text:unordered-list> elements. Only the first level unordered list has a text:style-name attribute.
2 The work of closing lists when the level decreases has been passed on to a separate routine.
3 No matter whether we have added levels, closed levels, or are at the same level, we have to add a <text:list-item> at the current level. The accumulateText method gathers all the text nodes in the heading.
4 Another call to the closeLists method closes any nested lists.

Here is the closeLists method. Each <text:unordered-list> that is being closed is within a <text:list-item> element of its parent list.


    /*
     * Join elements in the ulist[] array to close all open lists
     * from currentLevel back down to newLevel
     */
    public void closeLists(Element ulist[], int currentLevel, int newLevel)
    {
        int i;
        Element listItem;
        for (i = currentLevel-1; i > newLevel-1; i--)
        {
            listItem = document.createElement("text:list-item");
            listItem.appendChild(ulist[i]);
            ulist[i-1].appendChild(listItem);
        }
    }
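To see how the array-as-stack approach nests the lists, it can help to run the same logic on a plain array of heading levels. The sketch below is a simplified stand-in, not the program's code: the OutlineNestingDemo class and its outline and render methods are hypothetical, the short element names ul and li stand in for <text:unordered-list> and <text:list-item>, the list item holds its text directly rather than in a <text:p>, and the levels are assumed to lie between 1 and 10.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class OutlineNestingDemo
{
    /* Same idea as closeLists: wrap each open list in an item of its parent */
    static void closeLists(Document doc, Element[] ulist,
        int currentLevel, int newLevel)
    {
        for (int i = currentLevel - 1; i > newLevel - 1; i--)
        {
            Element listItem = doc.createElement("li");
            listItem.appendChild(ulist[i]);
            ulist[i - 1].appendChild(listItem);
        }
    }

    /* Build the nested list for the given heading levels and titles */
    static String outline(int[] levels, String[] titles)
    {
        if (levels.length == 0)
        {
            return "";
        }
        Document doc;
        try
        {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        }
        catch (ParserConfigurationException e)
        {
            throw new IllegalStateException(e);
        }

        Element[] ulist = new Element[10];  /* one slot per open level */
        int currentLevel = 0;
        for (int i = 0; i < levels.length; i++)
        {
            int headingLevel = levels[i];
            if (headingLevel > currentLevel)
            {
                /* open every level between the old and the new one */
                for (int level = currentLevel; level < headingLevel; level++)
                {
                    ulist[level] = doc.createElement("ul");
                }
            }
            else if (headingLevel < currentLevel)
            {
                closeLists(doc, ulist, currentLevel, headingLevel);
            }
            currentLevel = headingLevel;

            Element item = doc.createElement("li");
            item.appendChild(doc.createTextNode(titles[i]));
            ulist[currentLevel - 1].appendChild(item);
        }
        closeLists(doc, ulist, currentLevel, 1);
        return render(ulist[0]);
    }

    /* Write the tree as compact markup so the nesting is visible */
    static String render(Node node)
    {
        if (node.getNodeType() == Node.TEXT_NODE)
        {
            return node.getNodeValue();
        }
        StringBuilder sb = new StringBuilder("<" + node.getNodeName() + ">");
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling())
        {
            sb.append(render(c));
        }
        return sb.append("</").append(node.getNodeName()).append(">")
            .toString();
    }

    public static void main(String[] args)
    {
        System.out.println(outline(new int[] {1, 2, 2, 1},
            new String[] {"A", "B", "C", "D"}));
    }
}
```

For the levels 1, 2, 2, 1 this prints <ul><li>A</li><li><ul><li>B</li><li>C</li></ul></li><li>D</li></ul>: the two level-two headings end up in a nested list that is wrapped in its own item of the top-level list, which is exactly the job closeLists performs.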

Finally, here is the accumulateText method, which recursively visits all the child nodes of the <text:h> element in question. Tabs and line breaks are replaced with a single blank each; for any other element, the markup is dropped but the text inside it is still gathered.

/*
 * Return a Text node that contains all the accumulated text
 * from the child nodes of the given element.
 */
public Text accumulateText(Element element, Text text)
{
    Node node = element.getFirstChild();
    while (node != null)
    {
        if (node.getNodeType() == Node.TEXT_NODE)
        {
            text.appendData(((Text) node).getData());
        }
        else if (node.getNodeType() == Node.ELEMENT_NODE)
        {
            if (((Element) node).getTagName().equals("text:tab-stop") ||
                ((Element) node).getTagName().equals("text:line-break"))
            {
                text.appendData(" ");
            }
            else
            {
                text = accumulateText((Element) node, text);
            }
        }
        node = node.getNextSibling();
    }
    return text;
}

This covers all the methods used to process the document tree. All that remains is the method to serialize the document tree, writing it to the new document’s content.xml JAR file entry.

public void writeContent()
{
    if (document == null)
    {
        return;
    }
    PrintWriter out = null;
    try
    {
        out =
        new PrintWriter(new OutputStreamWriter(outJar, "UTF-8"));  // (1)
    }
    catch (Exception e)
    {
        System.out.println("Error creating output stream");
        System.out.println(e.getMessage());
        System.exit(1);
    }
    OutputFormat oFormat = new OutputFormat("xml", "UTF-8", false); // (2)
    XMLSerializer serial = new XMLSerializer(out, oFormat); // (3)
    try
    {
        serial.serialize(document);
    }
    catch (java.io.IOException e)
    {
        System.out.println(e.getMessage());
    }
}
1 First, construct an output stream with your favorite encoding method.
2 A serializer requires an output format. This constructor’s three parameters are the output method (which is normally one of "xml", "html", or "text"); the character encoding, which should be "UTF-8" to keep your international clients happy; and a boolean that tells whether the output should be indented or not. For OpenOffice.org documents, this should be set to false to avoid unwanted whitespace text nodes.
3 The OutputFormat is used when creating the serializer.
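The org.apache.xml.serialize classes were deprecated in later Xerces releases. If they are not available, the standard JAXP identity transformer can do the same job. The following is a sketch under that assumption, not a drop-in replacement for writeContent: the JaxpSerializeDemo class and its write and demo methods are hypothetical, and the demonstration uses an unprefixed element name, so the handling of prefixed names in a document parsed without namespace awareness should be checked before relying on it.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

class JaxpSerializeDemo
{
    /* Serialize a DOM tree as unindented UTF-8 XML */
    static void write(Document document, OutputStream out)
    {
        try
        {
            /* newTransformer() with no stylesheet is an identity transform */
            Transformer identity =
                TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.METHOD, "xml");
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            /* no indenting, to avoid unwanted whitespace text nodes */
            identity.setOutputProperty(OutputKeys.INDENT, "no");
            identity.transform(new DOMSource(document),
                new StreamResult(out));
        }
        catch (Exception e)
        {
            throw new IllegalStateException(e);
        }
    }

    /* Serialize a one-element document containing a non-ASCII character */
    static String demo()
    {
        Document doc;
        try
        {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        }
        catch (Exception e)
        {
            throw new IllegalStateException(e);
        }
        Element p = doc.createElement("p");
        p.appendChild(doc.createTextNode("caf\u00e9"));
        doc.appendChild(p);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        write(doc, bytes);
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        System.out.println(demo());
    }
}
```

In AddOutline, the OutputStream passed to a method like write would be outJar, after putNextEntry has been called for content.xml.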


[3] fo:line-height, style:line-height-at-least, and style:line-spacing are mutually exclusive.

[4] A twip is one twentieth of a point, so there are 1440 twips per inch.

[5] The style names are in the format P, L, and Sect followed by an integer; this is the “number” we are referring to.


Content licensed under a Creative Commons License.
All content is copyright O’Reilly & Associates, Inc.
During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation’s GNU Free Documentation License.