Chapter 2. The meta.xml, styles.xml, settings.xml, and content.xml Files

Though content.xml is king, monarchs rule better when surrounded by able assistants. In an OpenOffice.org document, these assistants are the meta.xml, styles.xml, and settings.xml files. In this chapter, we will examine the assistant files, and then describe the general structure of the content.xml file.

Note that none of these files is actually necessary; if you create only a content.xml file that contains word processor elements and zip it up, OpenOffice.org will open it successfully. The result will be a plain-text document with no styles. You won’t have any of the meta-information about who created the file or when it was last edited, and the printer settings, view area, and zoom factor will be set to the OpenOffice.org defaults.
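As a rough sketch of that claim, the following script builds such a bare-bones document with the Archive::Zip module. The contents of content.xml shown here, including the two namespace URIs, and the output name minimal.sxw are our assumptions for illustration; a file saved by OpenOffice.org declares many more namespaces.

```perl
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES);

# Assumed minimal content.xml for a word processing document;
# the namespace URIs and the paragraph text are illustrative.
my $content = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text"
    office:class="text">
  <office:body>
    <text:p>Hello, world.</text:p>
  </office:body>
</office:document-content>
XML

# Store the string as a member named content.xml and write the archive.
# The file name minimal.sxw is hypothetical.
my $zip = Archive::Zip->new();
$zip->addString($content, 'content.xml');
$zip->writeToFileNamed('minimal.sxw') == AZ_OK
    or die "Could not write minimal.sxw\n";
```

The archive contains a single member and nothing else, which is the situation described above.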

The meta.xml file contains information about the document itself, and this can be generally useful. We’ll look at the elements found in this file in decreasing order of importance; at the end of this section, we will list them in the order in which they appear in a document. Most of these elements are reflected in the tabs on the File/Properties dialog, which are shown in Figure 2.1, “General Document Properties”, Figure 2.2, “Document Description”, Figure 2.3, “User-defined Information”, and Figure 2.4, “Document Statistics”.

All of the elements borrowed from the Dublin Core namespace contain text and have no attributes. Table 2.1, “Dublin Core Elements in meta.xml” summarizes them.

Table 2.1. Dublin Core Elements in meta.xml

Element

Description

Sample from XML file

<dc:title>

The document title; this appears in the title bar.

<dc:title>An Introduction to Digital Cameras</dc:title>

<dc:subject>

The Dublin Core recommends that this element contain keywords or key phrases to describe the topic of the document; OpenOffice.org keeps keywords in a separate set of elements.

<dc:subject>Digital Photography</dc:subject>

<dc:description>

This element’s content is shown in the Comments field in the dialog box.

<dc:description>This introduction…</dc:description>

<dc:creator>

This element’s content is shown in the Modified field in Figure 2.1, “General Document Properties”; it names the last person to edit the file. This may appear odd, but the Dublin Core says that the creator is simply an “entity primarily responsible for making the content of the resource.” That is not necessarily the original creator, whose name is stored in a different element.

<dc:creator>Steven L. Eisenberg</dc:creator>

<dc:date>

This element’s content is also shown in the Modified field in Figure 2.1, “General Document Properties”. It is stored in a form compatible with ISO-8601. The time is shown in local time. See the section called “Time and Duration Formats” for details about times and dates.

<dc:date>2003-06-30T22:39:05</dc:date>

<dc:language>

The document’s language, written as a two- or three-letter main language code followed by a two-letter sublanguage code. This field is not shown in the properties dialog, but is found in OpenOffice.org’s Tools/Options/Language Settings dialog.

<dc:language>en-US</dc:language>

The remaining elements in the meta.xml file come from OpenOffice.org’s meta namespace. Table 2.2, “OpenOffice.org Elements in meta.xml” describes these elements in the order in which they appear in the file.

Table 2.2. OpenOffice.org Elements in meta.xml

Element

Description

Sample from XML file

<meta:generator>

The program that created this document. According to the specification, you should not “fake” being OpenOffice.org if you are creating the document using a different program; you should use a unique identifier.

<meta:generator>OpenOffice.org 1.1 (Linux)</meta:generator>

<meta:initial-creator>

The user who created the document. This is shown in the “Created:” area in Figure 2.1, “General Document Properties”.

<meta:initial-creator>J David Eisenberg</meta:initial-creator>

<meta:creation-date>

The date and time when the document was created. This is shown in the “Created:” area in Figure 2.1, “General Document Properties”. It is in the same format as described in the section called “Time and Duration Formats”.

<meta:creation-date>2003-06-04T14:53:55</meta:creation-date>

<meta:keywords>

This element contains one or more <meta:keyword> elements. These elements reflect the entries in the “Keywords:” area in Figure 2.2, “Document Description”.

<meta:keywords>
    <meta:keyword>photography</meta:keyword>
    <meta:keyword>cameras</meta:keyword>
    <meta:keyword>optics</meta:keyword>
    <meta:keyword>digital cameras</meta:keyword>
</meta:keywords>

<meta:editing-cycles>

This element tells how many times the file has been edited; this is the “Document Number:” in Figure 2.1, “General Document Properties”.

<meta:editing-cycles>15</meta:editing-cycles>

<meta:editing-duration>

This element tells the total amount of time that has been spent editing the document in all editing sessions; this is the “Editing time:” in Figure 2.1, “General Document Properties”, and is represented as described in the section called “Time and Duration Formats”.

<meta:editing-duration>PT1H36M12S</meta:editing-duration>

<meta:user-defined>

OpenOffice.org allows you to define your own information, as shown in Figure 2.3, “User-defined Information”. This element has a meta:name attribute, giving the “title” of this information, and the content of the element is the information itself.

<meta:user-defined meta:name="Maximum Length">3 pages or
750 words</meta:user-defined>

<meta:document-statistic>

This is the information shown on the statistics tab of the properties dialog (see Figure 2.4, “Document Statistics”). This element has attributes whose names are largely self-explanatory, and are listed in Table 2.3, “Attributes of the <meta:document-statistic> Element”.

<meta:document-statistic meta:paragraph-count="4"…/>
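To give a feel for working with a value such as the <meta:editing-duration> sample in the table, here is a small sketch that converts a duration into seconds. The subroutine name duration_to_seconds is ours, and the sketch assumes that only the hours, minutes, and seconds parts are present, as in the sample; it makes no attempt to handle days or longer units.

```perl
use strict;
use warnings;

# Convert a duration such as "PT1H36M12S" into a number of seconds.
# Only the H, M, and S parts are recognized (an assumption based on
# the sample value); anything else yields undef.
sub duration_to_seconds
{
    my $duration = shift;
    my ($hr, $min, $sec) =
        $duration =~ m/^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$/
        or return;
    return ($hr || 0) * 3600 + ($min || 0) * 60 + ($sec || 0);
}

print duration_to_seconds("PT1H36M12S"), " seconds\n";
```

For the sample value, one hour, 36 minutes, and 12 seconds add up to 5772 seconds.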

Now that we know what the format of the meta file is, let’s construct a Perl program to extract that information. Again, rather than reinvent the wheel, we will use two existing modules from the Comprehensive Perl Archive Network, CPAN (http://www.cpan.org/). The first of these, Archive::Zip::MemberRead, will let us read the meta.xml file directly from the compressed OpenOffice.org document. We will use the XML::Simple module to do the main work of the extraction program.

The program that actually does the extraction, Example 2.3, “Program show_meta.pl”, takes one argument: the OpenOffice.org document filename. The program receives its input from the piped output of member_read.pl.

After the file is parsed, the program prints the data. Information in the <meta:document-statistic> element is selected depending upon the type of document being parsed. The program also uses the Text::Wrap module to format the description, which may be several lines long.

Example 2.3. Program show_meta.pl

#!/usr/bin/perl

#
#   Show meta-information in an OpenOffice.org document.
#
use XML::Simple;
use IO::File;
use Text::Wrap;
use Carp;
use strict;

my $suffix;     # file suffix

#
#   Check for one argument: the name of the OpenOffice.org document
#
if (scalar @ARGV != 1)
{
    croak("Usage: $0 document");
}

#
#   Get file suffix for later reference
#
($suffix) = $ARGV[0] =~ m/\.(\w\w\w)$/;

#
#   Parse and collect information into the $meta hash reference
#
$ARGV[0] =~ s/[;|'"`\$&<>\n]//g;  # eliminate common dangerous shell metacharacters
my $fh = IO::File->new("perl member_read.pl $ARGV[0] meta.xml |");
my $xml= XMLin( $fh, forcearray => [ 'meta:keyword'] );
my $meta= $xml->{'office:meta'};

#
#   Output phase
#
print "Title:       $meta->{'dc:title'}\n"
    if ($meta->{'dc:title'});
print "Subject:     $meta->{'dc:subject'}\n"
    if ($meta->{'dc:subject'});

if ($meta->{'dc:description'})
{
    print "Description:\n";
    $Text::Wrap::columns = 60;
    print wrap("\t", "\t", $meta->{'dc:description'}), "\n";
}

print "Created:     ";
print format_date($meta->{'meta:creation-date'});
print " by $meta->{'meta:initial-creator'}"
    if ($meta->{'meta:initial-creator'});
print "\n";

print "Last edit:   ";
print format_date($meta->{"dc:date"});
print " by $meta->{'dc:creator'}"
    if ($meta->{'dc:creator'});
print "\n";

# Display keywords (which all appear to be in a single element)
#
print "Keywords:    ", join( ' - ',
  @{$meta->{'meta:keywords'}->{'meta:keyword'}}), "\n"
    if( $meta->{'meta:keywords'});

#
#   Take attributes from the meta:document-statistic element
#   (if any) and put them into the $statistics hash reference
#
my $statistics= $meta->{'meta:document-statistic'};
if ($suffix eq "sxw")
{
    print "Pages:       $statistics->{'meta:page-count'}\n";
    print "Words:       $statistics->{'meta:word-count'}\n";
    print "Tables:      $statistics->{'meta:table-count'}\n";
    print "Images:      $statistics->{'meta:image-count'}\n";
}
elsif ($suffix eq "sxc")
{
    print "Sheets:      $statistics->{'meta:table-count'}\n";
    print "Cells:       $statistics->{'meta:cell-count'}\n"
        if ($statistics->{'meta:cell-count'});
}


#
#   A convenience subroutine to make dates look
#   prettier than ISO-8601 format.
#
sub format_date
{
    my $date = shift;
    my @monthlist = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);

    # Return a placeholder if the date is missing or malformed.
    return "(unknown)" unless defined $date;
    my ($year, $month, $day, $hr, $min, $sec) =
        $date =~ m/(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})/
        or return "(unknown)";
    return "$hr:$min on $day $monthlist[$month-1] $year";
}   

These two lines from the preceding program are where all the parsing takes place:

my $fh = IO::File->new("perl member_read.pl $ARGV[0] meta.xml |");
my $xml= XMLin( $fh, forcearray => [ 'meta:keyword'] );

In the first line, we used IO::File->new, because our version of Perl wouldn’t read from a file handle opened with the standard Perl open. In the second line, the forcearray parameter will force the content of the <meta:keyword> element to be an array type, even if there is only one element. This avoids scalar vs. array problems.
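The effect of forcearray is easiest to see on a document with exactly one keyword. The following sketch parses a hypothetical fragment twice, once without the option and once with it; the fragment and its namespace URI are our assumptions for illustration.

```perl
use strict;
use warnings;
use XML::Simple;

# A hypothetical fragment with a single keyword; the namespace is
# declared so that namespace-aware parsers will accept the prefix.
my $fragment = <<'XML';
<meta:keywords xmlns:meta="http://openoffice.org/2000/meta">
    <meta:keyword>photography</meta:keyword>
</meta:keywords>
XML

my $plain  = XMLin($fragment);
my $forced = XMLin($fragment, forcearray => ['meta:keyword']);

# Without forcearray, a lone keyword comes back as a plain string...
print "plain:  ", (ref($plain->{'meta:keyword'}) || 'scalar'), "\n";
# ...with forcearray, it comes back as an array reference.
print "forced: ", ref($forced->{'meta:keyword'}), "\n";
```

Because the join in show_meta.pl dereferences the keywords as an array, the plain form would fail for a document with a single keyword; the forced form works for one keyword or many.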

While XML::Simple is the easiest way to accomplish this task, it is not the most flexible way to parse XML. For more general XML parsing, you probably want to use the XML::SAX module. The section called “Showing Meta-information Using SAX” shows this same program written with the XML::SAX module.

The styles.xml file contains information about the styles that are used in the document. Some of this information is also duplicated in the content.xml document.

File styles.xml begins with an <office:document-styles> element, which contains font declarations (<office:font-decls>), default and named styles (<office:styles>), “automatic,” or unnamed styles (<office:automatic-styles>), and master styles (<office:master-styles>). All of these elements are optional.

The <office:styles> element is a container for (among other things) default styles and named styles from OpenOffice.org’s Stylist tool. A spreadsheet’s <office:styles> element will also contain information about styles for numbers, currency, percentage values, dates, times, and boolean data. A drawing will have information about default gradients, hatch patterns, fill images, markers, and dash patterns for drawing lines.

The most important elements that you will find within <office:styles> are <style:default-style> and <style:style>. These elements both contain a style:family attribute which tells what “level” the style applies to. The possible values of this required attribute are: text (character level), paragraph, section, table, table-column, table-row, table-cell, table-page, chart, graphics, default, drawing-page, presentation, control, and ruby.[1]

Both <style:default-style> and <style:style> have a style:name attribute. Styles built into the Stylist, or ones that you create there, will have names like Heading 1 or Custom Citation. Automatic styles will have names consisting of a one- or two-letter abbreviation followed by a number; a style name such as T1 is the first automatic style for style:family="text"; P3 would be the third style for paragraphs, ta2 would be the second style for a table, ro4 would be the fourth style for a table row, etc.
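The naming pattern for automatic styles can be expressed as a regular expression. The sketch below is only a heuristic under that assumption: nothing prevents a user from giving a named style a name of the same form, and the subroutine name looks_automatic is ours.

```perl
use strict;
use warnings;

# Guess whether a style name looks like an automatic style:
# one or two letters followed by a number (T1, P3, ta2, ro4).
# This tests the form of the name only.
sub looks_automatic
{
    my $name = shift;
    return $name =~ m/^[A-Za-z]{1,2}\d+$/ ? 1 : 0;
}

foreach my $name ("T1", "P3", "ta2", "ro4", "Heading 1", "Custom Citation")
{
    printf "%-16s %s\n", $name,
        looks_automatic($name) ? "automatic" : "named";
}
```

The first four names are reported as automatic, and the two Stylist names as named.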

The other attribute of interest is the optional style:parent-style-name attribute, which you will find in styles that have been derived from other styles. In a text document, OpenOffice.org will often create a temporary style whose parent is the style found in the styles.xml file.
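If you collect each style’s name and parent into a hash, you can follow the chain of derivation from a style to its ancestors. The following is a minimal sketch; the style names in %parent_of and the subroutine style_chain are hypothetical, chosen only to show the idea.

```perl
use strict;
use warnings;

# Hypothetical style table: style name => parent style name.
my %parent_of = (
    'P1'        => 'Text body',
    'Text body' => 'Standard',
);

# Return the list of styles from $name up to its topmost ancestor.
# The %seen hash stops the walk if the table ever contains a loop.
sub style_chain
{
    my ($name, $parents) = @_;
    my (@chain, %seen);
    while (defined $name && !$seen{$name}++)
    {
        push @chain, $name;
        $name = $parents->{$name};
    }
    return @chain;
}

print join(" -> ", style_chain('P1', \%parent_of)), "\n";
```

With the table above, this prints P1 -> Text body -> Standard.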

Within each <style:style> or <style:default-style>, you will find the <style:properties> element, which describes the style in minute detail via an immense[2] number of attributes. A full discussion of styles is beyond the scope of this book, so we will simply give you an idea of the range of style specifications, and take up specific details of styles when they are relevant in other chapters. Example 2.4, “Style Definition in a Word Processing Document”, Example 2.5, “Style Definition in a Spreadsheet Document”, and Example 2.6, “Style Definition in a Drawing Document” are excerpts from the styles.xml files in a word processing, spreadsheet, and drawing document, respectively.

Although the details of the content.xml vary widely depending upon the type of document you are dealing with, there are elements which are common to all content.xml files. The root element is the <office:document-content> element. It defines all the namespaces that will be used throughout the document, and, most important, has the office:class attribute, which tells you what kind of document you have. The possible values for this attribute are text, text-global, drawing, presentation, spreadsheet, and chart. The office:version attribute tells you which version of OpenOffice.org created the document.
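As a quick illustration, the sketch below pulls the office:class and office:version values out of a root element’s start tag. The tag shown is hypothetical (a real file declares many more namespaces, and the attribute values are made up), and the regular expression is a shortcut for illustration, not a substitute for a real XML parser.

```perl
use strict;
use warnings;

# Hypothetical opening tag of a content.xml file.
my $root = '<office:document-content'
    . ' xmlns:office="http://openoffice.org/2000/office"'
    . ' office:class="spreadsheet" office:version="1.0">';

# Pull one double-quoted attribute value out of a start tag.
# Returns undef if the attribute is not present.
sub attribute_value
{
    my ($tag, $attr) = @_;
    my ($value) = $tag =~ m/\s\Q$attr\E\s*=\s*"([^"]*)"/;
    return $value;
}

print "Document class:   ", attribute_value($root, 'office:class'), "\n";
print "Document version: ", attribute_value($root, 'office:version'), "\n";
```

A program that handles several kinds of documents could use the class value to decide how to process the <office:body> element.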

The following elements are contained within the <office:document-content> element. The optional <office:meta> and <office:settings> elements, usually absent in normal documents, can contain the same information and structure as found in the meta.xml and settings.xml files. The optional <office:script> element does appear in most documents and is always empty, even if your document contains macros. Go figure.

The <office:script> is followed by elements that describe the document’s presentation. The optional <office:font-decls> element describes fonts used in your document, and duplicates the information found in styles.xml. The optional <office:styles> element appears to be unused. If you have defined any styles “on the fly,” then these automatic styles are described in the optional <office:automatic-styles> element. This is followed by the optional <office:master-styles> element, which is absent in most documents.

The last child element of <office:document-content> is the required, and all-important, <office:body> element. This is where all the action is, and we will spend much of the rest of this book examining its contents. Example 2.7, “Structure of the content.xml file” shows the skeleton for an OpenOffice.org document’s content.xml file.



[1] Ruby refers to “furigana,” which are small Japanese alphabetic characters placed near the Japanese ideograms to aid readers in determining their correct meaning.

[2] As of this writing, “immense” is defined as 514.


Content licensed under a Creative Commons License.
All content is copyright O’Reilly & Associates, Inc.
During development, I give permission for non-commercial copying for educational and review purposes. After publication, all text will be released under the Free Software Foundation’s GNU Free Documentation License.