OpenDocumentTextInputStream.java

See API documentation

Suppose you’re writing an application that scans text files and reports word and character counts. If you are only reading plain text files, there’s no problem: open a FileReader, or an InputStreamReader on a FileInputStream, and you’re ready to go.

But what if the text you want to analyze is inside a file that was written in Open Document Format? You have two problems to solve:

  1. The document text lives in a file named content.xml, which is packed inside a .zip archive.
  2. The content.xml file is filled with all sorts of pesky XML tags that you don’t want to process.

The first problem is easily solved: open a ZipInputStream on a FileInputStream and advance to the content.xml entry. The second problem, extracting the raw text, remains.
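
As a minimal sketch of that first step, the following uses only the standard java.util.zip classes. The class name ContentXmlLocator and the method openContentXml are illustrative names, not part of ODF Utilities.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ContentXmlLocator {
    /**
     * Wraps the given stream in a ZipInputStream and advances it to the
     * content.xml entry. Reads on the returned stream then yield the bytes
     * of content.xml and report end-of-stream when that entry ends.
     */
    public static ZipInputStream openContentXml(InputStream in) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.getName().equals("content.xml")) {
                return zip;
            }
        }
        zip.close();
        throw new IOException("No content.xml entry found");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ContentXmlLocator file.odt");
            return;
        }
        try (ZipInputStream zip = openContentXml(new FileInputStream(args[0]))) {
            byte[] buffer = new byte[4096];
            long total = 0;
            int n;
            // Counts raw bytes, XML tags included; the tags are the second problem.
            while ((n = zip.read(buffer)) != -1) {
                total += n;
            }
            System.out.println("content.xml is " + total + " bytes");
        }
    }
}
```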

The Solution

OpenDocumentTextInputStream is a class that extends FilterInputStream. If you open an OpenDocumentTextInputStream on the ZipInputStream, you will get only the bytes that are inside <text:p> and <text:h> elements, unless they are inside a <text:tracked-changes> element.
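
To make the filtering idea concrete, here is a simplified, self-contained sketch of a FilterInputStream that behaves roughly as described. It is not the source of OpenDocumentTextInputStream: the class name ParagraphTextSketch is made up, the element names are hard-coded with their text: prefix, and it assumes well-formed input with no comments, CDATA sections, or '>' characters inside attribute values. It also does not decode entities such as &amp;amp;.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ParagraphTextSketch extends FilterInputStream {
    private int captureDepth = 0; // nesting depth inside text:p or text:h
    private int ignoreDepth = 0;  // nesting depth inside text:tracked-changes

    public ParagraphTextSketch(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b;
        while ((b = in.read()) != -1) {
            if (b == '<') {
                handleTag(readTag());
            } else if (captureDepth > 0 && ignoreDepth == 0) {
                return b; // text byte inside a captured element
            }
        }
        return -1;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            int b = read();
            if (b == -1) {
                break;
            }
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }

    /** Reads everything up to, but not including, the closing '>'. */
    private String readTag() throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && b != '>') {
            sb.append((char) b);
        }
        return sb.toString();
    }

    /** Updates the depth counters for an opening or closing tag. */
    private void handleTag(String tag) {
        if (tag.isEmpty() || tag.startsWith("?") || tag.startsWith("!")) {
            return; // XML declaration or DOCTYPE
        }
        if (tag.endsWith("/")) {
            return; // empty element such as <text:s/> changes no depth
        }
        boolean closing = tag.startsWith("/");
        String name = closing ? tag.substring(1) : tag;
        int end = 0;
        while (end < name.length() && !Character.isWhitespace(name.charAt(end))) {
            end++;
        }
        name = name.substring(0, end); // drop attributes
        int delta = closing ? -1 : 1;
        if (name.equals("text:p") || name.equals("text:h")) {
            captureDepth += delta;
        } else if (name.equals("text:tracked-changes")) {
            ignoreDepth += delta;
        }
    }
}
```

Tracking depth with counters rather than booleans keeps the sketch correct when captured elements nest, for example a paragraph inside a table cell that itself sits inside another paragraph's frame.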

Why Those Elements?

If you look at the RELAX NG schema for Open Document, you will see that the text content eventually has to end up in either a <text:p> or a <text:h>; this includes the contents of table cells. You want to ignore those elements inside <text:tracked-changes> so that you don’t process text that has been marked as deleted.

Changing The Elements

OpenDocumentTextInputStream has a constructor that lets you specify an ArrayList of elements you want to capture text from and an ArrayList of elements you want to ignore. Each ArrayList contains ElementPostProcess objects.

The ElementPostProcess object has two fields: the name of the element (without a namespace prefix), and postProcess, a one-byte character that the input stream will emit after processing that element. This is how the input stream can reasonably handle empty elements like <text:tab> and <text:s>. If you don’t want any character emitted after handling an element, set postProcess to '\0'.
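
The pairing of an element name with a post-process character can be pictured with the sketch below. The nested Rule class is a hypothetical stand-in for ElementPostProcess, written only to mirror the two fields described above; the real class's constructor and accessors may differ, so check the API documentation before relying on any of these names. The specific characters chosen (newline after paragraphs and headings, tab after tab, space after s) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class PostProcessSketch {
    /** Hypothetical stand-in for ElementPostProcess: element name plus the character emitted afterwards. */
    public static final class Rule {
        public final String name;      // element name without namespace prefix
        public final char postProcess; // '\0' means "emit nothing"

        public Rule(String name, char postProcess) {
            this.name = name;
            this.postProcess = postProcess;
        }
    }

    /** Builds an example capture list; the character choices are assumptions. */
    public static ArrayList<Rule> captureRules() {
        ArrayList<Rule> rules = new ArrayList<>();
        rules.add(new Rule("p", '\n'));   // end each paragraph with a newline
        rules.add(new Rule("h", '\n'));   // end each heading with a newline
        rules.add(new Rule("tab", '\t')); // empty element: stands for a tab
        rules.add(new Rule("s", ' '));    // empty element: stands for a space
        rules.add(new Rule("a", '\0'));   // hyperlink: nothing emitted afterwards
        return rules;
    }

    /** Returns the byte to emit after the named element, or -1 if nothing should be emitted. */
    public static int byteAfter(List<Rule> rules, String localName) {
        for (Rule r : rules) {
            if (r.name.equals(localName)) {
                return r.postProcess == '\0' ? -1 : r.postProcess;
            }
        }
        return -1; // element not in the list
    }
}
```

With rules like these, two adjacent paragraphs come out separated by a newline instead of running together, which matters for a word count.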