OpenDocumentTextInputStream.java
Suppose you’re writing an application that scans text files and
reports word and character counts. If you are just reading plain text files,
there’s no problem: open up a
FileReader or an InputStreamReader on a
FileInputStream, and you’re ready to go.
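For the plain-text case, a minimal sketch of such a counter might look like the following. The class name PlainTextCounter and the method count are hypothetical names chosen for illustration. "Characters" here means UTF-16 code units as returned by Reader.read(), and a "word" is assumed to be any run of non-whitespace characters.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;

public class PlainTextCounter {
    /** Returns a two-element array: {wordCount, charCount}. */
    static long[] count(Reader in) throws IOException {
        long words = 0;
        long chars = 0;
        boolean inWord = false;
        BufferedReader reader = new BufferedReader(in);
        int c;
        while ((c = reader.read()) != -1) {
            chars++;
            if (Character.isWhitespace(c)) {
                inWord = false;          // whitespace ends the current word
            } else if (!inWord) {
                inWord = true;           // first non-whitespace char starts a word
                words++;
            }
        }
        return new long[] {words, chars};
    }

    public static void main(String[] args) throws IOException {
        // FileReader decodes with the platform default charset.
        try (Reader in = new FileReader(args[0])) {
            long[] result = count(in);
            System.out.println(result[0] + " words, " + result[1] + " characters");
        }
    }
}
```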
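But what if the text you want to analyze is inside a file that was written in Open Document Format? You have two problems to solve: getting to the document’s content, which is stored as the content.xml entry of a zipped archive, and extracting the raw text from that XML.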
The first problem is easily solved; open up a
ZipInputStream on a
FileInputStream and advance to the content.xml entry.
The second problem, extracting the raw text, still remains.
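The first step can be sketched roughly as follows, using only the standard java.util.zip classes. The class name ContentXmlLocator and the helper seekToContentXml are hypothetical names for this illustration. After the helper returns true, reads on the ZipInputStream deliver the bytes of content.xml and report end-of-stream at the end of that entry.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ContentXmlLocator {
    /**
     * Advances the stream to the content.xml entry.
     * Returns true if the entry was found, false if the archive has none.
     */
    static boolean seekToContentXml(ZipInputStream zin) throws IOException {
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.getName().equals("content.xml")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new FileInputStream(args[0]))) {
            if (!seekToContentXml(zin)) {
                System.err.println("No content.xml entry found");
                return;
            }
            // The stream is now positioned at the start of content.xml;
            // here we just count its (still marked-up) bytes.
            long bytes = 0;
            while (zin.read() != -1) {
                bytes++;
            }
            System.out.println("content.xml is " + bytes + " bytes of XML");
        }
    }
}
```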
OpenDocumentTextInputStream is a class that extends
FilterInputStream.
If you open an OpenDocumentTextInputStream on the
ZipInputStream, you will get only the bytes that are inside
<text:p> and <text:h> elements,
unless they are inside a
<text:tracked-changes> element.
If you look at the Relax NG specification for Open Document, you will see that
everything important eventually has to end up in either a
<text:p> or <text:h>; this includes the
contents of table cells. You want to ignore those elements inside of
<text:tracked-changes> so that you don’t process
text that has been marked as deleted.
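To make the capture-and-ignore rule concrete, here is a small sketch that applies the same rule with a SAX parser. It is not the implementation of OpenDocumentTextInputStream, which works as a byte-level FilterInputStream; the class name TextFilterSketch, the method extractText, and the choice to append a newline after each captured element are assumptions made for illustration only.

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class TextFilterSketch {
    // Namespace URI bound to the "text" prefix in OpenDocument files.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    static boolean isText(String uri, String local, String wanted) {
        return TEXT_NS.equals(uri) && wanted.equals(local);
    }

    /** Returns the text inside text:p and text:h, skipping text:tracked-changes. */
    static String extractText(InputStream xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            int captureDepth = 0;   // how many open text:p / text:h elements
            int ignoreDepth = 0;    // how many open text:tracked-changes elements

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (isText(uri, local, "tracked-changes")) {
                    ignoreDepth++;
                } else if (isText(uri, local, "p") || isText(uri, local, "h")) {
                    captureDepth++;
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (isText(uri, local, "tracked-changes")) {
                    ignoreDepth--;
                } else if (isText(uri, local, "p") || isText(uri, local, "h")) {
                    captureDepth--;
                    if (ignoreDepth == 0) {
                        out.append('\n');   // assumed separator between captured elements
                    }
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (captureDepth > 0 && ignoreDepth == 0) {
                    out.append(ch, start, length);
                }
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // match on namespace URI and local name
        factory.newSAXParser().parse(xml, handler);
        return out.toString();
    }
}
```

Matching on the namespace URI and local name, rather than on the literal prefix text:, keeps the sketch working even if a document binds that namespace to a different prefix.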
OpenDocumentTextInputStream has a constructor that lets you
specify an ArrayList of elements you want to capture text from
and an ArrayList of elements you want to ignore. Each of
these lists contains
ElementPostProcess objects.
An ElementPostProcess object has two fields:
the name of the element (without a namespace prefix),
and postProcess, a one-byte character that the input stream
emits after processing that element. This is how
the input stream can reasonably handle empty elements like
<text:tab> and <text:s>. If you
don’t want any character emitted after handling an element,
set postProcess to '\0'.
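As an illustration of how such lists might be put together, the sketch below defines a hypothetical stand-in for ElementPostProcess with the two fields described above. The constructor signature, the field name "name", the helper methods, and the specific characters chosen (newline after paragraphs and headings, a tab for tab, a space for s) are all assumptions; consult the actual source for the real class and its constructor.

```java
import java.util.ArrayList;

public class PostProcessSketch {
    /** Hypothetical stand-in for the real ElementPostProcess class. */
    static class ElementPostProcess {
        final String name;        // element name without namespace prefix
        final char postProcess;   // character emitted after the element, '\0' for none

        ElementPostProcess(String name, char postProcess) {
            this.name = name;
            this.postProcess = postProcess;
        }
    }

    /** Elements to capture text from (assumed choices of post-process characters). */
    static ArrayList<ElementPostProcess> captureList() {
        ArrayList<ElementPostProcess> list = new ArrayList<>();
        list.add(new ElementPostProcess("p", '\n'));
        list.add(new ElementPostProcess("h", '\n'));
        list.add(new ElementPostProcess("tab", '\t'));  // empty element, emits a tab
        list.add(new ElementPostProcess("s", ' '));     // empty element, emits a space
        return list;
    }

    /** Elements whose content should be skipped entirely. */
    static ArrayList<ElementPostProcess> ignoreList() {
        ArrayList<ElementPostProcess> list = new ArrayList<>();
        list.add(new ElementPostProcess("tracked-changes", '\0'));  // emit nothing
        return list;
    }

    /** Returns the post-process character for a name, or '\0' if it is not listed. */
    static char postProcessFor(ArrayList<ElementPostProcess> list, String localName) {
        for (ElementPostProcess e : list) {
            if (e.name.equals(localName)) {
                return e.postProcess;
            }
        }
        return '\0';
    }
}
```

With lists like these, two words separated only by a tab or s element still come out separated by whitespace, so a word counter downstream does not run them together.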