[hfst-development] HFST3 header and the I/O streams

Hi all,

For those who don't know me, I am a GSoC student developing the hfst-proc
tool for Apertium, which integrates morphological analysis/generation with
HFST transducers into the Apertium MT pipeline. A secondary goal of my
project is to get foma transducers working with hfst-proc, so I have
started on the tools needed to convert foma transducers to the HFST
optimized lookup format.

After working with the HfstInput/OutputStream classes a bit, I have a
question about the design of the HFST3 header processing. On the input
side, header processing is currently split between the HfstInputStream
frontend, where the transducer type is detected, and the backend
implementation classes, which are also aware of the header so that they can
skip past it when loading. Writing the header is also done by the backend
implementations.

My understanding of the header is that it is supposed to encapsulate the
actual transducer. If that is the case, would it not be more sensible to
have all the header processing handled in the frontend classes, and leave
the implementation classes unaware of the header?
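
A minimal sketch of that layering, under loudly stated assumptions: the
header layout here (the bytes "HFST3", a NUL, a type name, a NUL) is a
simplified, hypothetical stand-in and not the real HFST3 format, and
read_header, write_header and backend_read_payload are illustrative names
that do not exist in HFST. The only point is that the frontend consumes and
produces the header, while the backend sees just the payload.

```cpp
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical, simplified header: "HFST3", NUL, type name, NUL.
// This is NOT the real HFST3 layout; it only illustrates the layering.
static const std::string kMagic = "HFST3";

// Frontend (input): consume the header and return the transducer type name.
// On return, the stream is positioned at the first payload byte.
std::string read_header(std::istream& in) {
    std::string magic;
    std::getline(in, magic, '\0');
    // good() is false if the stream failed or ended before the NUL was seen.
    if (!in.good() || magic != kMagic)
        throw std::runtime_error("not an HFST3 stream");

    std::string type;
    std::getline(in, type, '\0');
    if (!in.good())
        throw std::runtime_error("truncated header");
    return type;
}

// Frontend (output): write the header; the backend then appends its payload.
void write_header(std::ostream& out, const std::string& type) {
    out << kMagic << '\0' << type << '\0';
}

// Backend stand-in: read the raw payload with no knowledge of any header.
std::string backend_read_payload(std::istream& in) {
    std::ostringstream buf;
    buf << in.rdbuf();  // copy everything from the current position onwards
    return buf.str();
}
```

With this split, a frontend would call read_header, pick a backend based on
the returned type name, and hand over the same stream; the backend code is
then identical whether or not the transducer was wrapped in a header.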

I also have a question about foma transducer I/O. The HFST-specific
write/read functions in foma's io.c work only with a plain-text format,
while foma's native functions gzip the entire thing. Is there a reason the
HFST functions do not do the same?

Cheers,
--Brian Croom