
#255 Hfst-pmatch preprocessor

future
open
1
2014-07-07
2014-07-07
Mike Voets
No

$ hfst-info
No tests selected; printing known data
HFST info version: 0.1
HFST packaging: hfst 3.7.1
HFST version: 3.7.1
HFST long version: 300070001
HFST configuration revision: $Revision: 3900 $
OpenFst supported
SFST supported
Unicode support: glib


$ uname -a
Linux mike-HP-ProBook-6560b 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:45:01 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux


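This patch adds a simple hfst-pmatch preprocessor that copies the filename of a binary transducer into a regular expression (pmatch) file. To apply the patch, the HFST source directory must be named "hfst/". Because the patch is applied with -p0, run the command from the directory that contains hfst/: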

$ patch -p0 -i hfst-with-pmatch-pproc.patch

The patch modifies configure.ac in hfst/ and Makefile.am in hfst/tools/src/, and adds hfst-pmatch-proc.cc to hfst/tools/src/. Once the patch is applied, build and install the patched HFST with:

$ make && sudo make install

In a language directory (e.g. langs/sme), create a regexp.pmatch file. Write the placeholder @InsertAnalyserBin wherever the analyser reference should go; the preprocessor replaces it with, for example, @bin"analyser-gt-desc.hfst". For example:

/// (e.g. regexp.pmatch in langs/sme/tools/preprocessor)
Define Terminator {!} | {?} | {.} | {,} | {;} | {:};
Define WhiteSpace Whitespace EndTag(WS) ;
Define FormatMarkUp [{<} | {</}] Alpha+ [{>}] EndTag(Format) ;

Define Deliminator # | WhiteSpace | Terminator | FormatMarkUp ;

Define Word LC(Deliminator) @InsertAnalyserBin RC(Deliminator) EndTag(SamWord) ;

Define TOP Word | FormatMarkUp | WhiteSpace ;
///

Then run the preprocessor in the directory that contains regexp.pmatch (here langs/sme/tools/preprocessor):

$ cd langs/sme/tools/preprocessor
$ hfst-pmatch-pproc -i ../../src/analyser-gt-desc.hfst

It compiles the following pmatch file and writes the result to regexp.hfst in binary HFST format:

/// (temporary pmatch file to be compiled, see @bin"../../src/analyser-gt-desc.hfst")
Define Terminator {!} | {?} | {.} | {,} | {;} | {:};
Define WhiteSpace Whitespace EndTag(WS) ;
Define FormatMarkUp [{<} | {</}] Alpha+ [{>}] EndTag(Format) ;

Define Deliminator # | WhiteSpace | Terminator | FormatMarkUp ;

Define Word LC(Deliminator) @bin"../../src/analyser-gt-desc.hfst" RC(Deliminator) EndTag(SamWord) ;
///
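
The substitution shown above can be illustrated with a short Python sketch. It assumes the preprocessing step amounts to a plain textual replacement of the placeholder. The patch itself is written in C++ and its code is not reproduced in this ticket, so the function name insert_analyser and the error handling are hypothetical.

```python
PLACEHOLDER = "@InsertAnalyserBin"


def insert_analyser(template: str, analyser_path: str) -> str:
    """Replace the placeholder with an @bin"..." reference (hypothetical helper)."""
    if PLACEHOLDER not in template:
        # Assumption: a template without the placeholder is treated as an error.
        raise ValueError("template does not contain " + PLACEHOLDER)
    return template.replace(PLACEHOLDER, '@bin"' + analyser_path + '"')


# The Word definition from the regexp.pmatch example.
template = (
    "Define Word LC(Deliminator) @InsertAnalyserBin "
    "RC(Deliminator) EndTag(SamWord) ;"
)
expanded = insert_analyser(template, "../../src/analyser-gt-desc.hfst")
print(expanded)
```

The path is inserted exactly as given on the command line, which is why the temporary file refers to ../../src/analyser-gt-desc.hfst relative to the preprocessor directory.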

Finally, the hfst-pmatch tool can be used like this:

$ echo "Mun ja son." | hfst-pmatch regexp.hfst
<SamWord>mun+Pron+Pers+Sg1+Nom</SamWord><WS> </WS><SamWord>ja+CC</SamWord><WS> </WS><SamWord>son+Pron+Pers+Sg3+Nom</SamWord>.
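
For downstream processing, the tagged output can be split into (tag, text) pairs. The following is a minimal Python sketch that assumes tags are not nested and that untagged text (such as the final "." above, which is not covered by TOP) should be kept with a tag of None. The function name parse_pmatch_output is hypothetical and not part of HFST.

```python
import re

# Matches <Tag>text</Tag>, where the closing tag must repeat the opening one.
TAGGED = re.compile(r"<(?P<tag>\w+)>(?P<text>.*?)</(?P=tag)>", re.DOTALL)


def parse_pmatch_output(line):
    """Split one line of hfst-pmatch output into (tag, text) pairs."""
    spans = []
    pos = 0
    for match in TAGGED.finditer(line):
        if match.start() > pos:
            # Text between tagged spans was not matched by the pmatch rules.
            spans.append((None, line[pos:match.start()]))
        spans.append((match.group("tag"), match.group("text")))
        pos = match.end()
    if pos < len(line):
        spans.append((None, line[pos:]))
    return spans


output = (
    "<SamWord>mun+Pron+Pers+Sg1+Nom</SamWord><WS> </WS>"
    "<SamWord>ja+CC</SamWord><WS> </WS>"
    "<SamWord>son+Pron+Pers+Sg3+Nom</SamWord>."
)
spans = parse_pmatch_output(output)
for tag, text in spans:
    print(tag, repr(text))
```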

1 Attachments
