From: Tommi A. P. <tom...@he...> - 2010-07-14 06:50:03
[Sorry for the slow answer]

2010-07-05, Brian Croom wrote:
> I'm looking into restoring the lookup functionality into libhfst that
> hasn't made it over from HFST2 yet, but I'm not sure what philosophy
> the library should be following. Should lookup/analysis functions
> (and support functions for tokenizing input strings) from the backend
> libraries be driving the lookup? SFST and foma both have such
> functions exposed, while OpenFST does not directly. This approach
> leads to considerable variance in the lookup operation with different
> backends as e.g. foma honors flag diacritics for its lookup while
> SFST does not.

Variance is not a bad thing here. One of the design decisions in HFST3
is that the library can have all sorts of more or less limited backends:
a missing implementation raises an exception, and the programmer can
then recover from it (e.g. by converting to another backend or, in this
case, by using composition and path extraction). So I would go for using
the underlying functions where possible. Of course, where the
functionality actually differs there should be different functions or
signatures; otherwise it would be too confusing, I think. So for flag
diacritics and different tokenizations there could be specialized
functions.

> So would it instead be preferred to follow HFST2 in
> using HFST-specific methods for performing lookups and input string
> tokenization?

As long as it's still possible, this can be done as well. But the main
HFST-specific lookup we need is most likely the one for optimized lookup
transducers.

> I'm also wondering what design decisions have been made regarding
> the role of HFST2's Symbol and Key layers in the new library version.
> The code currently seems to have traces of key table usage which has
> been removed.

I think the aim is to reduce the complexity as much as possible, as the
symbol and key distinction wasn't used for anything in the HFST2 tools.

> And what about HfstTransducer's is_trie member
> variable? Does it have any relation to the Trie class in
> HfstTokenizer.h?

The is_trie variable is for optimizations. Some algorithms, such as
union, are an order of magnitude faster when operating on two
trie-shaped transducers, or on a trie-shaped and a path-shaped one. The
trie backing the default tokenizer is just a lightweight implementation
of the data structure, with no relation to the trie-shaped transducers
of the main library; it's probably the fastest and simplest way to
perform left-to-right longest-match tokenization.

-- 
Tommi A. Pirinen, tietojenkäsittelijälukki sekä kieli-, puhe- ja
käännösteknologimestari
<http://www.helsinki.fi/~tapirine/>
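
For illustration, the left-to-right longest-match tokenization described
above can be sketched roughly as follows. This is a minimal sketch in
Python, not the code in HfstTokenizer.h: the SymbolTrie class, its
methods and the tokenize function are hypothetical names invented for
the example, and passing unknown characters through one at a time is an
assumption about what a sensible fallback looks like.

```python
class SymbolTrie:
    """Minimal symbol trie (hypothetical; not libhfst's Trie class)."""

    def __init__(self):
        self.children = {}      # next character -> child node
        self.is_symbol = False  # True if the path to this node spells a symbol

    def add(self, symbol):
        """Insert one multi-character symbol into the trie."""
        node = self
        for ch in symbol:
            node = node.children.setdefault(ch, SymbolTrie())
        node.is_symbol = True

    def longest_match(self, text, start):
        """Length of the longest known symbol at text[start:], or 0 if none."""
        node = self
        best = 0
        for offset in range(start, len(text)):
            node = node.children.get(text[offset])
            if node is None:
                break
            if node.is_symbol:
                best = offset - start + 1
        return best


def tokenize(text, trie):
    """Split text left to right, always taking the longest known symbol."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Assumption: a character that starts no known symbol is emitted alone.
        length = trie.longest_match(text, pos) or 1
        tokens.append(text[pos:pos + length])
        pos += length
    return tokens


symbols = SymbolTrie()
for sym in ["@P.CASE.NOM@", "+N", "+Pl", "a", "b"]:
    symbols.add(sym)
```

With this symbol set, tokenize("ab+N+Pl", symbols) splits the input into
"a", "b", "+N" and "+Pl", and a flag diacritic such as "@P.CASE.NOM@" is
kept as a single token. Each input character is visited only once per
match attempt, which is why a plain trie is enough here and no
transducer machinery is needed.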