I'm trying to use the modnlp tools with the British National Corpus. The corpus is already in XML format with extensive markup for each word (so it is already tokenized). I'm trying to wrap my head around how I might go about loading it into modnlp. I looked at the example header files that came with modnlp-idx, and they seem to be exactly the same as the text XML files (just without the text?). The indexer seems to want to tokenize the files itself, but as far as I can tell the BNC XML files are already tokenized.
Any suggestions for how I might achieve this? I'd really like to use modnlp for some DDL research in my university courses using the BNC. Any tips would be appreciated!
I've attached a screenshot of a sample XML file from the BNC as well as an included DTD file.
Also, here is a link to the corpus files if that helps.
http://ota.ox.ac.uk/desc/2554
I'm thinking perhaps I should start from raw text instead of the already marked-up XML? I'll give it a try with a smaller corpus.
Starting with raw text (or rather, text with minimal XML markup) would be a good idea. modnlp-idx will tokenise anyway, regardless of whether the corpus is already tokenised, tagged, etc.
However, the indexer should exclude XML tags from indexing anyway, so from the perspective of the indexer it shouldn't really matter whether the text is 'clean' or not.
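For anyone who wants to try the "minimal XML markup" route, here is a rough Python sketch of stripping the word-level BNC markup down to one plain sentence per element. It assumes the BNC XML edition layout (`<s>` for sentences, `<w>` for words, `<c>` for punctuation, no namespace). The output element names `text` and `s` and the helper names `bnc_sentences` and `to_minimal_xml` are made up for illustration, and I have not checked what markup modnlp-idx itself expects, so treat it as a starting point rather than a recipe.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def bnc_sentences(bnc_xml: str) -> list[str]:
    """Return one plain-text string per <s> element of a BNC XML document."""
    root = ET.fromstring(bnc_xml)
    sentences = []
    for s in root.iter("s"):
        # <w> holds a word and <c> a punctuation mark. The BNC keeps the
        # trailing space inside the element, so concatenating the text
        # of these elements rebuilds the sentence.
        tokens = [el.text or "" for el in s.iter() if el.tag in ("w", "c")]
        sentences.append("".join(tokens).strip())
    return sentences


def to_minimal_xml(sentences: list[str]) -> str:
    """Wrap sentences in hypothetical <text>/<s> elements (names are assumptions)."""
    body = "\n".join(f"<s>{escape(t)}</s>" for t in sentences)
    return f"<text>\n{body}\n</text>\n"


# A tiny hand-written fragment in the style of the BNC XML edition.
sample = """<bncDoc><wtext><div><p>
<s n="1"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="cat" pos="SUBST">cat </w><w c5="VVD" hw="sit" pos="VERB">sat</w><c c5="PUN">.</c></s>
</p></div></wtext></bncDoc>"""

print(to_minimal_xml(bnc_sentences(sample)))
```

Running this over each corpus file and writing the result out would give text that the indexer can tokenise itself, at the cost of dropping the BNC part-of-speech and lemma attributes.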
Thanks for your suggestion. I've been trying to index again with raw text (in .xml files without any tags). I've attached an example.
The indexer is saying "Warning: modnlp.idx.database.EmptyFileException: File or URI contains no indexable tokens" for each of my files. In the tutorial, it says to close the indexer and reopen it, which then shows that these files are indexed, but when I run teccli and do a simple text search ("the" or "this") it says "Returned 0 lines matching your query".
Essentially, what I'm trying to figure out is the simplest, most minimal way to index a new corpus without the need to analyze sub-corpora. The help file says "It is not essential that data and meta-data be stored in seperate files or encoded in XML." However, it seems I need something... perhaps DTD files, or some changes to the idxmgr.properties file? Are DTD files required, or can I index raw text without them?
Last edit: Mike Kgu 2019-06-24
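One thing that may be worth ruling out: a file with an .xml extension and no tags at all is not well-formed XML, because XML requires a single root element. Nothing in this thread establishes that this is what triggers the EmptyFileException, so the sketch below is only a way to test the idea. The root element name `text` and the helper names are hypothetical, not something taken from the modnlp documentation.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(content: str) -> bool:
    """Return True if the string parses as a well-formed XML document."""
    try:
        ET.fromstring(content)
        return True
    except ET.ParseError:
        return False


def wrap_plain_text(raw: str, root_tag: str = "text") -> str:
    """Wrap raw text in a single root element, escaping &, < and >.

    The default tag name is an assumption made for this example.
    """
    return f"<{root_tag}>{escape(raw)}</{root_tag}>\n"


raw = "This is the first line of a tag-free file. Fish & chips."
print(is_well_formed(raw))                   # the untouched raw text
print(is_well_formed(wrap_plain_text(raw)))  # the same text inside a root element
```

If the wrapped version indexes and the bare version does not, that would suggest the indexer wants at least a root element, even if no DTD is supplied.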