modnlp / Discussion / General Discussion: Using modnlp with BNC Corpus

Mike Kgu - 2019-05-17

I'm trying to use the modnlp tools with the British National Corpus. The corpus is already in XML format with extensive markup for each word (so it is already tokenized). I'm trying to wrap my head around how I might go about loading it into modnlp. I looked at the example header files that came with modnlp-idx, and they seem to be exactly the same as the text XML files (just without the text?). The indexer seems to want to tokenize the files itself, but as far as I can tell the BNC XML files are already tokenized.

Any suggestions for how I might achieve this? I'd really like to use modnlp for some DDL research in my university courses using the BNC. Any tips would be appreciated!

I've attached a screenshot of a sample XML file from the BNC as well as one of the included DTD file.

Screenshot 2019-05-17 12.26.27.png

Screenshot 2019-05-17 12.27.52.png

Mike Kgu - 2019-05-17

Also, here is a link to the corpus files if that helps.
http://ota.ox.ac.uk/desc/2554

Mike Kgu - 2019-05-19

I'm thinking perhaps I should start from raw text instead of the already marked-up XML? I'll give it a try with a smaller corpus.

S Luz - 2019-05-30

Starting with raw text (or rather, text with minimal XML markup) would be a good idea. modnlp-idx will tokenise anyway, regardless of whether the corpus is already tokenised, tagged, etc.

However, the indexer should exclude XML tags from indexing, so from the perspective of the indexer it shouldn't really matter whether the text is 'clean' or not.
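
For anyone taking the raw-text route, here is a minimal sketch of reducing a BNC XML file to plain sentences, one per line. It assumes the layout of the BNC XML edition, where each sentence is an `<s>` element containing `<w>` (word) and `<c>` (punctuation) elements with no XML namespace. The function name `bnc_sentences` and the sample document are illustrative only, and nothing here is part of modnlp.

```python
import xml.etree.ElementTree as ET


def bnc_sentences(xml_text):
    """Yield the plain text of each <s> element, dropping all markup.

    Assumes BNC-style markup: <s> sentences holding <w> and <c> children.
    """
    root = ET.fromstring(xml_text)
    for s in root.iter("s"):
        # itertext() walks every nested element, so words inside
        # wrappers such as <mw> or <hi> are picked up as well.
        text = "".join(s.itertext())
        # BNC words carry their own trailing spaces; collapse whitespace runs.
        yield " ".join(text.split())


# Hand-written sample in the style of a BNC file (not taken from the corpus).
SAMPLE = """<bncDoc><wtext type="FICTION"><div><p>
<s n="1"><w c5="AT0" hw="the" pos="ART">The </w><w c5="NN1" hw="cat" pos="SUBST">cat </w><w c5="VVD" hw="sit" pos="VERB">sat</w><c c5="PUN">.</c></s>
</p></div></wtext></bncDoc>"""

if __name__ == "__main__":
    for line in bnc_sentences(SAMPLE):
        print(line)
```

Running it on the sample prints `The cat sat.`; the output lines could then be wrapped in whatever minimal markup the indexer is given.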

Mike Kgu - 2019-06-24

Thanks for your suggestion. I've been trying to index again with raw text (in .xml files without any tags). I've attached an example.

The indexer reports "Warning: modnlp.idx.database.EmptyFileException: File or URI contains no indexable tokens" for each of my files. The tutorial says to close the indexer and reopen it, which then shows that these files are indexed, but when I run teccli and do a simple text search ("the" or "this") it says "Returned 0 lines matching your query".

Essentially, I'm trying to figure out the simplest, most minimal way to index a new corpus without needing to analyze sub-corpora. The help file says "It is not essential that data and meta-data be stored in seperate files or encoded in XML." However, it seems I need something: perhaps DTD files, or some changes to the idxmgr.properties file? Are DTD files required, or can I index raw text without them?

Last edit: Mike Kgu 2019-06-24

A00.xml
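
One thing that can be checked independently of modnlp is whether each file is well-formed XML and how much text it actually contains. A file of raw text with no tags at all has no root element, so a standard XML parser will reject it. Whether that is what triggers the EmptyFileException above is an assumption, not something confirmed in this thread. The sketch below uses only the Python standard library; `count_text_tokens`, `wrap_raw_text`, and the `text` root tag are hypothetical names chosen for illustration, not anything modnlp requires.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def count_text_tokens(xml_text):
    """Return the number of whitespace-separated tokens in the document text.

    Returns None when the input is not well-formed XML.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    return len("".join(root.itertext()).split())


def wrap_raw_text(raw, root_tag="text"):
    """Wrap raw text in a single root element, escaping &, < and >.

    root_tag is a hypothetical element name used only for this sketch.
    """
    return "<{0}>{1}</{0}>".format(root_tag, escape(raw))


if __name__ == "__main__":
    raw = "The cat sat on the mat."
    print(count_text_tokens(raw))                 # raw text alone is not XML
    print(count_text_tokens(wrap_raw_text(raw)))  # token count once wrapped
```

A result of `None` or `0` for a file would at least show that the problem lies in the file itself rather than in the indexer configuration.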
  