Hello openNLP users/developers,
I really need your help with the openNLP NameFinder… Let me explain:
I am writing a drug-entity recogniser (NER for drugs) and I'm using the openNLP API from Clojure. I don't have annotated text, but I do have an up-to-date dictionary of drugs (drugbank.xml). The dictionary includes all sorts of information, so it needs a bit of preprocessing to extract the names and synonyms. Anyway, my problem is twofold:
As I said, I don't have annotated text, but I thought maybe I can make some. You see, I do have 383 pharmacology papers in raw text, so I thought: why not use the names from the dictionary to build regex patterns and replace every occurrence of an entry with the appropriate "<START:drug> drug-name <END>" annotation tag?
Now, you may be wondering at this point why on earth I am not using all the names I extracted from drugbank.xml to build a proper openNLP dictionary and do lookup… Well, apart from that not being what I'm trying to do here, I've already tried it and got poor results, simply because names like "Folic acid" are tokenized as 2 tokens rather than 1, so the entry "Folic acid" in the dictionary matches no token! Even if it worked, though, I would still have to train a maxent model to recognise drugs that may not exist in the dictionary (brand new ones, for instance).
My initial approach was a bit fiddly but it sort of paid off in the end. I now have a small program that expects some text and, for each entry in the dictionary (6707 in total), finds and replaces any occurrences of that entry in the text with the format openNLP expects for training. Here is where the 1st problem happens. I can tweak my regex pattern to add spaces inside the entity tag or not, with the following results:
<START:drug> drug-name <END> (with spaces inside)
causes problems for me because I get nested tags. To understand why, think about the words "Folic acid" for example. Let's assume that "Folic acid" is entry 3 in the dictionary and that entry 115 is "acid". The first time round it will produce <START:drug> Folic acid <END>, but when it processes "acid" it will match the word "acid" that is already tagged as part of "Folic acid". You can see where this is going. If you happen to have a complex compound name, and later a slightly less complex one (maybe a word shorter or something), and then a smaller one still, they can easily start to nest, especially when dealing with drug names. Now the easy and straightforward solution to that is to NOT add spaces in the tag, like this:
<START:drug>Folic acid<END> (this will NOT match "acid" in later parsing)
I honestly wasn't expecting that to make any difference to the training process, but as it turns out it breaks it completely. Exceptions everywhere before it even starts!!! Could someone please explain what happens with those spaces around the entity name? How on earth can they make any difference? I can solve my problem with a negative lookbehind assertion in my regex, but that slows things down quite a bit! Remember, I'm dealing with 6707 entries times 383 papers. Clojure's lazy attitude and loop/recur structure sure help a lot…
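A minimal sketch of one way the nesting could be avoided while still emitting the spaced tags, written in plain Java since java.util.regex is directly usable from Clojure. Instead of one regex pass per dictionary entry, it builds a single alternation of all entries sorted longest-first and tags the text in one pass, so "acid" is never tried inside a span already consumed by "Folic acid". The class and method names are hypothetical (not part of openNLP), the entry list is assumed to be non-empty, and the approach is untested at the scale of 6707 entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not openNLP API): tags dictionary entries in a single pass. */
final class DrugAnnotator {

    private DrugAnnotator() {}

    /** Builds one case-insensitive pattern that matches any entry as a whole word. */
    static Pattern compile(List<String> entries) {
        List<String> sorted = new ArrayList<>(entries);
        // Longest first, so "Folic acid" is tried before "acid" at the same position.
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        StringBuilder alternation = new StringBuilder();
        for (String entry : sorted) {
            if (alternation.length() > 0) {
                alternation.append('|');
            }
            // Quote each entry so characters like '(' or '+' in drug names stay literal.
            alternation.append(Pattern.quote(entry));
        }
        // The lookarounds reject matches embedded in a longer word (e.g. "antacids").
        String regex = "(?<![\\p{L}\\p{N}])(?:" + alternation + ")(?![\\p{L}\\p{N}])";
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    /** Wraps every match in "<START:drug> ... <END>", keeping the original casing. */
    static String annotate(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());
            out.append("<START:drug> ").append(m.group()).append(" <END>");
            last = m.end();
        }
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

Because the matcher resumes after the end of each match, text that has already been tagged is never rescanned, and the pattern only needs compiling once for all 383 papers.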
Ok now on to the 2nd problem…
Even when I manually sort out all the nested tags and finally train a maxent model on the newly, automatically annotated papers, I still get very poor results (poorer than the dictionary). I think I can see why, but that contradicts the openNLP documentation and all the examples I've seen so far.
In the openNLP tutorial it seems perfectly normal to have 2 words inside a tag, like:
<START:name> Pierre Vinken <END>
but when the time comes to use the name-finder model you just trained, everything has to be tokenized again. Therefore, "Pierre Vinken" becomes 2 tokens and cannot be recognised. Sometimes just one of the tokens may be recognised as an entity, but without the rest of the name that is not only meaningless but could be misleading. Again, think about Folic acid… Neither "folic" nor "acid" is a drug… even if the name-finder recognises "folic", it's of no use!!! In the same way, "Vinken" on its own is not necessarily a name…
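To make the "Folic acid" concern concrete, here is a minimal sketch of the behaviour being asked for: an entity reported as a range of token indices rather than as a single token, so a two-token name can be recovered whole. Everything in it is hypothetical illustration in plain Java, not openNLP code. The whitespace split stands in for a real tokenizer, and the `[start, end)` index convention is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration (not openNLP API): multi-token entities as token ranges. */
final class TokenSpans {

    private TokenSpans() {}

    /** Splits on runs of whitespace; a stand-in for a real tokenizer. */
    static String[] tokenize(String sentence) {
        return sentence.trim().split("\\s+");
    }

    /**
     * Returns every [start, end) token range whose tokens equal the entity's
     * tokens, ignoring case. Assumes the entity has at least one token.
     */
    static List<int[]> findEntity(String[] tokens, String[] entity) {
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i + entity.length <= tokens.length; i++) {
            boolean match = true;
            for (int j = 0; j < entity.length && match; j++) {
                match = tokens[i + j].equalsIgnoreCase(entity[j]);
            }
            if (match) {
                spans.add(new int[] {i, i + entity.length});
            }
        }
        return spans;
    }

    /** Joins tokens[start, end) back into the entity text. */
    static String cover(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }
}
```

With the tokens of "Patients received Folic acid daily", looking up the tokens of "folic acid" yields the single range [2, 4), and joining that range gives back "Folic acid" rather than "folic" or "acid" alone.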
Anyway, sorry for the massive e-mail but I'm really struggling!
Please help me, I'm at a dead end at the moment! I've tried literally everything… am I missing anything important?
Thanks in advance…keep up the good work!