Frederic Baroz - 2016-09-11

Hello,

I apologise in advance for the rather long text.

I have been wrestling with some NLP techniques for a while now. OpenNLP has been of much help so far, but I still have a problem with sentence detection and custom model training. The wiki is well made, but in my opinion it is missing some information on how the learnable sentence detector and tokenizer work (we just know they are max-ent based, but a few details remain, in my opinion, difficult to infer, like what exactly they look at and what features the classifier extracts; maybe it is just a bag of words...).

I am an in-hospital physician and a software engineer working in medical informatics, and I am writing an MD thesis about information retrieval in the clinical domain. Most NLP work in the biomedical area addresses the formal literature, and only a little targets patient documentation. The latter type of document is quite tricky to process, in the sense that you may find pretty much any form of narrative. Of course, there are (more or less) well-formed paragraphs with grammatical sentences, but most documents are made of lists of diagnoses (a bullet to start the sentence, and no period at the end), tables of lab results, and sometimes "notes" made of abbreviated words, pseudo-sentences, etc. Moreover, extracting text from PDF with Tika sometimes introduces errors in the linearity of the extracted text: the text becomes entangled with structural elements like "Page 2 of 3", and words are sometimes truncated for obscure reasons. In the end I find myself with raw text documents that contain weird sentences, most of which have no period at the end and no capital letters. A significant number of these "sentences" are just a few words taken from lab results (e.g. "svga-pO2 7.5 kPa" on one line and "svga-pCO2 4.5 kPa" on the next).
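As a first counter-measure I wrote a small pre-cleaning pass that runs before any sentence detection. This is only a sketch of the idea; the two patterns are assumptions about what the Tika artifacts look like in my corpus, not a general solution:

```python
import re

def clean_tika_text(raw: str) -> str:
    """Remove common PDF-extraction artifacts before sentence detection.

    The patterns are guesses based on what I see in my corpus:
    page-marker lines and words broken across line breaks.
    """
    # Drop page-marker lines such as "Page 2 of 3" (or French "Page 2 sur 3")
    text = re.sub(r"(?m)^\s*Page \d+ (of|sur) \d+\s*$", "", raw)
    # Re-join words hyphenated across a line break ("hospitali-\nsation")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse the runs of blank lines this leaves behind
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text
```

The hyphenation repair is of course risky for tokens that legitimately end with a hyphen before a line break, which is why I only apply it when word characters appear on both sides.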

That being said, my ultimate objective is to perform proper tokenisation for later indexing with Solr (plus a certain degree of morphological analysis). Sentence detection felt like a good thing to do first, because I still have some "real" paragraphs, and medical text contains lots of abbreviations and compound tokens with intra-word punctuation. You may think of "Dr." or "Prof. P.D.", but there are many more abbreviations than I could simply build a lexicon of. They also appear quite frequently and tend to be created rapidly as science evolves. There are also quite specific tokens like molecule names (e.g. 1,25-OH-Vitamine-D) that may be written in many different forms (e.g. 1-25-oh-vitD), since these are clinical texts in which a certain degree of freedom is usually accepted.
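To make the problem concrete, here is a minimal regex sketch of the tokenisation behaviour I am after (a hand-written illustrative rule of my own, not OpenNLP's learnable tokenizer):

```python
import re

# One token = runs of word characters glued together by internal '.', ','
# or '-' (so "1,25-OH-Vitamine-D", "svga-pO2" and "P.D." stay whole),
# optionally ending in a period; anything else is a single character.
TOKEN_RE = re.compile(r"\w+(?:[.,-]\w+)*\.?|\S")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)
```

The optional trailing period keeps "Dr." whole, but it also glues sentence-final periods to the last word of a sentence, which is exactly the ambiguity I was hoping a trained model would resolve for me.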

Because of all these specificities, and because sentence detection is usually an intrinsic part of the tokenisation process, I thought it would be a good idea to actually perform some. I first wrote some heuristic rules but rapidly found that, for such complex text, my idea was doomed. I then tried sentence detection with the shipped French models (my text is in French, what's more), but the results are not very good, and I understand that most of the annotated data came from news articles, which are syntactically very different from my use case. I finally ended up training a sentence detector model on my own data, but I could not figure out how to transform the "dirty" Tika-extracted text into real sentences. Specifically, I do not know whether I should delete some parts of sentences, or add a period at the end even though there is none in the original text. And what about bits of text that were injected into larger bits of text by the Tika process? I had no choice but to consider all of these as "new sentences", also because there is a statistical engine behind it and I did not want to introduce any bias into the training data. At the end of this message you will find a few example sentences that I started putting into a training file. No need to tell you that I ran some evaluations and that the results are bad in terms of recall/precision (about 0.4/0.2 with around 5,000 sampled sentences).
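For what it is worth, my understanding from the documentation is that the sentence detector's training data is one sentence per line, with an empty line marking a document boundary, so a fragment of my file looks like this (sentences taken from the samples below):

```
• Acutisation d'une insuffisance rénale chronique.
• Anémie normochrome normocytaire.
A son arrivée les constantes sont stables.

Madame est hospitalisée en raison de l'apparition d'un rash érythémateux papulo-vésiculeux symétrique.
```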

So my questions are:
When building a training file for sentence detection, should I clean the data first, and if so, how?
Are there any resources you could point me to that treat the question of proper data preparation, how the sentence detector works internally, and how exactly to build training data in the correct format in that regard?
Do you think that, in my case, it may be a better idea to skip sentence detection completely and directly do tokenization, since my texts actually contain few grammatical sentences?
I read that 15,000 example sentences are probably enough for training the model; is that correct?
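For reference, the recall/precision figures mentioned above are computed over predicted sentence-boundary offsets, roughly as in this minimal sketch (an illustrative helper of my own, not OpenNLP's SentenceDetectorEvaluator):

```python
def boundary_prf(gold: set[int], predicted: set[int]) -> tuple[float, float, float]:
    """Precision, recall and F1 over sentence-boundary character offsets."""
    tp = len(gold & predicted)                  # boundaries found in both sets
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```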

Thank you in advance for your time and knowledge!
Frédéric Baroz

PS: here are some "sentences" that I have in my training file:

Médecin chef de service  
Tél. :  +41-22.372.94.22 
Fax :  +41-22.372.94.60 
HOSPITALISATION 
Tél. : +41-22.372.82.43 
Fax : +41-22.372.94.66 
POLICLINIQUE 
Tél. : +41-22.372.94.23 
Fax: +41-22.372.94.70 
Dermatologie générale 
Dermatologie spéciale 
DIAGNOSTICS SECONDAIRES 
• Acutisation d'une insuffisance rénale chronique. 
• Anémie normochrome normocytaire. 
• Hypovitaminose D (61 nmol/l). 
• Souffle aortique systolique à 2/6 et carotidien droit. 
• Douleur à la cheville gauche sur probable arthrose. 
SYNTHESE DE L’HOSPITALISATION ET PRISE EN CHARGE DES PROBLEMES  
Madame est hospitalisée en raison de l’apparition d’un rash érythémateux papulo-vésiculeux symétrique. 
On ne retrouve pas d’introduction de nouveaux médicaments mais une perfusion de Ferinject® quelques jours auparavant.
A son arrivée les constantes sont stables. 
Le laboratoire ne retrouve pas de syndrome inflammatoire. 
On note dans l’évolution des lésions, l’apparition d’un érythème pétéchial à 24 heures ne disparaissant pas à la vitropression à la face interne d’hémi-cuisse droite pour lequel nous excluons une vasculite (complément C3 et C4, anticorps anti-S nucléaires et ANCA dans la norme). 
p-glucose  -                
p-protéine C-réactive  -                   
p-sodium mmol/l 136-144              141 143 
p-potassium mmol/l 3.6-4.6              4.0 3.6 
p-chlorure  -                 
p-CO2 total  -              
p-osmolalité  -                
p-osmolalité calculée  -                      
p-trou anionique  -                       
p-trou osmolaire   -                       
p-magnésium total  -              
p-calcium total/corrigé  /  - / -            / / 
p-phosphates  -              
p-urée mmol/l 3.2-7.5              9.5 6.7 
p-créatinine µmol/l 44-80                     98 100 
p-protéines  -                  
p-urates  -                
p-CK totale/p-CK MB  /  - / -                     / / 
p-LDH  -