From: Gooch, P. <phi...@ts...> - 2009-01-30 09:45:23
|
I've done a similar thing, but the rule quoted below only works for straight apostrophes. If you're dealing with data in ISO-8859-1 or UTF-8 encoding, you'll often have to handle 'curly' apostrophes as well, and GATE seems to have difficulty with these. The only way I've found to catch them reliably is with this pattern:

    (
      {Token.kind == punctuation, Token.position == endpunct} |
      {Token.kind == punctuation, Token.string == "'"}
    )
    {Token.string == "s"}

For curly double quotes, I'm not sure; GATE annotates them as DEFAULT_TOKEN.

Phil

-----Original Message-----
From: Philip A Grim II [mailto:pg...@da...]
Sent: 30 January 2009 04:46
To: gat...@li...
Subject: Re: [gate-users] how to include apostrophes in words

A JAPE rule like this would do the trick.

    Rule: Contraction
    (
      {Token.kind == word}
      {Token.string == "'"}
      {Token.string == "s"}
    ):contraction
    -->
    {
      // Annotations matched by the pattern labelled "contraction"
      gate.AnnotationSet contraction = (gate.AnnotationSet)bindings.get("contraction");
      gate.FeatureMap features = Factory.newFeatureMap();
      // Add one "contraction" annotation spanning the whole match, e.g. doctor's
      annotations.add(contraction.firstNode(), contraction.lastNode(), "contraction", features);
    }

--------------------------

Hi,

I want to tokenize words containing apostrophes. For example, with the default tokenizer "doctor's" is split into two tokens. I want to treat it as one token. Please help.

Thanks in advance.

--
Mehnaz Adnan
Ph.D. Candidate, Department of Computer Science-Tamaki
University of Auckland
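One possible workaround for the curly-apostrophe problem described above is to normalise the text before it reaches the tokeniser, so that a rule matching the straight apostrophe sees every variant. The sketch below is plain Java with no GATE dependency. The class name `ApostropheNormalizer` and the choice of which code points to map are assumptions made for illustration, not anything taken from GATE or from the messages in this thread.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of GATE): maps curly apostrophes to U+0027. */
final class ApostropheNormalizer {

    // Assumed set of variants: U+2018 and U+2019 (single quotation marks)
    // and U+02BC (modifier letter apostrophe). Extend as the data requires.
    private static final Pattern CURLY = Pattern.compile("[\u2018\u2019\u02BC]");

    private ApostropheNormalizer() {
        // static utility, no instances
    }

    /** Returns the text with each curly apostrophe replaced by a straight one. */
    static String normalize(String text) {
        // Each match is a single char replaced by a single char, so the
        // length of the text and all character offsets are unchanged.
        return CURLY.matcher(text).replaceAll("'");
    }

    public static void main(String[] args) {
        System.out.println(normalize("the doctor\u2019s notes"));
    }
}
```

Because the replacement is one character for one character, offsets computed on the normalised text still line up with the original. The normalised string could then be handed to GATE in the usual way, for example through `Factory.newDocument(String)`. Note that this also rewrites opening single quotation marks, which may not be wanted if the quoting style has to be preserved.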
|