Thread: [Pyparsing] To parse any language character set
From: Ujjaval S. <usm...@gm...> - 2008-10-27 04:39:56
Hi everyone,

I need to parse strings that mix English with arbitrary Unicode characters from any Asian or European language.

The strings have the following format:

    ABC|[any unicode character]|[any unicode character]|XYZ

In the string above, ABC and XYZ are literals that mark the start and end of the string, while '|' is the delimiter for the content in between.

How can I use pyparsing to parse this kind of string? As the outcome I should have a list of the Unicode strings that sit between ABC and XYZ. These strings are separated by '|'.

Thanks,
Ujjaval
From: Eike W. <eik...@gm...> - 2008-10-27 15:02:06
On Monday 27 October 2008, Ujjaval Suthar wrote:
> Hi everyone,
>
> I need to parse strings with mix of English and any other unicode
> characters from any Asian or European languages.
>
> The format of strings is like following:
>
> ABC|[any unicode character]|[any unicode character]|XYZ

Hello Ujjaval!

If I understand your question right, CharsNotIn is the parser you are looking for. I don't see any general problem with Unicode.

As you seem somewhat knowledgeable about the requirements of Asian languages, you could maybe propose a parser for words in Asian languages (or even post a patch).

Here is an example for CharsNotIn:
http://pastebin.com/f7d6a3331

I hope this helped you.

Kind regards,
Eike.
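The pastebin example itself is not reproduced in this thread, so the following is only a minimal sketch of how CharsNotIn might be applied to the format in the question. It is not Eike's original code. The names `sep`, `end_marker`, `field` and `parser` are made up for illustration, and the negative lookahead that stops at the closing XYZ is an assumption about what the final grammar needs.

```python
from pyparsing import CharsNotIn, Keyword, OneOrMore, StringEnd, Suppress

sep = Suppress("|")

# Hypothetical helper: "XYZ" counts as the terminator only at the very end.
end_marker = Keyword("XYZ") + StringEnd()

# A field is any run of characters other than '|', Unicode included.
field = CharsNotIn("|")

# The negative lookahead (~end_marker) keeps the closing XYZ out of the fields.
parser = (
    Suppress(Keyword("ABC"))
    + OneOrMore(sep + ~end_marker + field)
    + sep
    + Suppress(Keyword("XYZ"))
)

fields = list(parser.parseString("ABC|应用|iöü|XYZ"))
# fields == ['应用', 'iöü']
```

CharsNotIn does not skip leading whitespace, so fields written with spaces around the '|' would keep those spaces unless they are stripped afterwards.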
From: Eike W. <eik...@gm...> - 2008-10-27 15:30:25
On Monday 27 October 2008, Eike Welk wrote:
>
> Here is an example for CharsNotIn:
> http://pastebin.com/f7d6a3331

I just see that pastebin can't correctly work with Asian characters. But I guess you understand how the example was meant anyways. Just paste some Asian characters into the example strings and replace these numbers (HTML entities?) with them. The original characters were taken from Chinese and Japanese I-Pod ads.

Kind regards,
Eike.
From: Ujjaval S. <usm...@gm...> - 2008-10-28 02:44:36
Hi Eike,

That's exactly what I wanted. Thanks for that. It worked for me.

One more question, following on from what I've done, which is probably really stupid. I want to end each text with a new line character. For example:

    text6 = 'ABC | iöü | 应iöü | XYZ\r'

Here, I want to parse this as a string that starts with 'ABC' followed by '|' and ends with '\r'. I need everything in between, with '|' as the delimiter, in a list, including 'XYZ' as the last element in this case. To parse such a sentence, I changed your parser code to the following:

    start_kw = Keyword('ABC')
    fieldContents = Optional(CharsNotIn('|'), '')
    fields = delimitedList(fieldContents, '|', False)
    fieldSep = Literal('|').suppress()
    the_parser = (start_kw + fieldSep + fields + Literal('\r').suppress())

I can't get it to work. Could you tell me what I am doing wrong?

Thanks,
From: Eike W. <eik...@gm...> - 2008-10-29 16:18:14
Hello Ujjaval!

On Tuesday 28 October 2008, you wrote:
> Now to parse such sentence, I changed your parser code to the
> following: Here, I want to parse this string as a string that
> starts with 'ABC' followed by '|' and ends with '\r'. I need
> everything in between with '|' as delimiter in a list including
> 'XYZ' as last element in this case.

Look at:
    LineEnd()

Your parsers normally don't see '\n' because the whitespace is removed by the parsing machinery. If you want to use the end-of-line frequently as an element in your grammar, you could tell Pyparsing that '\n' should not be treated as whitespace:
    ParserElement.setDefaultWhitespaceChars('\t ')

But then you have to take care of all the newlines yourself, which might become tedious. Look at
    indentedBlock(...)
as an example of how Paul (Pyparsing's author) does it. (I use indentedBlock myself.)

Kind regards,
Eike.
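To make the whitespace point concrete, here is a small sketch contrasting the default behaviour with a grammar that handles line ends explicitly. The sample text and the names `flat`, `word`, `line` and `by_line` are invented for illustration. The example uses '\n', which is what LineEnd() matches; the string in the question ends in '\r' instead.

```python
from pyparsing import Group, LineEnd, OneOrMore, ParserElement, Word, alphas

text = "one two\nthree\n"

# Default behaviour: '\n' is skipped as whitespace, so line structure is lost.
flat = OneOrMore(Word(alphas)).parseString(text).asList()
print(flat)  # ['one', 'two', 'three']

# Stop treating '\n' as whitespace. This is a global setting that applies
# to parser elements created after the call.
ParserElement.setDefaultWhitespaceChars(" \t")
try:
    word = Word(alphas)
    line = Group(OneOrMore(word)) + LineEnd().suppress()
    by_line = OneOrMore(line).parseString(text).asList()
finally:
    # Restore pyparsing's usual default so other grammars are unaffected.
    ParserElement.setDefaultWhitespaceChars(" \n\t\r")

print(by_line)  # [['one', 'two'], ['three']]
```

Because the setting is global, restoring the default afterwards (as in the `finally` clause) avoids surprising other grammars defined later in the same program.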
From: Ujjaval S. <usm...@gm...> - 2008-11-03 01:05:23
Hi Eike,

Thanks for that. Actually, the reason my grammar was not working is that I had to put \r inside CharsNotIn(), where I only had '|'. That did the trick for me.

Cheers,
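The final parser is not shown in the thread, so the following is only a sketch of the fix described above: with just '|' excluded, the last field swallows the trailing '\r' and nothing is left for the terminator to match. The helper `make_parser`, the whitespace-stripping parse action and the `leaveWhitespace()` call are assumptions of this sketch. The last is there because '\r' is among pyparsing's default whitespace characters and would otherwise be skipped before the literal is tried.

```python
from pyparsing import (
    CharsNotIn, Keyword, Literal, ParseException, Suppress, ZeroOrMore,
)


def make_parser(excluded):
    """Build an ABC|...|...<CR> parser; `excluded` lists chars a field may not contain."""
    sep = Suppress("|")
    # Strip the padding around each field (an addition of this sketch).
    field = CharsNotIn(excluded).setParseAction(lambda t: t[0].strip())
    # '\r' is skippable whitespace by default, so match it without skipping
    # (leaveWhitespace is an assumption of this sketch, not from the thread).
    carriage_return = Suppress(Literal("\r")).leaveWhitespace()
    return (
        Suppress(Keyword("ABC"))
        + sep
        + field
        + ZeroOrMore(sep + field)
        + carriage_return
    )


text6 = "ABC | iöü | 应iöü | XYZ\r"

# With '\r' excluded from the fields, the terminator is left for the end.
fields = list(make_parser("|\r").parseString(text6))
# fields == ['iöü', '应iöü', 'XYZ']

# With only '|' excluded, the last field consumes '\r' and parsing fails.
try:
    make_parser("|").parseString(text6)
    swallowed_cr = False
except ParseException:
    swallowed_cr = True
# swallowed_cr is True
```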