Unicode support and a Q about newlines

  • I'm trying to implement a partial parse of C#, and in looking through the C# grammar I see that a lot of tokens are defined in Unicode terms.  For instance, instead of being specified as alphanumerics plus "_", identifiers are defined as being made up of Unicode characters of classes Lu, Ll, Lt, Lm, Lo, or Nl.  In looking at the source code that I have, I can probably ignore the issue, but I have no experience with Unicode, so I'm wondering if that is going to bite me later.  The tool I'm writing will be used by folks all over the world.  Does pyparsing have any support for Unicode?

    Also, I'm wondering what pyparsing accepts as valid newline tokens.  Is '\n' sufficient?  At some time or another I've used any one of ['\n', '\r', '\n\r', '\r\n'].  Subversion seems to do funny things with adding the '\r' and I'm pulling the C# code directly from svn.

    BTW, pyparsing is great.  Thanks.

    • Paul McGuire

      Larry -

      Pyparsing includes *some* Unicode support, but I'm sure there is room for improvement.  In late 2004, Gavin Panella sent me some updates to pyparsing to add better Unicode behavior, but I have not done much to overtly add Unicode-specific features.  There are also a couple of example programs that have been sent in, which I include in the pyparsing examples directory, so they may be of some help to you.

      I am not very Unicode-savvy, so your description of "classes Lu, Ll, Lt, etc." doesn't really mean much to me.  If these translate into character sequences, perhaps you could use pyparsing's srange() to define these as standard classes, similar to what is already done for alphas, nums, alphas8bit, etc., and I'd be happy to include these as part of the standard set of helper strings.
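
For reference, the classes Lu, Ll, Lt, Lm, Lo, and Nl are Unicode general categories (the letter categories plus letter-like numbers). Below is a minimal sketch of how they might be turned into a character string comparable to alphas, using Python's standard unicodedata module. The helper name chars_in_categories and the choice to stop at U+FFFF are assumptions made for illustration; neither is part of pyparsing.

```python
import unicodedata

# The general categories named in the question as valid identifier characters.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def chars_in_categories(categories, limit=0x10000):
    """Return a string of every code point below `limit` whose Unicode
    general category is in `categories` (hypothetical helper)."""
    return "".join(
        chr(cp)
        for cp in range(limit)
        if unicodedata.category(chr(cp)) in categories
    )

# Assumption: only the Basic Multilingual Plane is scanned, to keep the string small.
unicode_letters = chars_in_categories(LETTER_CATEGORIES)

# Letters plus the underscore mentioned in the question.
identifier_chars = unicode_letters + "_"
```

A string like identifier_chars could then be handed to pyparsing's Word in the same way alphas is, for example Word(identifier_chars).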

      Newlines in pyparsing are purely '\n's, although I've seen Windows apps interpret the 2-byte sequence <CR><LF> as '\n'.

      -- Paul

    • John Beisley

      I don't know if it's exactly what you want, but you could try opening the file for reading with Python's "U" (universal newlines) mode, e.g.:

      myFile = file("somefile", "rU")

      The "U" flag should get Python to translate newlines in all their various forms into "\n".

      • Good suggestion.  I'll do that for code that I read from the file system.  The tool also downloads code into a string directly from svn, so it's not a complete solution, but I may take a similar approach and write my own pre-filter.
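
A pre-filter of that kind might look roughly like the sketch below, which maps each of the newline forms listed in the original question to '\n' before the string is handed to the parser. The function name normalize_newlines is hypothetical, and treating '\n\r' as a single line break is an assumption based on the list in the question.

```python
def normalize_newlines(text):
    """Map '\r\n', '\n\r', and a lone '\r' to '\n' (hypothetical pre-filter)."""
    # Handle the two-character sequences first so that neither one
    # turns into two separate newlines when the lone '\r' is replaced.
    text = text.replace("\r\n", "\n")
    text = text.replace("\n\r", "\n")
    return text.replace("\r", "\n")
```

The normalized string can then be passed to parseString() just like text read from a file opened in "U" mode.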