Re: [Flex-devel] My fork of flex, that supports Unicode
From: John P. H. <jph...@gm...> - 2014-06-14 08:32:58
Does your EBCDIC support allow the specification of the particular EBCDIC code page? If it supports 1047 you have saved me from maintaining my own hack. Once your branch is merged into the base, that is.

On 14/06/14 02:38, Will Estes wrote:
> On Saturday, 14 June 2014, 2:20 am +0200, Mariusz Pluciński <mpl...@mp...> wrote:
>
>> I'm writing, because I have recently done some hacking in flex. My
>> goal was to add Unicode support (the feature that I see as very
>> important these days). Yesterday, I published the result as a fork
>> of your repository on Github. It is available at this address:
>> https://github.com/mplucinski/flex
> Awesome.
>
>> Using my fork, it is possible to write rules in ".l" files that
>> capture non-ASCII characters (the source must be encoded in UTF-8).
>> Generated scanners are theoretically able to handle any existing
>> character encoding, but for now only ASCII, UTF-8 and EBCDIC are
>> available (the last one has been added just for testing purposes).
>> An example of such a scanner is available in
>> https://github.com/mplucinski/flex/blob/master/tests/test-unicode-nr
>> : scanner.l is able to properly parse the test.input file.
> So one of the things I'm doing is rewriting the test suite to use automake's parallel test suite support as it makes the test suite much easier to maintain, assuming I can get all the tests rewritten. I've got about 20 tests rewritten, with the caveat that I'm rewriting the easy tests first.
>
>> My code does not inherit from the version that is in flex's
>> "to.do/unicode" directory, as that one uses wchar_t (which I see as
> That's an old solution kept around for purely historical reasons until something better comes along.
>
>> a wrong way), and does not provide any way to deal with various
>> character encodings. My version uses char32_t, which makes it
>> possible to deal with all 17 Unicode planes on all modern platforms
>> (as far as I know, at least one popular operating system still
>> defines wchar_t as 16 bits long). Also, my version makes it quite
>> simple to support virtually any character encoding, even ones with
>> variable character size.
>>
>> What I would like to accomplish is to get my changes integrated
>> with the main line of the flex project. However, I am aware that my
>> code is not yet mature enough - I'm pretty sure there are big and
>> small issues with new features, as well as regressions (even though
>> all existing test cases pass).
> Starting with the test suite is good though.
>
>> I currently see a few things that need urgent improvement:
>>
>> - support for more encodings - at least UTF-32 and both variants of
>> UTF-16 (these should be builtins, I think). For others, an
>> interface for the programmer to add his own conversion function
>> would be sufficient. Optional support of "iconv" may be another
>> approach.
> This all sounds reasonable, understanding that I'm not anything like a Unicode expert.
>
>> - output binary size - as the character classes array now may have
>> up to 65536*17+1 elements - on a 64-bit platform it gives almost
>> 9 MB of data in the final binary. Not to mention the intermediate
>> .c file...
> Yeah but what are the options to reduce the size of the output that don't require a lot of code complexity?
>
>> - generation speed - I'm not sure if it may be greatly improved,
>> but generating even the simplest Unicode scanner takes a long time
>> (over 30 seconds on my machine). This definitely needs at least
>> profiling.
> Sure.
>
>> I will be working on those issues in upcoming weeks, but meanwhile:
>>
>>
>> What do you think, would it be a good idea to introduce my changes
>> into an official flex release? If yes, may I ask you to look into
>> my commits and point out issues that should be resolved before such
>> a merge?
> The standard thing to do is to submit a pull request and interested persons can comment.
>
> You'll have a lot of rebasing to do if you aren't following all the 2.6.0 changes, but don't let that hold you back.
>
>> Your review would be a great help for me. I would be especially
>> happy to hear about some corner cases I missed, or solutions that
>> do not fit very well with the general flex approach.
>>
>>
>> I'm looking forward to hearing from you.
> Thanks for your work and interest in flex. We'll see about your changes.
>
>> (if anyone on the mailing list would like to also take a look there,
>> it would be great to hear from you too!)
> Amen to this; many hands and all that.
>
>>
>> Regards,
>> Mariusz Pluciński
>> http://www.mplucinski.com/
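
For readers following the thread, here is a minimal C sketch of the two technical points quoted above: why a 32-bit code point type is needed to cover all 17 planes, and where the "almost 9 MB" figure comes from. It is not taken from the fork. The names `codepoint_t`, `table_bytes` and `utf8_decode` are hypothetical, `codepoint_t` stands in for C11's `char32_t` (which is a typedef of `uint_least32_t`), and the 8 bytes per table entry is an assumption based on the "64-bit platform" remark.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for C11 char32_t, which is uint_least32_t. */
typedef uint_least32_t codepoint_t;

/* 17 planes of 65536 code points each: U+0000 .. U+10FFFF. */
#define UNICODE_CODE_POINTS (17u * 65536u)

/* Bytes used by a table with one entry per code point plus one extra
 * slot, i.e. the 65536*17+1 elements mentioned in the message. */
size_t table_bytes(size_t entry_size)
{
    return ((size_t)UNICODE_CODE_POINTS + 1u) * entry_size;
}

/* Decode one UTF-8 sequence from s (n bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the input is
 * truncated, overlong, a surrogate, or above U+10FFFF. */
size_t utf8_decode(const unsigned char *s, size_t n, codepoint_t *out)
{
    size_t len, i;
    codepoint_t cp, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)                { len = 1; cp = s[0];        min = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;                   /* stray continuation or invalid lead byte */
    if (n < len)
        return 0;                   /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte */
        cp = (cp << 6) | (codepoint_t)(s[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                   /* overlong, out of range, or surrogate */
    *out = cp;
    return len;
}
```

With 8-byte entries, `table_bytes(8)` is 1114113 * 8 = 8912904 bytes, which matches the "almost 9 MB" estimate. Decoding the four bytes F0 9F 98 80 yields U+1F600, a code point outside the first plane that a 16-bit `wchar_t` cannot hold in a single unit.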