From: Hideo at Y. <hid...@gm...> - 2008-11-14 16:05:57
Hi. Good to hear a response.

For the license, GNU/Classpath is ok with me.

On my part, I've done some experimental fixes to see what happens if UTF-8 is passed to abcl, and I've noticed something that might be troublesome.

I tried this:
(1) Replace the hardcoded "ISO-8859-1" in Stream.java with "UTF-8".
(2) Change swank-abcl.lisp so that it will accept 'utf-8-unix as the :coding-system parameter.

With just those two modifications I ran slime, and it worked fine. Strings with Japanese text passed from slime were parsed and printed by abcl. However, the length function returned the number of bytes rather than the number of characters in the string. (In most cases a Japanese character is encoded as 3 bytes in UTF-8.) In Allegro Common Lisp, the number of characters was returned. I haven't located where string objects get created within abcl, so I don't know the cause of this yet.

I have also been looking at the code of Stream.java, FileStream.java and Socket.java. I tried to modify it to accept the external-format argument, but I haven't succeeded yet. The main question is: what should I compare the LispObject that holds the external-format parameter with?

Cheers,
Hideo.

On Sat, 15 Nov 2008 00:42:03 +0900, XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX wrote:

> Hi Hideo,
>
> I haven't forgotten your submission; however, I got stuck on some other work, temporarily. I do have a question though: which license do you want your submission to have? If you don't mind me pointing out, if you choose GNU/Classpath [which is the same as the rest of ABCL], that would be very handy.
>
> Bye,
>
> Erik.
>
>
> On Sat, Nov 8, 2008 at 6:28 AM, Hideo at Yokohama <hid...@gm...> wrote:
>> I wrote the classes that I mentioned below.
>> A charset-aware, seekable, Reader-Writer combo.
>> It is essentially a ByteBuffer with a custom Reader and Writer attached to it.
>> I tried to make it as simple as possible, but I had to introduce some state variables, so the logic is a little bit fragile.
>> You can mix operations on the Reader and Writer, and you can seek.
>>
>> Since I'm not familiar with the abcl internals, this is not a patch to abcl. It is just a couple of independently written classes that you could incorporate into abcl.
>>
>> I have done some very simple tests with a couple of UTF-8 Japanese files.
>>
>> Hope this helps the support for multiple encodings.
>>
>> Hideo.
>>
>> On Sat, 08 Nov 2008 08:11:37 +0900, Hideo at Yokohama <hid...@gm...> wrote:
>>
>>> I have a couple of things to add.
>>>
>>> Before I wrote these classes, I tried Allegro Common Lisp to see what happens when you seek on a character file.
>>>
>>> The seek operation worked, but the position was interpreted in terms of bytes rather than characters. When you seek to a non-character boundary, the read operation returns a bogus character rather than causing an error.
>>>
>>> After I wrote these classes, I realized that random reads and writes would be used in combination. A Reader and Writer pair that communicate well would be needed to do proper buffering. You should be able to write to the file, seek a little bit, then read the file safely.
>>>
>>> So I'm going to try writing such a reader and writer set, something like this:
>>>
>>> public class RandomAccessCharacterFile {
>>>     public RandomAccessCharacterFile(RandomAccessFile f, String encoding) { ... }
>>>     public Reader getReader() { ... }
>>>     public Writer getWriter() { ... }
>>>     public long position() { ... }
>>>     public void position(long newPos) { ... }
>>>     public void close() { ... }
>>> }
>>>
>>> On Sat, 08 Nov 2008 01:27:45 +0900, Hideo at Yokohama <hid...@gm...> wrote:
>>>
>>>> Hi.
>>>>
>>>> I wrote a pair of java.io.Reader and java.io.Writer subclasses that wrap around a java.io.RandomAccessFile, do encoding/decoding, and are seekable.
>>>>
>>>> You can get the current position in the file, as well as set the position.
>>>>
>>>>     long position();
>>>>     void position(long newPosition);
>>>>
>>>> These methods first flush the internal buffers so the file position will be accurate.
>>>>
>>>> You can use these classes for files, and use the standard InputStreamReader/OutputStreamWriter pair for socket streams.
>>>>
>>>> I think these can be incorporated into the abcl streams.
>>>> Please take a look when you have time.
>>>>
>>>> Regards,
>>>> Hideo
>>>>
>>>> On Thu, 06 Nov 2008 09:06:50 +0900, Hideo at Yokohama <hid...@gm...> wrote:
>>>>
>>>>> Hi. Continuing on the stream encoding issue.
>>>>>
>>>>> When I get some time, I'd like to look at what other lisps do and what the ANSI spec says, but for now I will write based on my Java knowledge and experience in general.
>>>>>
>>>>>> I see I was talking about SeekableByteChannel, which is in the JDK (as of 1.4, I think).
>>>>>
>>>>> OK. I didn't know that one either... JDK has a lot of bloat.
>>>>>
>>>>>> There's one other option though, possibly: record how many characters have been read, using a filtering stream, and implementing a .skip() function which skips exactly the number of required characters.
>>>>>
>>>>> I think this plan won't work well for a couple of reasons.
>>>>>
>>>>> 1. The bytes-per-character varies from character to character. Japanese text, or text in any other Asian character set, typically has its local (Japanese) characters mixed with ASCII-range characters. So you can't jump to a file position that is specified in terms of character count. You actually have to check all the bytes starting from the beginning of the file up to the position specified. Even for a backward seek (a rewinding seek), recording how many bytes you have read is not enough.
>>>>> You need to see all the bytes and detect all the character boundaries within the byte stream.
>>>>>
>>>>> 2. Keeping in mind that we want to use the JDK InputStreamReader and OutputStreamWriter, you can't make them discard the contents of their buffers. They won't tell you how many bytes have been consumed within their buffers either. The minimum amount of buffer required to implement those streams is a couple of bytes, but they might have a big buffer, like 8k, and consume that buffer little by little. In that case the position in the underlying stream might be quite different from the position in the converter stream.
>>>>>
>>>>> 3. For that strategy to work, we at least need an encoding converter where the underlying file pos is visible. AFAIK, you can't do that on top of JDK streams, so you are on your own to do the actual encoding conversions.
>>>>>
>>>>>> Thanks for your input. If we can't do any better than what we just discussed, how about doing it this way:
>>>>>>
>>>>>> - Leave most of FileStream intact, meaning it'll still be based on RandomAccessFile
>>>>>
>>>>> Sounds ok.
>>>>>
>>>>>> - Use RandomAccessFile.getFD() to bind an InputStream/OutputStream to the file
>>>>>
>>>>> I'm not sure about this one. I haven't used RandomAccessFile at all. Probably it's ok.
>>>>>
>>>>>> - Use InputStreamReader and OutputStreamWriter to read/write character data on the streams.
>>>>>>
>>>>>> Only in case the streams and reader/writers haven't been used, we'll be able to seek() on the random access file. (Possibly, we can detect this by not creating the reader/writer until actual input is required.)
>>>>>
>>>>> Sounds ok to me.
>>>>>
>>>>> Cheers,
>>>>> Hideo.
>>>>>
>>>>>
>>>>> On Thu, 06 Nov 2008 05:26:18 +0900, XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX wrote:
>>>>>
>>>>>> On Wed, Nov 5, 2008 at 4:27 PM, Hideo at Yokohama <hid...@gm...> wrote:
>>>>>>>
>>>>>>> Hi. Thanks for your time investigating.
>>>>>>>
>>>>>>> Are you talking about java.io.FileInputStream? Or is it a lisp thing? I've never heard of a SeekableChannel, and could not grep it in the abcl source. (SeekableChannel seems to be a C++ thing, according to Google.)
>>>>>>
>>>>>> I see I was talking about SeekableByteChannel, which is in the JDK (as of 1.4, I think).
>>>>>>
>>>>>>>
>>>>>>> If you are talking about the one in the JDK, here is my comment:
>>>>>>>
>>>>>>> Character code conversion is implemented in InputStreamReader and OutputStreamWriter. To support multi-byte encodings, both directions of converters have to maintain some state (i.e. they must have some amount of buffer). So you can't safely seek the underlying stream to a different position.
>>>>>>
>>>>>> hrm. ok. I understand what you're saying. It's a bit disappointing, but I guess it works for the Java world.
>>>>>>
>>>>>> There's one other option though, possibly: record how many characters have been read, using a filtering stream, and implementing a .skip() function which skips exactly the number of required characters.
>>>>>>
>>>>>>> As a result, JDK Readers and Writers (which are character oriented, rather than byte oriented) don't provide seeking functions. For java.io.Reader classes, the closest thing to seek is the 'mark' functionality. You can tell the stream to 'mark' the current position, i.e. remember the current file pos, then read some data, then tell the stream to rewind to the position that was marked.
>>>>>>> The mark function doesn't tell you what the current byte offset is. It just remembers. You can't give an arbitrary integer to a Reader and tell it to seek to that position.
>>>>>>>
>>>>>>> In the lisp world, with my limited lisp knowledge, I guess the safe way to go is to make seeking functionality available only to files that are accessed as raw byte streams.
>>>>>>
>>>>>> I'm not sure how other lisps do it, but I think they may just count the file position in bytes.
>>>>>>
>>>>>>> In all other cases, make the seek functions cause an error. Seek is rather hard to use.
>>>>>>
>>>>>>> They appear in programs that are aware of the binary data layout in files. That's not something that everyone does.
>>>>>>>
>>>>>>> I am curious how the other lisp implementations handle this.
>>>>>>
>>>>>> Thanks for your input. If we can't do any better than what we just discussed, how about doing it this way:
>>>>>>
>>>>>> - Leave most of FileStream intact, meaning it'll still be based on RandomAccessFile
>>>>>> - Use RandomAccessFile.getFD() to bind an InputStream/OutputStream to the file
>>>>>> - Use InputStreamReader and OutputStreamWriter to read/write character data on the streams.
>>>>>>
>>>>>> Only in case the streams and reader/writers haven't been used, we'll be able to seek() on the random access file. (Possibly, we can detect this by not creating the reader/writer until actual input is required.)
>>>>>>
>>>>>> What would you say to this strategy? Will it work? Or not?
>>>>>>
>>>>>> Bye,
>>>>>>
>>>>>> Erik.
>>>>>>
>>>>>>> Cheers,
>>>>>>> Hideo
>>>>>>>
>>>>>>> On Wed, 05 Nov 2008 05:35:28 +0900, XXXXXXXXXXXXXX XXXXXXXXXXXXXXXXXX wrote:
>>>>>>>
>>>>>>>> 2008/11/1 Hideo at Yokohama <hid...@gm...>:
>>>>>>>>>
>>>>>>>>> I took a look at the tickets. Thanks for adding them!
>>>>>>>>>
>>>>>>>>> For #13 I looked at FileStream.java. If _setFilePosition is required even for character streams, I think there is no easy and safe way to implement multiple encoding handling. But since abcl can read from network connections, on which you can't do random access, I believe there is a way to keep the seeking requirement out of the way of character (non-binary) streams.
>>>>>>>>
>>>>>>>> I added comments to ticket #13; I think we could use FileInputStream: it supports a SeekableChannel interface which allows getting and setting of the file position. What would you think about that solution?
>>>>>>>>
>>>>>>>> Bye,
>>>>>>>>
>>>>>>>> Erik.
>>
>> --
>> Opera's revolutionary e-mail client: http://jp.opera.com/mail/

--
Opera's revolutionary e-mail client: http://jp.opera.com/mail/
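
For reference, two effects discussed in this thread can be reproduced in plain Java: a byte count differing from a character count for UTF-8 text, and a bogus character appearing when decoding starts at a non-character boundary. The sketch below is only an illustration. The class and method names are made up for this example and are not part of abcl, and the ISO-8859-1 case shows just one way a length equal to the byte count can arise, not necessarily the cause inside abcl.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from abcl.
class Utf8LengthDemo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of characters (Unicode code points) in the string.
    static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length of the string obtained when UTF-8 bytes are decoded as
    // ISO-8859-1: every byte becomes one char, so this equals the byte count.
    static int lengthWhenDecodedAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1).length();
    }

    // Decode the UTF-8 bytes starting at an arbitrary byte offset, as a
    // byte-based seek followed by a character read would.
    static String decodeFromByteOffset(String s, int offset) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, offset, utf8.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u65e5\u672c\u8a9e"; // three Japanese characters, 3 bytes each in UTF-8
        System.out.println("UTF-8 bytes:          " + utf8ByteCount(s));
        System.out.println("characters:           " + characterCount(s));
        System.out.println("length via Latin-1:   " + lengthWhenDecodedAsLatin1(s));
        // Offset 3 is a character boundary; offset 1 is not, so the decoder
        // substitutes U+FFFD for the broken leading bytes.
        System.out.println("decode from offset 3: " + decodeFromByteOffset(s, 3));
        System.out.println("decode from offset 1: " + decodeFromByteOffset(s, 1));
    }
}
```

The String constructor used in decodeFromByteOffset replaces malformed input instead of throwing, which matches the "bogus character rather than an error" behaviour described for seeking to a non-character boundary.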