From: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX - 2008-10-24 19:23:03
|
The common-lisp.net folks have been so kind as to set up a Trac instance (ticket manager/wiki/source code browser) for us. It's available at http://trac.common-lisp.net/armedbear/ Unfortunately it's read-only for anybody without a c-l.net account. If you want to create tickets, please mail the list and I (or any of the other people with write access) will take over. If you plan to help us analyse and document parts of the system into the wiki pages, we can see about requesting c-l.net logins. Anybody with an interest in ABCL, please speak up so we can file your defects/enhancement requests/remarks/system analysis/etc into the Trac system. Note: this does not concern J, only the Common Lisp part of the ABCL/J project. Bye, Erik. |
From: Hideo at Y. <hid...@gm...> - 2008-10-25 03:13:04
|
Hi.

I'm just learning Common Lisp now, and ABCL is one of the Lisp implementations I'm playing with. I like the idea of a Lisp running on top of the JVM. I'm not doing any serious work yet.

I'd like to ask for UTF-8 encoding support. I want to handle Japanese strings, using ABCL from SLIME.

I would also like to use CFFI, but I'm not sure how hard and/or interesting it is to implement. There is a Japanese natural language processing package written in C that is callable via CFFI. It parses sentences and breaks them up into tokens. There are no spaces within Japanese sentences, so parsing requires guessing the word boundaries by looking them up in a dictionary. It's not an easy task, and it requires a rather big software package.

Calling foreign functions requires building up the stack frame and performing a native x86 function call, so I suppose a JNI call to some custom stack-frame-building function is necessary.

Please add these as feature requests.

Hideo. |
From: Mark E. <ev...@pa...> - 2008-10-25 07:34:29
|
Note to ABCL: supporting JNA is a good way to go about supporting JNI dynamically. If we get the dynamic classpath portion of invoke.lisp working, all the CFFI stuff could theoretically be done at runtime, i.e. with no need to actually write Makefiles etc.

Tersely written from my iPod, which unfortunately likes to top-post as well. |
From: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX - 2008-10-25 12:05:22
|
Hi!

> I'm just learning common lisp now, and abcl is one of the lisp
> implementations I'm playing with. I like the idea of a lisp running
> on top of JVM. Not doing any serious work yet.

Welcome!

> I'd like to ask for utf-8 encoding support.
> I want to handle Japanese strings, using abcl from slime.

We've briefly investigated this and filed it as ticket #13. The full impact still needs to be investigated, but some of what I've seen looks promising.

> I also would like to use cffi. But I'm not sure how hard and/or
> interesting it is to implement. There is a Japanese natural language
> processing package written in C and is callable via cffi. It parses
> sentences and breaks it up into tokens. There are no spaces within
> Japanese sentences, so parsing requires guessing the word boundaries
> by looking up a dictionary. It's not an easy task and it requires a
> rather big software package.
>
> Calling foreign functions require building up the stack frame and
> performing a native x86 function call, so I suppose a JNI call to
> some custom stack frame building function is necessary.

I've filed this as ticket #14, including Mark Evenson's suggestion to use JNA. I didn't know JNA, but from the project page, it looks like a promising solution.

> Please add these as feature requests.

Yes, I did. If you have any Java skills, maybe you could come to the #abcl IRC channel on irc.freenode.org; we could probably work out the required patch for issue #13 together. That would make for a much faster solution.

> Hideo.

Thanks for your reaction!

Bye,
Erik. |
From: Hideo at Y. <hid...@gm...> - 2008-10-25 17:07:23
|
I took a look at the tickets. Thanks for adding them!

For #13, I looked at FileStream.java. If _setFilePosition is required even for character streams, I think there is no easy and safe way to implement handling of multiple encodings. But since ABCL can read from network connections, on which you can't do random access, I believe there is a way to keep the seeking requirement out of the way of character (non-binary) streams.

For #14, I didn't know JNA either, but essentially it seems to be just what I thought to be necessary in my first mail: a generic JNI stub that builds a stack frame and calls an arbitrary C function. Since it builds on top of JNI, you should be able to use it with existing JRE/JDK editions. If you do use JNA, it would become a prerequisite for the yet-to-be-seen FFI for ABCL.

BTW, the Trac deployment is barely customized yet. Clicking on the Trac logo takes you away to the Edgewall site. If you have a logo image for the ABCL project, you might want to put it there and customize Trac so that the logo links to the top page of this Trac deployment.

Regards,
Hideo. |
From: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX - 2008-11-04 20:35:39
|
2008/11/1 Hideo at Yokohama <hid...@gm...>:
> I took a look at the tickets. Thanks for adding them!
>
> For #13 I looked at FileStream.java. If the _setFilePosition is
> required even for character streams, I think there is no easy and
> safe way to implement multiple encoding handling.
> But since abcl can read from network connections, on which you can't
> do random access, I believe there is a way to keep the seeking
> requirement out of the way of character (non-binary) streams.

I added comments to ticket #13; I think we could use FileInputStream: it supports a SeekableChannel interface which allows getting and setting the file position. What do you think about that solution?

Bye,
Erik. |
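For readers following the FileInputStream suggestion: the channel that java.io.FileInputStream hands out through getChannel() is a java.nio.channels.FileChannel, and its position() methods get and set the file position in bytes. The following is only a minimal sketch of that byte-level behaviour; the class and method names (ChannelPositionDemo, byteAt) are made up for illustration and are not taken from the ABCL source.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;

// Hypothetical demo class, not part of ABCL.
final class ChannelPositionDemo {

    /** Writes data to a temp file, seeks to a byte offset through the channel, and returns the next byte. */
    static int byteAt(byte[] data, long offset) {
        try {
            File tmp = File.createTempFile("abcl-channel-demo", ".bin");
            tmp.deleteOnExit();
            try (FileOutputStream out = new FileOutputStream(tmp)) {
                out.write(data);
            }
            try (FileInputStream in = new FileInputStream(tmp)) {
                FileChannel channel = in.getChannel();
                channel.position(offset); // set the file position, counted in bytes
                // The stream and its channel share one file position,
                // so this read continues from the offset set above.
                return in.read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(byteAt(new byte[] {10, 20, 30}, 1));
    }
}
```

Note that this works on raw bytes only; nothing here knows about character encodings.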
From: Hideo at Y. <hid...@gm...> - 2008-11-05 15:26:20
|
Hi. Thanks for taking the time to investigate.

Are you talking about java.io.FileInputStream? Or is it a Lisp thing? I've never heard of a SeekableChannel, and could not grep it in the ABCL source. (SeekableChannel seems to be a C++ thing, according to Google.)

If you are talking about the one in the JDK, here is my comment:

Character code conversion is implemented in InputStreamReader and OutputStreamWriter. To support multi-byte encodings, the converters in both directions have to maintain some state (i.e. they must have some amount of buffer). So you can't safely seek the underlying stream to a different position.

As a result, JDK Readers and Writers (which are character oriented rather than byte oriented) don't provide seeking functions. For java.io.Reader classes, the closest thing to seek is the 'mark' functionality. You can tell the stream to 'mark' the current position, i.e. remember the current file position, then read some data, then tell the stream to rewind to the position that was marked. The mark function doesn't tell you what the current byte offset is. It just remembers. You can't give an arbitrary integer to a Reader and tell it to seek to that position.

In the Lisp world, with my limited Lisp knowledge, I guess the safe way to go is to make seeking functionality available only to files that are accessed as raw byte streams. In all other cases, make the seek functions signal an error. Seek is rather hard to use. It appears in programs that are aware of the binary data layout in files. That's not something that everyone does.

I am curious how the other Lisp implementations handle this.

Cheers,
Hideo |
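The mark/reset behaviour described above can be sketched with the standard java.io.BufferedReader. This is an illustration only; the class and method names (MarkResetDemo, readTwice) are invented for the example. It shows that a Reader can rewind to a remembered spot, while never exposing a numeric offset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class illustrating Reader.mark()/reset().
final class MarkResetDemo {

    /** Reads n characters, rewinds to the mark, and reads n characters again. */
    static String[] readTwice(String text, int n) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            reader.mark(n);                    // remember the current spot; valid for up to n chars of read-ahead
            char[] first = new char[n];
            int got1 = reader.read(first, 0, n);
            reader.reset();                    // rewind to the marked spot
            char[] second = new char[n];
            int got2 = reader.read(second, 0, n);
            // Assumes text holds at least one character, so neither read returns -1.
            return new String[] { new String(first, 0, got1), new String(second, 0, got2) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] reads = readTwice("\u65e5\u672c\u8a9e text", 3);
        System.out.println(reads[0].equals(reads[1]));
    }
}
```

There is no method on Reader that returns the marked position as a number, which is the limitation the message points out.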
From: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX - 2008-11-05 20:26:28
|
On Wed, Nov 5, 2008 at 4:27 PM, Hideo at Yokohama <hid...@gm...> wrote:
> Hi. Thanks for your time investigating.
>
> Are you talking about java.io.FileInputStream ? Or is it a lisp thing?
> I've never heard of a SeekableChannel, and could not grep it in the
> abcl source.
> (SeekableChannel seems to be a C++ thing, according to Google)

I see I was talking about SeekableByteChannel, which is in the JDK (as of 1.4, I think).

> If you are talking about the one in JDK, here is my comment :
>
> Character code conversion is implemented in InputStreamReader and
> OutputStreamWriter.
> To support multiple byte encodings both directions of converters have
> to maintain some state (i.e. it must have some amount of buffer).
> So you can't safely seek the underlying stream to a different position.

Hrm. OK. I understand what you're saying. It's a bit disappointing, but I guess it works for the Java world.

There's possibly one other option, though: record how many characters have been read, using a filtering stream, and implement a .skip() function which skips exactly the required number of characters.

> As a result, JDK Readers and Writers (which are character oriented,
> rather than byte oriented) don't provide seeking functions. For
> java.io.Reader classes, the closest thing to seek is the 'mark'
> functionality. You can tell the stream to 'mark' the current position,
> i.e. remember the current file pos, then read some data, then tell the
> stream to rewind to the position that was marked.
> The mark function doesn't tell you what the current byte offset is.
> It just remembers.
> You can't give an arbitrary integer to a Reader and tell it to seek to
> that position.
>
> In the lisp world, with my limited lisp knowledge, I guess the safe
> way to go is to make seeking functionality available only to files
> that are accessed as raw byte streams.

I'm not sure how other Lisps do it, but I think they may just count the file position in bytes.

> In all other cases, make the seek functions cause an error. Seek is
> rather hard to use.
> They appear in programs that are aware of the binary data layout in
> files. That's not something that everyone does.
>
> I am curious how the other lisp implementations handle this.

Thanks for your input. If we can't do any better than what we just discussed, how about doing it this way:

- Leave most of FileStream intact, meaning it'll still be based on RandomAccessFile
- Use RandomAccessFile.getFD() to bind an InputStream/OutputStream to the file
- Use InputStreamReader and OutputStreamWriter to read/write character data on the streams.

We'll be able to seek() on the random access file only as long as the streams and reader/writers haven't been used. (Possibly, we can detect this by not creating the reader/writer until actual input is required.)

What would you say to this strategy? Will it work? Or not?

Bye,
Erik. |
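The three-step strategy proposed above could look roughly like the sketch below, for the read side only. This is a guess at the shape of the idea, not code from ABCL's FileStream.java; the class name LazyCharacterFile and its methods are hypothetical. The reader is created lazily, and seeking is refused once it exists, because the InputStreamReader may already have buffered bytes beyond the last character handed out.

```java
import java.io.Closeable;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch of the proposal; not ABCL's actual FileStream.
final class LazyCharacterFile implements Closeable {
    private final RandomAccessFile raf;
    private final Charset charset;
    private Reader reader; // stays null until the first character is requested

    LazyCharacterFile(File file, Charset charset) throws IOException {
        this.raf = new RandomAccessFile(file, "r");
        this.charset = charset;
    }

    /** Seeks to a byte offset; only allowed before any character has been read. */
    void seek(long bytePos) throws IOException {
        if (reader != null) {
            throw new IllegalStateException("seek is not allowed once character input has started");
        }
        raf.seek(bytePos);
    }

    /** Reads one character, creating the decoding reader on first use. */
    int read() throws IOException {
        if (reader == null) {
            // The stream shares the file descriptor, and so the file position, with raf.
            reader = new InputStreamReader(new FileInputStream(raf.getFD()), charset);
        }
        return reader.read();
    }

    @Override
    public void close() throws IOException {
        raf.close();
    }

    /** Demo: seek past three ASCII bytes, read one character, then try to seek again. */
    static String demo() {
        try {
            File tmp = File.createTempFile("abcl-lazy-demo", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), "abc\u65e5\u672c".getBytes(StandardCharsets.UTF_8));
            try (LazyCharacterFile f = new LazyCharacterFile(tmp, StandardCharsets.UTF_8)) {
                f.seek(3);
                char c = (char) f.read();
                try {
                    f.seek(0);
                    return c + ":seek allowed";
                } catch (IllegalStateException e) {
                    return c + ":seek rejected";
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The obvious cost of this design is that a stream becomes non-seekable after the first character read, which is the trade-off the proposal accepts.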
From: Hideo at Y. <hid...@gm...> - 2008-11-06 00:06:10
|
Hi. Continuing on the stream encoding issue.

When I get some time, I'd like to look at what other Lisps do and what the ANSI spec says, but for now I will write based on my Java knowledge and experience in general.

> I see I was talking about SeekableByteChannel, which is in the JDK (as
> of 1.4, I think).

OK. I didn't know that one either... The JDK has a lot of bloat.

> There's one other option though, possibly: record how many characters
> have been read, using a filtering stream. and implementing a .skip()
> function which skips exactly the number of required characters.

I think this plan won't work well, for a few reasons.

1. The number of bytes per character varies from character to character. Japanese text (or text in any other Asian character set) typically has its local (Japanese) characters mixed with ASCII-range characters. So you can't jump to a file position that is specified in terms of a character count. You actually have to check all the bytes from the beginning of the file up to the specified position. Even for a backward seek (a rewinding seek), recording how many bytes you have read is not enough. You need to see all the bytes and detect all the character boundaries within the byte stream.

2. Keeping in mind that we want to use the JDK InputStreamReader and OutputStreamWriter, you can't make them discard the content of their buffers. They won't tell you how many bytes have been consumed within their buffers either. The minimum amount of buffer required to implement those streams is a couple of bytes, but they might have a big buffer, like 8k, and consume that buffer little by little. In that case the position in the underlying stream might be quite different from the position in the converter stream.

3. For that strategy to work, we at least need an encoding converter where the underlying file position is visible. AFAIK, you can't do that on top of JDK streams, so you are on your own to do the actual encoding conversions.

> Thanks for your input. If we can't do any better than what we just
> discussed, how about doing it this way:
>
> - Leave most of FileStream intact, meaning it'll still be based on
> RandomAccessFile

Sounds OK.

> - Use RandomAccessFile.getFD() to bind an InputStream/OutputStream to
> the file

I'm not sure about this one. I haven't used RandomAccessFile at all. Probably it's OK.

> - Use InputStreamReader and OutputStreamWriter to read/write character
> data on the streams.
>
> Only in case the streams and reader/writers haven't been used, we'll
> be able to seek() on the random access file. (Possibly, we can detect
> this by only creating the reader/writer until actual input is
> required.)

Sounds OK to me.

Cheers,
Hideo. |
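Point 1 above (bytes per character vary, so a character count does not map to a file position) can be made concrete with a small sketch. It assumes UTF-8 as the encoding, which is one of several encodings used for Japanese text; the class and method names (Utf8Offsets, byteOffsets) are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; UTF-8 is assumed purely for illustration.
final class Utf8Offsets {

    /** Returns the byte offset at which each character (code point) of text starts in UTF-8. */
    static int[] byteOffsets(String text) {
        int[] codePoints = text.codePoints().toArray();
        int[] offsets = new int[codePoints.length];
        int bytePos = 0;
        for (int i = 0; i < codePoints.length; i++) {
            offsets[i] = bytePos;
            // Encode this one code point to find out how many bytes it occupies.
            String single = new String(Character.toChars(codePoints[i]));
            bytePos += single.getBytes(StandardCharsets.UTF_8).length;
        }
        return offsets;
    }

    public static void main(String[] args) {
        // 'a' takes 1 byte and hiragana 'a' (U+3042) takes 3 bytes in UTF-8.
        System.out.println(Arrays.toString(byteOffsets("a\u3042b")));
    }
}
```

Finding the offset of the n-th character requires walking every character before it, which is why a plain character counter is not enough to implement a seek.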
From: Hideo at Y. <hid...@gm...> - 2008-11-07 16:26:59
Attachments:
SeekableWriter.java
SeekableReader.java
|
Hi.

I wrote a pair of java.io.Reader and java.io.Writer subclasses that wrap around a java.io.RandomAccessFile, do encoding/decoding, and are seekable.

You can get the current position in the file, as well as set the position.

    long position();
    void position(long newPosition);

These methods will first flush internal buffers so the file position will be accurate.

You can use these classes for files, and use the standard InputStreamReader/OutputStreamWriter pair for socket streams.

I think these can be incorporated into the ABCL streams. Please take a look when you have time.

Regards,
Hideo |
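The attached SeekableReader.java is not reproduced in this archive. As a rough idea of what a seekable, decoding Reader over a RandomAccessFile can look like, here is an independent and deliberately naive sketch; it is not the attached code. It avoids the buffer-flushing problem by never reading ahead: it pulls one byte at a time, so the file pointer is always at a character boundary between reads. All names here (ByteAtATimeReader and its demo helpers) are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical sketch; NOT the SeekableReader.java attached to the message.
final class ByteAtATimeReader extends Reader {
    private final RandomAccessFile raf;
    private final CharsetDecoder decoder;
    private final ByteBuffer bytes = ByteBuffer.allocate(8); // undecoded bytes of the current character
    private final CharBuffer chars = CharBuffer.allocate(2); // decoded chars not yet handed out

    ByteAtATimeReader(RandomAccessFile raf, Charset charset) {
        this.raf = raf;
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        chars.limit(0); // start with nothing decoded
    }

    /** Current byte offset; meaningful between characters because nothing is read ahead. */
    long position() throws IOException { return raf.getFilePointer(); }

    /** Seeks to a byte offset and throws away any partial decoder state. */
    void position(long newPosition) throws IOException {
        raf.seek(newPosition);
        decoder.reset();
        bytes.clear();
        chars.clear();
        chars.limit(0);
    }

    private int readOne() throws IOException {
        while (!chars.hasRemaining()) {
            int b = raf.read();
            if (b < 0) return -1; // end of file (a truncated trailing sequence is dropped)
            chars.clear();
            bytes.put((byte) b);
            bytes.flip();
            decoder.decode(bytes, chars, false); // may produce nothing until the character is complete
            bytes.compact();
            chars.flip();
        }
        return chars.get();
    }

    @Override public int read(char[] cbuf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = readOne();
            if (c < 0) break;
            cbuf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    @Override public void close() throws IOException { raf.close(); }

    private static ByteAtATimeReader open(String text) throws IOException {
        File tmp = File.createTempFile("abcl-seekable-demo", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), text.getBytes(StandardCharsets.UTF_8));
        return new ByteAtATimeReader(new RandomAccessFile(tmp, "r"), StandardCharsets.UTF_8);
    }

    /** Demo: the byte position reported after each char of text is read back. */
    static long[] positionsAfterEachChar(String text) {
        try (ByteAtATimeReader r = open(text)) {
            long[] pos = new long[text.length()];
            for (int i = 0; i < pos.length; i++) { r.read(); pos[i] = r.position(); }
            return pos;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    /** Demo: the char read after seeking to the given byte offset. */
    static int charAtByteOffset(String text, long offset) {
        try (ByteAtATimeReader r = open(text)) {
            r.position(offset);
            return r.read();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

Reading a byte at a time is slow; a practical version would buffer and track how many buffered bytes have been consumed, which is presumably where the flushing described in the message comes in.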
From: Hideo at Y. <hid...@gm...> - 2008-11-07 23:10:48
|
I have a couple of things to add.

Before I wrote these classes, I tried Allegro Common Lisp to see what happens when you seek on a character file. The seek operation worked; the position was interpreted in terms of bytes rather than characters. When you seek to a non-character boundary, the read operation returns a bogus character rather than signaling an error.

After I wrote these classes, I realized that random reads and writes would be used in combination. A Reader and Writer pair that communicates well would be needed to do proper buffering. You should be able to write to the file, seek a little bit, then read the file safely.

So I'm going to try writing such a set of reader and writer, something like this:

    public class RandomAccessCharacterFile {
        public RandomAccessCharacterFile(RandomAccessFile f, String encoding) { ... }
        public Reader getReader() { ... }
        public Writer getWriter() { ... }
        public long position() { ... }
        public void position(long newPos) { ... }
        public void close() { ... }
    }

On Sat, 08 Nov 2008 01:27:45 +0900, Hideo at Yokohama <hid...@gm...> wrote:

> Hi.
>
> I wrote a pair of java.io.Reader and java.io.Writer subclasses that
> wraps around a java.io.RandomAccessFile, does encoding/decoding, and
> is seekable.
>
> You can get the current position in the file, as well as setting the
> position.
>
> long position();
> void position(long newPosition);
>
> These methods will first flush internal buffers so the file position
> will be accurate.
>
> You can use these classes for files, and use the standard
> InputStreamReader/OutputStreamWriter pair for socket streams.
>
> I think these can be incorporated to the abcl streams.
> Please take a look when you have time.
>
> Regards,
> Hideo
>
> On Thu, 06 Nov 2008 09:06:50 +0900, Hideo at Yokohama
> <hid...@gm...> wrote:
>
>> Hi. Continuing on the stream encoding issue.
>> >> When I get some time, I'd like to look what other lisps do, what the >> ansi spec says, >> but now I will write based on my Java knowledge and experience in >> general. >> >>> I see I was talking about SeekableByteChannel, which is in the JDK (as >>> of 1.4, I think). >> >> OK. I didn't know that one either... JDK has a lot of bloat. >> >>> There's one other option though, possibly: record how many characters >>> have been read, using a filtering stream. and implementing a .skip() >>> function which skips exactly the number of required characters. >> >> I think this plan won't work well for a couple of reasons. >> 1. The bytes-per-character varies from character to character. >> Japanese text, >> --or text in any other Asian character set-- typically have their local >> (Japanese) >> characters mixed with ascii range characters. So you can't jump to a >> file >> position that is specified in terms of character count. You actually >> have >> to check all the bytes starting from the beginning of file upto the >> position specified. >> Even for a backward seek (a rewinding seek), recording how many bytes >> you have read >> is not enough. You need to see all the bytes and detect all the >> character boundaries >> within the byte stream. >> >> 2. Keeping in mind that we want to use JDK InputStreamReader and >> OutputStreamWriter, >> you can't make them discard the content of their buffers. They wont >> tell you >> how many bytes have been consumed within their buffers too. The >> minimum amount >> of buffer required to implement those streams are a couple of bytes, >> but they might have >> a big buffer, like 8k, and consume that buffer little by little. In >> that case >> the position in the underlying stream might be quite different from the >> position >> in the converter stream. >> >> 3. For that strategy to work, we at least need a encoding converter >> where the >> underlying file pos is visible. 
AFAIK, you can't do that on top of JDK >> streams, >> so you are on your own to do the actual encoding conversions. >> >>> Thanks for your input. If we can't do any better than what we just >>> discussed, how about doing it this way: >>> >>> - Leave most of FileStream intact, meaning it'll still be based on >>> RandomAccessFile >> >> Sounds ok. >> >>> - Use RandomAccessFile.getFD() to bind an InputStream/OutputStream to >>> the file >> >> I'm not sure about this one. I haven't used RandomAccessFile at all. >> Probably it's ok. >> >>> - Use InputStreamReader and OutputStreamWriter to read/write character >>> data on the streams. >>> >>> Only in case the streams and reader/writers haven't been used, we'll >>> be able to seek() on the random access file. (Possibly, we can detect >>> this by only creating the reader/writer until actual input is >>> required.) >> >> Sounds ok to me. >> >> Cheers, >> Hideo. >> >> >> On Thu, 06 Nov 2008 05:26:18 +0900, XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX >> wrote: >> >>> On Wed, Nov 5, 2008 at 4:27 PM, Hideo at Yokohama >>> <hid...@gm...> wrote: >>>> Hi. Thanks for your time investigating. >>>> >>>> Are you talking about java.io.FileInputStream ? Or is it a lisp thing? >>>> I've never heard of a SeekableChannel, and could not grep it in the >>>> abcl >>>> source. >>>> (SeekableChannel seems to be a C++ thing, according to Google) >>> >>> I see I was talking about SeekableByteChannel, which is in the JDK (as >>> of 1.4, I think). >>> >>>> >>>> If you are talking about the one in JDK, here is my comment : >>>> >>>> Character code conversion is implemented in InputStreamReader and >>>> OutputStreamWriter. >>>> To support multiple byte encodings both directions of converters have >>>> to >>>> maintain some >>>> state (i.e. it must have some amount of buffer). >>>> So you cant safely seek the underlying stream to a different position. >>> >>> hrm. ok. I understand what you're saying. 
It's a bit disappointing, >>> but I guess it works for the Java world. >>> >>> There's one other option though, possibly: record how many characters >>> have been read, using a filtering stream. and implementing a .skip() >>> function which skips exactly the number of required characters. >>> >>>> As a result, JDK Readers and Writers (which are character oriented, >>>> rather >>>> than byte oriented) >>>> don't provide seeking functions. For java.io.Reader classes, the >>>> closest >>>> thing to seek is the >>>> 'mark' functionality. You can tell the stream to 'mark' the current >>>> position, i.e. remember the >>>> current file pos, then read some data, then tell the stream to rewind >>>> to the >>>> position that was marked. >>>> The mark function doesn't tell you what the current byte offset is. >>>> It just >>>> remembers. >>>> You can't give an arbitrary integer to a Reader and tell it to seek >>>> to that >>>> position. >>>> >>>> In the lisp world, with my limited lisp knowledge, I guess the safe >>>> way to >>>> go is to >>>> make seeking functionality available only to files that are accessed >>>> as raw >>>> byte streams. >>> >>> I'm not sure how other lisps do it, but I think they may just count >>> the file position in bytes. >>> >>>> In all other cases, make the seek functions cause an error. Seek is >>>> rather >>>> hard to use. >>> >>>> They appear in programs that are aware of the binary data layout in >>>> files. >>>> That's not something that everyone does. >>>> >>>> I am curious how the other lisp implementations handle this. >>> >>> Thanks for your input. If we can't do any better than what we just >>> discussed, how about doing it this way: >>> >>> - Leave most of FileStream intact, meaning it'll still be based on >>> RandomAccessFile >>> - Use RandomAccessFile.getFD() to bind an InputStream/OutputStream to >>> the file >>> - Use InputStreamReader and OutputStreamWriter to read/write character >>> data on the streams. 
>>> >>> Only in case the streams and reader/writers haven't been used, we'll >>> be able to seek() on the random access file. (Possibly, we can detect >>> this by only creating the reader/writer until actual input is >>> required.) >>> >>> What would you say to this strategy? Will it work? Or not? >>> >>> Bye, >>> >>> Erik. >>> >>>> Cheers, >>>> Hideo >>>> >>>> On Wed, 05 Nov 2008 05:35:28 +0900, XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX >>>> wrote: >>>> >>>>> 2008/11/1 Hideo at Yokohama <hid...@gm...>: >>>>>> >>>>>> I took a look at the tickets. Thanks for adding them! >>>>>> >>>>>> For #13 I looked at FileStream.java . If the _setFilePosition is >>>>>> required >>>>>> even for >>>>>> character streams, I think there is no easy and safe way to >>>>>> implement >>>>>> multiple >>>>>> encoding handling. >>>>>> But since abcl can read from network connections, on which you >>>>>> can't do >>>>>> random access, >>>>>> I believe there is a way to keep the seeking requirement out of the >>>>>> way >>>>>> of >>>>>> character (non-binary) streams. >>>>> >>>>> I added comments to ticket # 13; I think we could use >>>>> FileInputStream: >>>>> it supports a SeekableChannel interface which allows getting and >>>>> setting of the file position. What would you think about that >>>>> solution? >>>>> >>>>> Bye, >>>>> >>>>> Erik. >>>> >> >> >> > > > -- Opera の革新的メールクライアント: http://jp.opera.com/mail/ |
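Two points from the message above can be reproduced with the standard JDK classes: starting to decode at a byte offset that falls inside a multi-byte character yields a replacement character instead of an error, and an InputStreamReader pulls more bytes from its source than the one character it returns. The following is a minimal sketch, not code from the thread; the class name `ByteSeekDemo` and its method names are made up for illustration, and the read-ahead amount is an implementation detail of the JDK in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustration only: the class and method names are hypothetical.
final class ByteSeekDemo {

    /** Starts decoding UTF-8 at byteOffset and returns the first character read. */
    public static int firstCharFrom(byte[] utf8, int byteOffset) {
        ByteArrayInputStream in =
                new ByteArrayInputStream(utf8, byteOffset, utf8.length - byteOffset);
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one character, then reports how many bytes were taken from the source. */
    public static int bytesPulledForOneChar(byte[] utf8) {
        ByteArrayInputStream in = new ByteArrayInputStream(utf8);
        try (Reader r = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            r.read();
            return utf8.length - in.available();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // 'a' is 1 byte in UTF-8; U+65E5 and U+672C are 3 bytes each.
        byte[] data = "a\u65E5\u672C".getBytes(StandardCharsets.UTF_8);
        // Offset 1 is a character boundary, offset 2 is inside U+65E5.
        System.out.println(Integer.toHexString(firstCharFrom(data, 1)));
        System.out.println(Integer.toHexString(firstCharFrom(data, 2)));
        System.out.println(bytesPulledForOneChar(data) + " of " + data.length
                + " bytes pulled for one character");
    }
}
```

Because the reader has already consumed bytes beyond the character it returned, the position of the underlying stream cannot be used as the position of the character stream, which is the problem described in reason 2.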
From: Hideo at Y. <hid...@gm...> - 2008-11-08 05:27:28
|
I wrote the classes that I mentioned below.
A charset-aware, seekable, Reader-Writer combo.
It is essentially a ByteBuffer with a custom Reader and Writer attached to it.
I tried to make it as simple as possible, but I had to introduce some
state variables. A little bit fragile logic.
You can mix operations on the Reader and Writer, and you can seek. [...]
|
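The classes attached to this message are not reproduced in the archive. Purely as an illustration of the idea of a decoder whose underlying file position stays visible, here is a hypothetical minimal reader that feeds a CharsetDecoder one byte at a time from a RandomAccessFile. Everything in it (the name `UnbufferedCharReader`, the `demo` helper, the 16-byte holding buffer) is an assumption made for this sketch, and it is far simpler than a buffered Reader/Writer pair would need to be.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical illustration; NOT the classes attached to the message above.
final class UnbufferedCharReader {
    private final RandomAccessFile file;
    private final CharsetDecoder decoder;
    private final ByteBuffer pending = ByteBuffer.allocate(16); // bytes read, not yet decoded
    private final CharBuffer decoded = CharBuffer.allocate(2);  // room for a surrogate pair

    UnbufferedCharReader(RandomAccessFile file, Charset charset) {
        this.file = file;
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        decoded.flip(); // nothing decoded yet
    }

    /** Byte offset of the next undecoded byte in the file. */
    long position() throws IOException {
        return file.getFilePointer() - pending.position();
    }

    /** Seeks to a byte offset and discards all decoder state. */
    void position(long newPos) throws IOException {
        file.seek(newPos);
        pending.clear();
        decoded.clear();
        decoded.flip();
        decoder.reset();
    }

    /** Returns the next char, or -1 at end of file (a trailing partial sequence is dropped). */
    int read() throws IOException {
        while (!decoded.hasRemaining()) {
            int b = file.read();
            if (b < 0) {
                return -1;
            }
            pending.put((byte) b);
            pending.flip();
            decoded.clear();
            decoder.decode(pending, decoded, false); // may need more bytes first
            pending.compact();
            decoded.flip();
        }
        return decoded.get();
    }

    private static String step(UnbufferedCharReader r) throws IOException {
        return Integer.toHexString(r.read()) + ":" + r.position();
    }

    /** Writes a small UTF-8 temp file and logs "char:position" after each read. */
    public static String demo() {
        try {
            File tmp = File.createTempFile("seek-demo", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), "a\u65E5\u672C".getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
                UnbufferedCharReader r = new UnbufferedCharReader(raf, StandardCharsets.UTF_8);
                StringBuilder log = new StringBuilder(step(r)); // 'a', one byte
                log.append(' ').append(step(r));                // U+65E5, three bytes
                r.position(1);                                  // back to a character boundary
                log.append(' ').append(step(r));
                r.position(2);                                  // inside U+65E5
                log.append(' ').append(step(r));
                return log.toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Reading a byte at a time is slow, which is presumably why a real implementation would manage its own buffer and flush it before reporting or changing the position. The sketch only shows that position() and read() can stay consistent when the decoder never holds bytes the caller does not know about.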
From: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX - 2008-11-14 15:42:10
|
Hi Hideo,

I haven't forgotten your submission; however, I got stuck on some other work
temporarily. I do have a question though: which license do you want your
submission to have? If you don't mind me pointing it out, choosing
GNU/Classpath [which is the same as the rest of ABCL] would be very handy.

Bye,

Erik.

On Sat, Nov 8, 2008 at 6:28 AM, Hideo at Yokohama <hid...@gm...> wrote:
> I wrote the classes that I mentioned below.
> A charset-aware, seekable, Reader-Writer combo.
> It is essentially a ByteBuffer with a custom Reader and Writer attached to
> it.
> I tried to make it as simple as possible, but I had to introduce some
> state variables. A little bit fragile logic.
> You can mix operations on the Reader and Writer, and you can seek.
>
> Since I'm not familiar with the abcl internals, this is not a patch to abcl.
> It is just a couple of independently written classes that you could
> incorporate into abcl.
>
> I did some very simple tests with a couple of UTF-8 Japanese files.
>
> Hope this helps the support for multiple encodings.
>
> Hideo.
|
From: Hideo at Y. <hid...@gm...> - 2008-11-14 16:05:57
|
Hi. Good to hear a response.

For the license, GNU/Classpath is ok with me.

On my part, I've made some experimental changes to see what happens if UTF-8
is passed to abcl. I've noticed something that might be troublesome.

I tried this:
(1) Replace the hardcoded "ISO-8859-1" in Stream.java with "UTF-8".
(2) Change swank-abcl.lisp so that it accepts 'utf-8-unix as the
    :coding-system parameter.

With just those two modifications I ran slime, and it worked fine.
Strings with Japanese text passed from slime were parsed and printed by abcl.
However, the length function returned the number of bytes rather than the
number of characters in the string. (In most cases a Japanese character is
encoded as 3 bytes in UTF-8.) In Allegro Common Lisp, the number of characters
was returned.
I haven't located where string objects get created within abcl, so I don't
know the cause of this yet.

I have also been looking at the code of Stream.java, FileStream.java and
Socket.java. I tried to modify it to make it accept the external-format
argument, but I haven't succeeded yet. The main question is: what should I
compare the LispObject that holds the external-format parameter with?

Cheers,
Hideo.
|
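The byte-count symptom described above is what one would see if UTF-8 bytes were turned into a string one byte per character somewhere along the way. Whether that is what happens inside abcl is not established in this thread; the sketch below only shows the general effect in plain Java, and the class name `DecodeLengthDemo` is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: shows the general effect, not a diagnosis of abcl.
final class DecodeLengthDemo {

    /** Length of the string obtained by decoding the bytes with the given charset. */
    public static int decodedLength(byte[] bytes, Charset charset) {
        return new String(bytes, charset).length();
    }

    public static void main(String[] args) {
        // Three Japanese characters, each 3 bytes in UTF-8.
        byte[] utf8 = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);
        // Decoding with the right charset gives the character count.
        System.out.println(decodedLength(utf8, StandardCharsets.UTF_8));
        // ISO-8859-1 maps every byte to one character, so this gives the byte count.
        System.out.println(decodedLength(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

If a length equal to the byte count shows up, one place to look would be any code path that still builds strings byte by byte or with a single-byte charset after the encoding constant was changed.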
From: Hideo at Y. <hid...@gm...> - 2008-11-15 04:46:01
|
Hi again. Some follow-up. Looking at the java stream classes in abcl, I thought you might want to mix binary read/writes with character read writes. Attached is a modified version that provides an InputStream and an OutputStream as well as Reader and Writer that are all attached to the same buffer and file. You can mix bytewise and characterwise R/W/Seek operations. [...] |
From: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX - 2008-11-18 19:16:28
|
Hi! I've been working on integrating the files you provided into the FileStream.java file. I really like what I'm seeing so far; however, I would need to be able to "unread" a character. Since you're obviously much more familiar with this code than I am, could you extend the RandomAccessReader or RandomAccessCharacterFile with an 'unreading' function? Unreading puts a character back into the character buffer so that the next read returns it again. Presumably, you can dispose of that character when position() is called. I need this functionality to be in RACF (RandomAccessCharacterFile), because the character should go back into the byte buffer: if I unread a character which wasn't a character, but in fact binary data, I want to be able to get it out with readByte(). Thanks in advance for anything you can do! Bye, Erik. On Sat, Nov 15, 2008 at 5:45 AM, Hideo at Yokohama <hid...@gm...> wrote: > Hi again. Some follow-up. > > Looking at the java stream classes in abcl, I thought you might want to > mix binary read/writes with character read writes. Attached is a modified > version that provides an InputStream and an OutputStream as well as Reader > and Writer > that are all attached to the same buffer and file. > You can mix bytewise and characterwise R/W/Seek operations. > > I also found a bug in my previous version. This I/O work is pretty delicate > and should have a test suite.. I only have a small test program that > doesn't > test automatically, and a human has to stare at the results to figure if it > is working or not. So I refrain from posting that. > > Additionally I've done a bit of refactoring to eliminate a couple of > instance > variables that held buffer state. State variables with a broad scope makes > things > harder to understand. Those variables were introduced to gain a bit of > performance, > but I thought the merit wasn't worth the fragility it brings in. > > Bye, > > Hideo. > > On Sat, 15 Nov 2008 01:05:58 +0900, Hideo at Yokohama > <hid...@gm...> wrote: > >> Hi. 
>> >> Good to hear a response. >> For the license, GNU/Classpath is ok with me. >> >> On my part, I've done some experimental fixes to see what happens if UTF-8 >> is passed to abcl. I've noticed something that might be troublesome. >> >> I tried this: >> (1) Replace the hardcoded "ISO-8859-1" in Stream.java with "UTF-8". >> (2) Change swank-abcl.lisp so that it will accept 'utf-8-unix as the >> :coding-system parameter. >> >> With just those two modifications I ran slime, and it worked fine. >> Strings with Japanese text passed from slime was parsed and printed by >> abcl. >> >> However the length function returned the number of bytes, rather than >> characters of the string. >> (In most cases a Japanese character is encoded as 3 bytes in UTF-8.) >> In Allegro Common Lisp, the number of characters were returned. >> I haven't located where string objects get created within abcl, so I don't >> know the >> cause of this yet. >> >> >> I have also been looking at the code of Stream.java, FileStream.java and >> Socket.java . >> I tried to modify it to make it accept the external-format argument, but I >> haven't >> succeeded yet. Main question is, what should I compare the LispObject >> that holds >> the external-format parameter with ? >> >> Cheers, >> >> Hideo. >> >> On Sat, 15 Nov 2008 00:42:03 +0900, XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX >> wrote: >> >>> Hi Hideo, >>> >>> I haven't forgotten your submission, however, I got stuck on some >>> other work, temporarily. I do have a question though: which license do >>> you want your submission to have? If you don't mind me pointing out, >>> if you choose GNU/Classpath [which is the same as the rest of ABCL], >>> that would be very handy. >>> >>> Bye, >>> >>> Erik. >>> >>> >>> On Sat, Nov 8, 2008 at 6:28 AM, Hideo at Yokohama >>> <hid...@gm...> wrote: >>>> >>>> I wrote the classes that I mentioned below. >>>> A charset-aware, seekable, Reader-Writer combo. 
>>>> It is essentially a ByteBuffer with a custom Reader and Writer attached >>>> to >>>> it. >>>> I tried to make it as simple as possible, but I had to introduce some >>>> state variables. A little bit fragile logic. >>>> You can mix operations on the Reader and Writer, and you can seek. >>>> >>>> Since I'm not familiar to the abcl internals, this is not a patch to >>>> abcl. >>>> It is just a couple of independently written classes that you could >>>> incorporate >>>> into abcl. >>>> >>>> I done some very simple tests with a couple of UTF-8 japanese files. >>>> >>>> Hope this helps the support for multiple encodings. >>>> >>>> Hideo. >>>> >>>> On Sat, 08 Nov 2008 08:11:37 +0900, Hideo at Yokohama >>>> <hid...@gm...> wrote: >>>> >>>>> I have a couple of things to add. >>>>> >>>>> Before I wrote these classes, I tried Allegro Common Lisp to see what >>>>> happens >>>>> when you seek on a character file. >>>>> >>>>> The seek operation worked, the position was interpreted in terms of >>>>> bytes >>>>> rather than characters. When you seek to a non-character boundary, the >>>>> read operation would return a bogus character, rather than causing an >>>>> error. >>>>> >>>>> After I wrote these classes, I realized that random reads and writes >>>>> would >>>>> be used in combination. A pair of Reader and Writer that communicates >>>>> well >>>>> would be needed to do proper buffering. You should be able to write to >>>>> the >>>>> file, seek a little bit, then read the file safely. >>>>> >>>>> So I'm going to try writing such a set of reader and writer, something >>>>> like this: >>>>> >>>>> public class RandomAccessCharacterFile { >>>>> public RandomAccessCharacterFile(RandomAccessFile f, String encoding) >>>>> { >>>>> ... } >>>>> public Reader getReader() { ... } >>>>> public Writer getWriter() { ... } >>>>> public long position() { ... } >>>>> public void position(long newPos) { ... } >>>>> public void close() { ... 
} >>>>> } >>>>> >>>>> On Sat, 08 Nov 2008 01:27:45 +0900, Hideo at Yokohama >>>>> <hid...@gm...> wrote: >>>>> >>>>>> Hi. >>>>>> >>>>>> I wrote a pair of java.io.Reader and java.io.Writer subclasses that >>>>>> wraps >>>>>> around >>>>>> a java.io.RandomAccessFile, does encoding/decoding, and is seekable. >>>>>> >>>>>> You can get the current position in the file, as well as setting the >>>>>> position. >>>>>> >>>>>> long position(); >>>>>> void position(long newPosition); >>>>>> >>>>>> These methods will first flush internal buffers so the file position >>>>>> will >>>>>> be accurate. >>>>>> >>>>>> You can use these classes for files, and use the standard >>>>>> InputStreamReader/OutputStreamWriter >>>>>> pair for socket streams. >>>>>> >>>>>> I think these can be incorporated to the abcl streams. >>>>>> Please take a look when you have time. >>>>>> >>>>>> Regards, >>>>>> Hideo >>>>>> >>>>>> On Thu, 06 Nov 2008 09:06:50 +0900, Hideo at Yokohama >>>>>> <hid...@gm...> wrote: >>>>>> >>>>>>> Hi. Continuing on the stream encoding issue. >>>>>>> >>>>>>> When I get some time, I'd like to look what other lisps do, what the >>>>>>> ansi spec says, >>>>>>> but now I will write based on my Java knowledge and experience in >>>>>>> general. >>>>>>> >>>>>>>> I see I was talking about SeekableByteChannel, which is in the JDK >>>>>>>> (as >>>>>>>> of 1.4, I think). >>>>>>> >>>>>>> OK. I didn't know that one either... JDK has a lot of bloat. >>>>>>> >>>>>>>> There's one other option though, possibly: record how many >>>>>>>> characters >>>>>>>> have been read, using a filtering stream. and implementing a .skip() >>>>>>>> function which skips exactly the number of required characters. >>>>>>> >>>>>>> I think this plan won't work well for a couple of reasons. >>>>>>> 1. The bytes-per-character varies from character to character. 
>>>>>>> Japanese >>>>>>> text, >>>>>>> --or text in any other Asian character set-- typically have their >>>>>>> local >>>>>>> (Japanese) >>>>>>> characters mixed with ascii range characters. So you can't jump to a >>>>>>> file >>>>>>> position that is specified in terms of character count. You actually >>>>>>> have >>>>>>> to check all the bytes starting from the beginning of file upto the >>>>>>> position specified. >>>>>>> Even for a backward seek (a rewinding seek), recording how many bytes >>>>>>> you have read >>>>>>> is not enough. You need to see all the bytes and detect all the >>>>>>> character boundaries >>>>>>> within the byte stream. >>>>>>> >>>>>>> 2. Keeping in mind that we want to use JDK InputStreamReader and >>>>>>> OutputStreamWriter, >>>>>>> you can't make them discard the content of their buffers. They wont >>>>>>> tell you >>>>>>> how many bytes have been consumed within their buffers too. The >>>>>>> minimum >>>>>>> amount >>>>>>> of buffer required to implement those streams are a couple of bytes, >>>>>>> but they might have >>>>>>> a big buffer, like 8k, and consume that buffer little by little. In >>>>>>> that case >>>>>>> the position in the underlying stream might be quite different from >>>>>>> the >>>>>>> position >>>>>>> in the converter stream. >>>>>>> >>>>>>> 3. For that strategy to work, we at least need a encoding converter >>>>>>> where the >>>>>>> underlying file pos is visible. AFAIK, you can't do that on top of >>>>>>> JDK >>>>>>> streams, >>>>>>> so you are on your own to do the actual encoding conversions. >>>>>>> >>>>>>>> Thanks for your input. If we can't do any better than what we just >>>>>>>> discussed, how about doing it this way: >>>>>>>> >>>>>>>> - Leave most of FileStream intact, meaning it'll still be based on >>>>>>>> RandomAccessFile >>>>>>> >>>>>>> Sounds ok. 
>>>>>>> >>>>>>>> - Use RandomAccessFile.getFD() to bind an InputStream/OutputStream >>>>>>>> to >>>>>>>> the file >>>>>>> >>>>>>> I'm not sure about this one. I haven't used RandomAccessFile at all. >>>>>>> Probably it's ok. >>>>>>> >>>>>>>> - Use InputStreamReader and OutputStreamWriter to read/write >>>>>>>> character >>>>>>>> data on the streams. >>>>>>>> >>>>>>>> Only in case the streams and reader/writers haven't been used, we'll >>>>>>>> be able to seek() on the random access file. (Possibly, we can >>>>>>>> detect >>>>>>>> this by only creating the reader/writer until actual input is >>>>>>>> required.) >>>>>>> >>>>>>> Sounds ok to me. >>>>>>> >>>>>>> Cheers, >>>>>>> Hideo. >>>>>>> >>>>>>> >>>>>>> On Thu, 06 Nov 2008 05:26:18 +0900, XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX >>>>>>> wrote: >>>>>>> >>>>>>>> On Wed, Nov 5, 2008 at 4:27 PM, Hideo at Yokohama >>>>>>>> <hid...@gm...> wrote: >>>>>>>>> >>>>>>>>> Hi. Thanks for your time investigating. >>>>>>>>> >>>>>>>>> Are you talking about java.io.FileInputStream ? Or is it a lisp >>>>>>>>> thing? >>>>>>>>> I've never heard of a SeekableChannel, and could not grep it in the >>>>>>>>> abcl >>>>>>>>> source. >>>>>>>>> (SeekableChannel seems to be a C++ thing, according to Google) >>>>>>>> >>>>>>>> I see I was talking about SeekableByteChannel, which is in the JDK >>>>>>>> (as >>>>>>>> of 1.4, I think). >>>>>>>> >>>>>>>>> >>>>>>>>> If you are talking about the one in JDK, here is my comment : >>>>>>>>> >>>>>>>>> Character code conversion is implemented in InputStreamReader and >>>>>>>>> OutputStreamWriter. >>>>>>>>> To support multiple byte encodings both directions of converters >>>>>>>>> have >>>>>>>>> to >>>>>>>>> maintain some >>>>>>>>> state (i.e. it must have some amount of buffer). >>>>>>>>> So you cant safely seek the underlying stream to a different >>>>>>>>> position. >>>>>>>> >>>>>>>> hrm. ok. I understand what you're saying. It's a bit disappointing, >>>>>>>> but I guess it works for the Java world. 
>>>>>>>> >>>>>>>> There's one other option though, possibly: record how many >>>>>>>> characters >>>>>>>> have been read, using a filtering stream. and implementing a .skip() >>>>>>>> function which skips exactly the number of required characters. >>>>>>>> >>>>>>>>> As a result, JDK Readers and Writers (which are character oriented, >>>>>>>>> rather >>>>>>>>> than byte oriented) >>>>>>>>> don't provide seeking functions. For java.io.Reader classes, the >>>>>>>>> closest >>>>>>>>> thing to seek is the >>>>>>>>> 'mark' functionality. You can tell the stream to 'mark' the current >>>>>>>>> position, i.e. remember the >>>>>>>>> current file pos, then read some data, then tell the stream to >>>>>>>>> rewind >>>>>>>>> to the >>>>>>>>> position that was marked. >>>>>>>>> The mark function doesn't tell you what the current byte offset is. >>>>>>>>> It >>>>>>>>> just >>>>>>>>> remembers. >>>>>>>>> You can't give an arbitrary integer to a Reader and tell it to seek >>>>>>>>> to >>>>>>>>> that >>>>>>>>> position. >>>>>>>>> >>>>>>>>> In the lisp world, with my limited lisp knowledge, I guess the safe >>>>>>>>> way to >>>>>>>>> go is to >>>>>>>>> make seeking functionality available only to files that are >>>>>>>>> accessed >>>>>>>>> as raw >>>>>>>>> byte streams. >>>>>>>> >>>>>>>> I'm not sure how other lisps do it, but I think they may just count >>>>>>>> the file position in bytes. >>>>>>>> >>>>>>>>> In all other cases, make the seek functions cause an error. Seek >>>>>>>>> is >>>>>>>>> rather >>>>>>>>> hard to use. >>>>>>>> >>>>>>>>> They appear in programs that are aware of the binary data layout in >>>>>>>>> files. >>>>>>>>> That's not something that everyone does. >>>>>>>>> >>>>>>>>> I am curious how the other lisp implementations handle this. >>>>>>>> >>>>>>>> Thanks for your input. 
If we can't do any better than what we just >>>>>>>> discussed, how about doing it this way: >>>>>>>> >>>>>>>> - Leave most of FileStream intact, meaning it'll still be based on >>>>>>>> RandomAccessFile >>>>>>>> - Use RandomAccessFile.getFD() to bind an InputStream/OutputStream >>>>>>>> to >>>>>>>> the file >>>>>>>> - Use InputStreamReader and OutputStreamWriter to read/write >>>>>>>> character >>>>>>>> data on the streams. >>>>>>>> >>>>>>>> Only in case the streams and reader/writers haven't been used, we'll >>>>>>>> be able to seek() on the random access file. (Possibly, we can >>>>>>>> detect >>>>>>>> this by only creating the reader/writer until actual input is >>>>>>>> required.) >>>>>>>> >>>>>>>> What would you say to this strategy? Will it work? Or not? >>>>>>>> >>>>>>>> Bye, >>>>>>>> >>>>>>>> Erik. >>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Hideo >>>>>>>>> >>>>>>>>> On Wed, 05 Nov 2008 05:35:28 +0900, XXXXXXXXXXXXXX >>>>>>>>> XXXXXXXXXXXXXXXXXX >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> 2008/11/1 Hideo at Yokohama <hid...@gm...>: >>>>>>>>>>> >>>>>>>>>>> I took a look at the tickets. Thanks for adding them! >>>>>>>>>>> >>>>>>>>>>> For #13 I looked at FileStream.java . If the _setFilePosition is >>>>>>>>>>> required >>>>>>>>>>> even for >>>>>>>>>>> character streams, I think there is no easy and safe way to >>>>>>>>>>> implement >>>>>>>>>>> multiple >>>>>>>>>>> encoding handling. >>>>>>>>>>> But since abcl can read from network connections, on which you >>>>>>>>>>> can't >>>>>>>>>>> do >>>>>>>>>>> random access, >>>>>>>>>>> I believe there is a way to keep the seeking requirement out of >>>>>>>>>>> the >>>>>>>>>>> way >>>>>>>>>>> of >>>>>>>>>>> character (non-binary) streams. >>>>>>>>>> >>>>>>>>>> I added comments to ticket # 13; I think we could use >>>>>>>>>> FileInputStream: >>>>>>>>>> it supports a SeekableChannel interface which allows getting and >>>>>>>>>> setting of the file position. What would you think about that >>>>>>>>>> solution? 
>>>>>>>>>> >>>>>>>>>> Bye, >>>>>>>>>> >>>>>>>>>> Erik. >>>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>>> >>>> >>>> >>>> >>>> -- >>>> Opera の革新的メールクライアント: http://jp.opera.com/mail/ >> >> >> > > > > -- > Opera の革新的メールクライアント: http://jp.opera.com/mail/ |
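Erik's request above, that an unread character go back into the byte buffer so it can still be fetched as binary data, might look roughly like the following. This is only a sketch under stated assumptions: it uses the JDK's java.io.PushbackInputStream as a stand-in for the shared buffer, it is not the RandomAccessCharacterFile code (which is not reproduced in this thread), and the class name ByteLevelUnreadSketch and the fixed 3-byte character are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not part of ABCL or of the attached classes.
final class ByteLevelUnreadSketch {

    /**
     * Decodes one 3-byte UTF-8 character, pushes its bytes back into the
     * byte buffer, then reads the first of those bytes as raw binary data.
     * Returns { decoded code point, first raw byte }.
     */
    static int[] readCharThenRereadAsByte() {
        // Assumption for the sketch: the input is a single 3-byte character
        // (HIRAGANA LETTER A, encoded as E3 81 82 in UTF-8).
        byte[] data = "\u3042".getBytes(StandardCharsets.UTF_8);
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(data), data.length)) {
            byte[] encoded = new byte[3];
            int n = in.read(encoded, 0, encoded.length);
            String decoded = new String(encoded, 0, n, StandardCharsets.UTF_8);

            // "Unread" at the byte level: the bytes return to the buffer unchanged.
            in.unread(encoded, 0, n);

            // The same data can now be consumed as a raw byte instead.
            int firstByte = in.read();
            return new int[] { decoded.codePointAt(0), firstByte };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = readCharThenRereadAsByte();
        System.out.printf("char U+%04X, then raw byte 0x%02X%n", r[0], r[1]);
    }
}
```

Pushing back bytes rather than a decoded character keeps one source of truth for the data, which seems to be the point of putting the operation in RACF rather than in the Reader.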
From: Hideo at Y. <hid...@gm...> - 2008-11-19 16:06:37
Attachments:
RandomAccessCharacterFile.java
|
Erik, No problem. I added an unreadChar and an unreadByte method to RandomAccessCharacterFile. I didn't make changes to the Reader or InputStream, since I didn't want to make a local variation of the Reader interface. [...] |
From: Ville V. <vil...@gm...> - 2008-11-19 18:40:09
|
On Wed, Nov 19, 2008 at 6:06 PM, Hideo at Yokohama <hid...@gm...> wrote: > Option 2: You can unread something different from what you have actually > read. > You will have to answer a couple of more questions for this spec to be > unambiguous. I would say it is a can of worms... > Question 2a : What should happen if you read a character that was 3 bytes > long, but then you unread a character that is only 1 byte long ? > And what should happen in the opposite situation ? > ==> Option 2a-1: Don't care about the length change, just overwrite > the file. ==> The file may become unreadable though. I can't fathom any situation where an unread operation would ever result in writing the unread data to a file. The unread operation is for peeking at the stream and putting the peeked data back, so it's available for normal consumption by reading the stream. It's not meant to be a way to write to a file. If someone wants to write a file, fine, write to an output stream. Unreading an input stream should IMO not result in data being written to file. Having said that, AFAIK Option 1 is sufficient. I just wanted to point out that should we implement other options at some point, we probably need not worry about the file becoming corrupt, because the file should not change, even on unread. To reiterate, unread is a trick used for "peek the stream and restore it as it was", nothing else. |
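The peek-and-restore use of unread that Ville describes can be sketched with the JDK's java.io.PushbackReader. This is a hedged illustration only: the class name PeekSketch and the sample input "(foo)" are assumptions, and no ABCL stream class is involved. The point it tries to show is that unread only affects what the next read returns; nothing is written to the underlying source.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration; not ABCL code.
final class PeekSketch {

    /** Returns the next character without consuming it, or -1 at end of input. */
    static int peek(PushbackReader in) throws IOException {
        int c = in.read();
        if (c != -1) {
            in.unread(c); // put back exactly what was read; nothing is written anywhere
        }
        return c;
    }

    /** Peeks once, then reads twice, and returns the three characters seen. */
    static String demo(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            char peeked = (char) peek(in);
            char first = (char) in.read();   // same character as the peek
            char second = (char) in.read();  // the stream then continues normally
            return "" + peeked + first + second;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("(foo)"));
    }
}
```

Because peek() only ever unreads the character it has just read, this corresponds to Option 1 in the quoted message, and the length-mismatch questions of Option 2 do not arise.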
From: Hideo at Y. <hid...@gm...> - 2008-11-20 14:08:19
|
Ville, I totally agree with you. Hideo. On Thu, 20 Nov 2008 03:40:02 +0900, Ville Voutilainen <vil...@gm...> wrote: > On Wed, Nov 19, 2008 at 6:06 PM, Hideo at Yokohama > <hid...@gm...> wrote: >> Option 2: You can unread something different from what you have >> actually >> read. >> You will have to answer a couple of more questions for this spec to be >> unambiguous. I would say it is a can of worms... >> Question 2a : What should happen if you read a character that was 3 >> bytes >> long, but then you unread a character that is only 1 byte long ? >> And what should happen in the opposite situation ? >> ==> Option 2a-1: Don't care about the length change, just overwrite >> the file. ==> The file may become unreadable >> though. > > I can't fathom any situation where an unread operation would ever > result in writing > the unread data to a file. The unread operation is for peeking the > stream and putting > the peek data back, so it's available for normal consumption by > reading the stream. > It's not meant to be a way to write to a file. If someone wants to > write a file, fine, > write to an output stream. Unreading an input stream should IMO not > result > in data being written to file. > > Having said that, AFAIK the Option 1 is sufficient. I just wanted to > point out that > should we implement other options at some point, we probably need not > worry > about the file becoming corrupt, because the file should not change, > even on > unread. To reiterate, unread is a trick used for "peek the stream and > restore it > as it were", nothing else. |