Erik,
No problem. I added an unreadChar and an unreadByte method to
RandomAccessCharacterFile. I didn't make changes to the Reader or
InputStream, since I didn't want to create a local variation of the
Reader interface.
I could implement this feature under two different specs, each with
its own trade-offs. I chose the spec that I believe is reasonable.
Let me know if you prefer the other one.
Option 1: You can only unread the character that you have actually read.
With this option, the only side effect of unreadChar is a change in
buffer position. This doesn't have any of the problems that Option 2
has. I implemented this option.
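
To illustrate Option 1, here is a minimal sketch, not the actual RandomAccessCharacterFile code. It assumes the whole file fits in one in-memory ByteBuffer, that the data is well-formed UTF-8, and that only one character can be unread at a time. The class name UnreadCharSketch and its methods are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of Option 1: unreading only moves the buffer position back.
final class UnreadCharSketch {
    private final ByteBuffer buf;
    private int lastCharBytes = 0; // byte length of the most recently read character

    UnreadCharSketch(byte[] utf8Data) {
        buf = ByteBuffer.wrap(utf8Data);
    }

    // Length of a UTF-8 sequence, judged from its lead byte (assumes well-formed input).
    private static int utf8Length(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }

    // Reads one character and returns its code point, or -1 at the end of the buffer.
    int readChar() {
        if (!buf.hasRemaining()) return -1;
        int len = utf8Length(buf.get(buf.position())); // absolute get: position unchanged
        byte[] bytes = new byte[len];
        buf.get(bytes);                                // relative get: position advances
        lastCharBytes = len;
        return new String(bytes, StandardCharsets.UTF_8).codePointAt(0);
    }

    // Rewinds over the character that was just read. Nothing is written to the buffer.
    void unreadChar() {
        if (lastCharBytes == 0) throw new IllegalStateException("nothing to unread");
        buf.position(buf.position() - lastCharBytes);
        lastCharBytes = 0;
    }

    int position() {
        return buf.position();
    }
}
```

Because the buffer content is never modified, the buffer cannot become dirty as a result of an unread, which is why none of the Option 2 questions arise here.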
Option 2: You can unread something different from what you have actually
read.
You will have to answer a couple more questions for this spec to be
unambiguous. I would say it is a can of worms...

Question 2a: What should happen if you read a character that was 3
  bytes long, but then you unread a character that is only 1 byte long?
  And what should happen in the opposite situation?
  ==> Option 2a-1: Don't care about the length change, just overwrite
        the file. ==> The file may become unreadable though.
      Option 2a-2: Check that the bytes-per-character has not changed.
        Raise an exception if it has.

Question 2b: Writing to the buffer will cause the buffer to enter
  a 'dirty' state, meaning that it will be written back
  to the file whenever the buffered window moves within
  the file. This could trigger writes to the file even
  when you are just parsing the file with some tokenizer
  that uses unreadChar. Is this what you want?
  ==> Option 2b-1: Yes, go ahead and write back to the file.
      Option 2b-2: Don't mark the buffer as dirty. ==> This will
        lead to hard-to-predict behavior if you issue
        consecutive unreads. The buffer might have been
        dirty already due to other methods. In that case
        the unread character will be written back to the
        file.
      Option 2b-3: Check if the unread is actually a modification
        or not. Mark the buffer dirty only if a different
        value is being unread. ==> Yes, I could implement
        this, but with a bit of performance overhead.

If any of the above is more appropriate than option 1, let me know.
I think you shouldn't choose 2b-2 for any reason. The other options are a
matter of taste.
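
For comparison, a rough sketch of what Option 2 with 2a-2 and 2b-3 could look like. This is not implemented anywhere; the class UnreadDifferentCharSketch, the advance() stand-in for a real read, and the lastReadBytes parameter are all hypothetical, and UTF-8 plus a single in-memory buffer are assumed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of Option 2 combined with checks 2a-2 and 2b-3.
final class UnreadDifferentCharSketch {
    private final ByteBuffer buf;
    private boolean dirty = false; // true once the buffer differs from the file

    UnreadDifferentCharSketch(byte[] utf8Data) {
        buf = ByteBuffer.wrap(utf8Data);
    }

    // Stand-in for a real read: just moves past nBytes bytes.
    void advance(int nBytes) {
        buf.position(buf.position() + nBytes);
    }

    // Puts codePoint back in place of the lastReadBytes bytes that were just read.
    void unreadChar(int codePoint, int lastReadBytes) {
        int start = buf.position() - lastReadBytes;
        if (lastReadBytes <= 0 || start < 0) {
            throw new IllegalStateException("nothing to unread");
        }
        byte[] encoded = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        // 2a-2: refuse to change the byte length of the character.
        if (encoded.length != lastReadBytes) {
            throw new IllegalArgumentException("bytes-per-character changed: "
                    + lastReadBytes + " -> " + encoded.length);
        }
        // 2b-3: compare with what is already there; this is the extra overhead.
        byte[] current = new byte[lastReadBytes];
        for (int i = 0; i < lastReadBytes; i++) {
            current[i] = buf.get(start + i); // absolute get: position unchanged
        }
        if (!Arrays.equals(current, encoded)) {
            for (int i = 0; i < lastReadBytes; i++) {
                buf.put(start + i, encoded[i]);
            }
            dirty = true; // only a real modification marks the buffer dirty
        }
        buf.position(start);
    }

    boolean isDirty() {
        return dirty;
    }

    int position() {
        return buf.position();
    }
}
```

The byte comparison on every unread is where the performance overhead of 2b-3 comes from. A tokenizer that only unreads what it has read never sets the dirty flag in this sketch.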
Similar things can be said about unreadByte. The current implementation
just rewinds the buffer position by one byte.
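
The byte-level rewind can be pictured like this. Again this is only a sketch under the assumption of a single in-memory buffer (the real class also has to deal with the buffered window moving within the file); UnreadByteSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: unreadByte as a plain one-byte rewind of the buffer position.
final class UnreadByteSketch {
    private final ByteBuffer buf;

    UnreadByteSketch(byte[] data) {
        buf = ByteBuffer.wrap(data);
    }

    // Returns the next byte as an unsigned value (0..255), or -1 at the end of the buffer.
    int readByte() {
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }

    // Steps back one byte; the buffer content is left untouched.
    void unreadByte() {
        if (buf.position() == 0) {
            throw new IllegalStateException("at start of buffer");
        }
        buf.position(buf.position() - 1);
    }

    int position() {
        return buf.position();
    }
}
```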
Hideo.
On Wed, 19 Nov 2008 04:16:22 +0900, Erik Huelsmann <ehuels@...>
wrote:
> Hi!
>
> I've been working on integrating the files you provided into the
> FileStream.java file. I really like what I'm seeing so far, however, I
> would need to be able to "unread" a character. Since you're obviously
> much more familiar with this code than I am, could you extend the
> RandomAccessReader or RandomAccessCharacterFile with an 'unreading'
> function? Unreading makes a character go back into the character
> buffer, so that the next read returns it again.
>
> Presumably, you can dispose of that character when position() is
> being called.
>
> I need this functionality to be in RACF, because the character should
> go back into the byte buffer: if I unread a character which wasn't a
> character, but in fact binary data, I want to be able to get it out
> with readByte().
>
> Thanks in advance for anything you can do!
>
>
> Bye,
>
> Erik.
> On Sat, Nov 15, 2008 at 5:45 AM, Hideo at Yokohama
> <hideo.at.yokohama@...> wrote:
>> Hi again. Some follow-up.
>>
>> Looking at the java stream classes in abcl, I thought you might want to
>> mix binary reads/writes with character reads/writes. Attached is a
>> modified version that provides an InputStream and an OutputStream as
>> well as a Reader and a Writer that are all attached to the same buffer
>> and file.
>> You can mix bytewise and characterwise R/W/Seek operations.
>>
>> I also found a bug in my previous version. This I/O work is pretty
>> delicate and should have a test suite. I only have a small test program
>> that doesn't test automatically; a human has to stare at the results to
>> figure out whether it is working or not. So I refrain from posting that.
>>
>> Additionally I've done a bit of refactoring to eliminate a couple of
>> instance variables that held buffer state. State variables with a broad
>> scope make things harder to understand. Those variables were introduced
>> to gain a bit of performance, but I thought the benefit wasn't worth
>> the fragility they bring in.
>>
>> Bye,
>>
>> Hideo.
>>
>> On Sat, 15 Nov 2008 01:05:58 +0900, Hideo at Yokohama
>> <hideo.at.yokohama@...> wrote:
>>
>>> Hi.
>>>
>>> Good to hear a response.
>>> For the license, GNU/Classpath is ok with me.
>>>
>>> On my part, I've done some experimental fixes to see what happens if
>>> UTF-8
>>> is passed to abcl. I've noticed something that might be troublesome.
>>>
>>> I tried this:
>>> (1) Replace the hardcoded "ISO-8859-1" in Stream.java with "UTF-8".
>>> (2) Change swank-abcl.lisp so that it will accept 'utf-8-unix as the
>>> :coding-system parameter.
>>>
>>> With just those two modifications I ran slime, and it worked fine.
>>> Strings with Japanese text passed from slime were parsed and printed by
>>> abcl.
>>>
>>> However the length function returned the number of bytes, rather than
>>> the number of characters, in the string.
>>> (In most cases a Japanese character is encoded as 3 bytes in UTF-8.)
>>> In Allegro Common Lisp, the number of characters was returned.
>>> I haven't located where string objects get created within abcl, so I
>>> don't know the cause of this yet.
>>>
>>>
>>> I have also been looking at the code of Stream.java, FileStream.java
>>> and Socket.java.
>>> I tried to modify it to make it accept the external-format argument,
>>> but I haven't succeeded yet. The main question is: what should I
>>> compare the LispObject that holds the external-format parameter with?
>>>
>>> Cheers,
>>>
>>> Hideo.
>>>
>>> On Sat, 15 Nov 2008 00:42:03 +0900, Erik Huelsmann <ehuels@...>
>>> wrote:
>>>
>>>> Hi Hideo,
>>>>
>>>> I haven't forgotten your submission, however, I got stuck on some
>>>> other work, temporarily. I do have a question though: which license do
>>>> you want your submission to have? If you don't mind me pointing out,
>>>> if you choose GNU/Classpath [which is the same as the rest of ABCL],
>>>> that would be very handy.
>>>>
>>>> Bye,
>>>>
>>>> Erik.
>>>>
>>>>
>>>> On Sat, Nov 8, 2008 at 6:28 AM, Hideo at Yokohama
>>>> <hideo.at.yokohama@...> wrote:
>>>>>
>>>>> I wrote the classes that I mentioned below.
>>>>> A charset-aware, seekable, Reader-Writer combo.
>>>>> It is essentially a ByteBuffer with a custom Reader and Writer
>>>>> attached to it.
>>>>> I tried to make it as simple as possible, but I had to introduce some
>>>>> state variables, so the logic is a little bit fragile.
>>>>> You can mix operations on the Reader and Writer, and you can seek.
>>>>>
>>>>> Since I'm not familiar with the abcl internals, this is not a patch to
>>>>> abcl.
>>>>> It is just a couple of independently written classes that you could
>>>>> incorporate into abcl.
>>>>>
>>>>> I did some very simple tests with a couple of UTF-8 Japanese files.
>>>>>
>>>>> Hope this helps the support for multiple encodings.
>>>>>
>>>>> Hideo.
>>>>>
>>>>> On Sat, 08 Nov 2008 08:11:37 +0900, Hideo at Yokohama
>>>>> <hideo.at.yokohama@...> wrote:
>>>>>
>>>>>> I have a couple of things to add.
>>>>>>
>>>>>> Before I wrote these classes, I tried Allegro Common Lisp to see
>>>>>> what happens when you seek on a character file.
>>>>>>
>>>>>> The seek operation worked; the position was interpreted in terms of
>>>>>> bytes rather than characters. When you seek to a non-character
>>>>>> boundary, the read operation would return a bogus character, rather
>>>>>> than causing an error.
>>>>>>
>>>>>> After I wrote these classes, I realized that random reads and writes
>>>>>> would be used in combination. A pair of Reader and Writer that
>>>>>> communicate well would be needed to do proper buffering. You should
>>>>>> be able to write to the file, seek a little bit, then read the file
>>>>>> safely.
>>>>>>
>>>>>> So I'm going to try writing such a set of reader and writer,
>>>>>> something like this:
>>>>>>
>>>>>> public class RandomAccessCharacterFile {
>>>>>>     public RandomAccessCharacterFile(RandomAccessFile f, String encoding) { ... }
>>>>>>     public Reader getReader() { ... }
>>>>>>     public Writer getWriter() { ... }
>>>>>>     public long position() { ... }
>>>>>>     public void position(long newPos) { ... }
>>>>>>     public void close() { ... }
>>>>>> }
>>>>>>
>>>>>> On Sat, 08 Nov 2008 01:27:45 +0900, Hideo at Yokohama
>>>>>> <hideo.at.yokohama@...> wrote:
>>>>>>
>>>>>>> Hi.
>>>>>>>
>>>>>>> I wrote a pair of java.io.Reader and java.io.Writer subclasses that
>>>>>>> wrap around a java.io.RandomAccessFile, do encoding/decoding, and
>>>>>>> are seekable.
>>>>>>>
>>>>>>> You can get the current position in the file, as well as set the
>>>>>>> position.
>>>>>>>
>>>>>>> long position();
>>>>>>> void position(long newPosition);
>>>>>>>
>>>>>>> These methods will first flush internal buffers so the file
>>>>>>> position will be accurate.
>>>>>>>
>>>>>>> You can use these classes for files, and use the standard
>>>>>>> InputStreamReader/OutputStreamWriter pair for socket streams.
>>>>>>>
>>>>>>> I think these can be incorporated into the abcl streams.
>>>>>>> Please take a look when you have time.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Hideo
>>>>>>>
>>>>>>> On Thu, 06 Nov 2008 09:06:50 +0900, Hideo at Yokohama
>>>>>>> <hideo.at.yokohama@...> wrote:
>>>>>>>
>>>>>>>> Hi. Continuing on the stream encoding issue.
>>>>>>>>
>>>>>>>> When I get some time, I'd like to look at what other lisps do and
>>>>>>>> what the ansi spec says, but for now I will write based on my Java
>>>>>>>> knowledge and experience in general.
>>>>>>>>
>>>>>>>>> I see I was talking about SeekableByteChannel, which is in the
>>>>>>>>> JDK
>>>>>>>>> (as
>>>>>>>>> of 1.4, I think).
>>>>>>>>
>>>>>>>> OK. I didn't know that one either... JDK has a lot of bloat.
>>>>>>>>
>>>>>>>>> There's one other option though, possibly: record how many
>>>>>>>>> characters
>>>>>>>>> have been read, using a filtering stream. and implementing a
>>>>>>>>> .skip()
>>>>>>>>> function which skips exactly the number of required characters.
>>>>>>>>
>>>>>>>> I think this plan won't work well for a couple of reasons.
>>>>>>>> 1. The bytes-per-character varies from character to character.
>>>>>>>> Japanese text (or text in any other Asian character set) typically
>>>>>>>> has its local (Japanese) characters mixed with ascii range
>>>>>>>> characters. So you can't jump to a file position that is specified
>>>>>>>> in terms of character count. You actually have to check all the
>>>>>>>> bytes starting from the beginning of the file up to the position
>>>>>>>> specified.
>>>>>>>> Even for a backward seek (a rewinding seek), recording how many
>>>>>>>> bytes you have read is not enough. You need to see all the bytes
>>>>>>>> and detect all the character boundaries within the byte stream.
>>>>>>>>
>>>>>>>> 2. Keeping in mind that we want to use JDK InputStreamReader and
>>>>>>>> OutputStreamWriter, you can't make them discard the content of
>>>>>>>> their buffers. They won't tell you how many bytes have been
>>>>>>>> consumed within their buffers either. The minimum amount of buffer
>>>>>>>> required to implement those streams is a couple of bytes, but they
>>>>>>>> might have a big buffer, like 8k, and consume that buffer little by
>>>>>>>> little. In that case the position in the underlying stream might be
>>>>>>>> quite different from the position in the converter stream.
>>>>>>>>
>>>>>>>> 3. For that strategy to work, we at least need an encoding
>>>>>>>> converter where the underlying file pos is visible. AFAIK, you
>>>>>>>> can't do that on top of JDK streams, so you are on your own to do
>>>>>>>> the actual encoding conversions.
>>>>>>>>
>>>>>>>>> Thanks for your input. If we can't do any better than what we
>>>>>>>>> just
>>>>>>>>> discussed, how about doing it this way:
>>>>>>>>>
>>>>>>>>> - Leave most of FileStream intact, meaning it'll still be based
>>>>>>>>> on
>>>>>>>>> RandomAccessFile
>>>>>>>>
>>>>>>>> Sounds ok.
>>>>>>>>
>>>>>>>>> - Use RandomAccessFile.getFD() to bind an
>>>>>>>>> InputStream/OutputStream
>>>>>>>>> to
>>>>>>>>> the file
>>>>>>>>
>>>>>>>> I'm not sure about this one. I haven't used RandomAccessFile at
>>>>>>>> all.
>>>>>>>> Probably it's ok.
>>>>>>>>
>>>>>>>>> - Use InputStreamReader and OutputStreamWriter to read/write
>>>>>>>>> character
>>>>>>>>> data on the streams.
>>>>>>>>>
>>>>>>>>> Only in case the streams and reader/writers haven't been used,
>>>>>>>>> we'll be able to seek() on the random access file. (Possibly, we
>>>>>>>>> can detect this by not creating the reader/writer until actual
>>>>>>>>> input is required.)
>>>>>>>>
>>>>>>>> Sounds ok to me.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Hideo.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, 06 Nov 2008 05:26:18 +0900, Erik Huelsmann
>>>>>>>> <ehuels@...>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Wed, Nov 5, 2008 at 4:27 PM, Hideo at Yokohama
>>>>>>>>> <hideo.at.yokohama@...> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi. Thanks for your time investigating.
>>>>>>>>>>
>>>>>>>>>> Are you talking about java.io.FileInputStream ? Or is it a lisp
>>>>>>>>>> thing?
>>>>>>>>>> I've never heard of a SeekableChannel, and could not grep it in
>>>>>>>>>> the
>>>>>>>>>> abcl
>>>>>>>>>> source.
>>>>>>>>>> (SeekableChannel seems to be a C++ thing, according to Google)
>>>>>>>>>
>>>>>>>>> I see I was talking about SeekableByteChannel, which is in the
>>>>>>>>> JDK
>>>>>>>>> (as
>>>>>>>>> of 1.4, I think).
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If you are talking about the one in JDK, here is my comment :
>>>>>>>>>>
>>>>>>>>>> Character code conversion is implemented in InputStreamReader
>>>>>>>>>> and OutputStreamWriter.
>>>>>>>>>> To support multi-byte encodings, the converters in both
>>>>>>>>>> directions have to maintain some state (i.e. they must have some
>>>>>>>>>> amount of buffer).
>>>>>>>>>> So you can't safely seek the underlying stream to a different
>>>>>>>>>> position.
>>>>>>>>>
>>>>>>>>> hrm. ok. I understand what you're saying. It's a bit
>>>>>>>>> disappointing,
>>>>>>>>> but I guess it works for the Java world.
>>>>>>>>>
>>>>>>>>> There's one other option though, possibly: record how many
>>>>>>>>> characters
>>>>>>>>> have been read, using a filtering stream. and implementing a
>>>>>>>>> .skip()
>>>>>>>>> function which skips exactly the number of required characters.
>>>>>>>>>
>>>>>>>>>> As a result, JDK Readers and Writers (which are character
>>>>>>>>>> oriented, rather than byte oriented) don't provide seeking
>>>>>>>>>> functions. For java.io.Reader classes, the closest thing to seek
>>>>>>>>>> is the 'mark' functionality. You can tell the stream to 'mark'
>>>>>>>>>> the current position, i.e. remember the current file pos, then
>>>>>>>>>> read some data, then tell the stream to rewind to the position
>>>>>>>>>> that was marked.
>>>>>>>>>> The mark function doesn't tell you what the current byte offset
>>>>>>>>>> is. It just remembers.
>>>>>>>>>> You can't give an arbitrary integer to a Reader and tell it to
>>>>>>>>>> seek to that position.
>>>>>>>>>>
>>>>>>>>>> In the lisp world, with my limited lisp knowledge, I guess the
>>>>>>>>>> safe way to go is to make seeking functionality available only
>>>>>>>>>> to files that are accessed as raw byte streams.
>>>>>>>>>
>>>>>>>>> I'm not sure how other lisps do it, but I think they may just
>>>>>>>>> count
>>>>>>>>> the file position in bytes.
>>>>>>>>>
>>>>>>>>>> In all other cases, make the seek functions cause an error.
>>>>>>>>>> Seek is rather hard to use.
>>>>>>>>>
>>>>>>>>>> Seeks appear in programs that are aware of the binary data
>>>>>>>>>> layout in files.
>>>>>>>>>> That's not something that everyone does.
>>>>>>>>>>
>>>>>>>>>> I am curious how the other lisp implementations handle this.
>>>>>>>>>
>>>>>>>>> Thanks for your input. If we can't do any better than what we
>>>>>>>>> just
>>>>>>>>> discussed, how about doing it this way:
>>>>>>>>>
>>>>>>>>> - Leave most of FileStream intact, meaning it'll still be based
>>>>>>>>> on
>>>>>>>>> RandomAccessFile
>>>>>>>>> - Use RandomAccessFile.getFD() to bind an
>>>>>>>>> InputStream/OutputStream
>>>>>>>>> to
>>>>>>>>> the file
>>>>>>>>> - Use InputStreamReader and OutputStreamWriter to read/write
>>>>>>>>> character
>>>>>>>>> data on the streams.
>>>>>>>>>
>>>>>>>>> Only in case the streams and reader/writers haven't been used,
>>>>>>>>> we'll be able to seek() on the random access file. (Possibly, we
>>>>>>>>> can detect this by not creating the reader/writer until actual
>>>>>>>>> input is required.)
>>>>>>>>>
>>>>>>>>> What would you say to this strategy? Will it work? Or not?
>>>>>>>>>
>>>>>>>>> Bye,
>>>>>>>>>
>>>>>>>>> Erik.
>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Hideo
>>>>>>>>>>
>>>>>>>>>> On Wed, 05 Nov 2008 05:35:28 +0900, Erik Huelsmann
>>>>>>>>>> <ehuels@...>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> 2008/11/1 Hideo at Yokohama <hideo.at.yokohama@...>:
>>>>>>>>>>>>
>>>>>>>>>>>> I took a look at the tickets. Thanks for adding them!
>>>>>>>>>>>>
>>>>>>>>>>>> For #13 I looked at FileStream.java. If _setFilePosition is
>>>>>>>>>>>> required even for character streams, I think there is no easy
>>>>>>>>>>>> and safe way to implement multiple encoding handling.
>>>>>>>>>>>> But since abcl can read from network connections, on which you
>>>>>>>>>>>> can't do random access, I believe there is a way to keep the
>>>>>>>>>>>> seeking requirement out of the way of character (non-binary)
>>>>>>>>>>>> streams.
>>>>>>>>>>>
>>>>>>>>>>> I added comments to ticket # 13; I think we could use
>>>>>>>>>>> FileInputStream:
>>>>>>>>>>> it supports a SeekableChannel interface which allows getting
>>>>>>>>>>> and
>>>>>>>>>>> setting of the file position. What would you think about that
>>>>>>>>>>> solution?
>>>>>>>>>>>
>>>>>>>>>>> Bye,
>>>>>>>>>>>
>>>>>>>>>>> Erik.
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Opera's revolutionary e-mail client: http://jp.opera.com/mail/
>>>
>>>
>>>
>>
>>
>>