utf8 support

Brought to you by: jeugenepace

#1 utf8 support

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2004-04-02

Created: 2004-04-02

Creator: Mike Linksvayer

Private: No

> glark 帰属―非営利―派生禁止 *.*
Binary file foo_jp.html matches

foo_jp.html is utf8 encoded. This is one case where
grep actually works better than glark -- grep displays
the matching line perfectly, assuming my terminal
handles utf8.

Glark is fantastic, thanks!

Discussion

Mike Linksvayer - 2004-04-02

Logged In: YES
user_id=15056

Note that I didn't type a bunch of escapes on the command
line -- sf turned the Japanese characters into escapes.

Jeff Pace - 2004-04-03

Logged In: YES
user_id=316860

Thanks for the feedback. Could you please attach an input file and a file
containing the string you'd query for? I'm not having success reproducing
this myself.

glark makes assumptions that one wants to search only text files, and
assumes ASCII. However, the option: "--binary=text" might result in the
behavior you're seeking.

If there's a more elegant way to determine this, I'd like to know. Perhaps the
regular expressions themselves could be examined for non-ASCII characters,
and if found, the ASCII check wouldn't be done.
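
The idea above can be sketched roughly as follows. This is only an
illustration in Python, not glark's actual code; the names
has_non_ascii and should_apply_ascii_check are hypothetical, and the
policy (skip the ASCII test whenever any pattern contains a non-ASCII
character) is an assumption about how such a check might be wired in.

```python
from typing import Iterable


def has_non_ascii(pattern: str) -> bool:
    """Return True if the pattern contains any character outside 7-bit ASCII."""
    return any(ord(ch) > 127 for ch in pattern)


def should_apply_ascii_check(patterns: Iterable[str]) -> bool:
    """Hypothetical policy: keep the ASCII-only file test unless
    at least one search pattern itself contains non-ASCII characters."""
    return not any(has_non_ascii(p) for p in patterns)
```

With a pattern such as the Japanese query in the report, the ASCII test
would be skipped; with a plain pattern like "foo.*bar" it would still
run.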

Mike Linksvayer - 2004-04-03

Logged In: YES
user_id=15056

glark --binary=text foo bar.txt

tells me option not understood: --binary=text

I've attached input.utf8.html, which contains a string
matched by query.utf8.txt, which I'll upload momentarily.

Mike Linksvayer - 2004-04-03

contains text that matches query.utf8.txt

input.utf8.html

Mike Linksvayer - 2004-04-03

Logged In: YES
user_id=15056

upload file that matches text in input.utf8.html

Mike Linksvayer - 2004-04-03

Logged In: YES
user_id=15056

upload file that matches string in input.utf8.html

Mike Linksvayer - 2004-04-03

matches string in input.utf8.html

query.utf8.txt

Mike Linksvayer - 2004-04-03

Logged In: YES
user_id=15056

By the way, file(1) correctly picks out utf-8 encoded text.
Excerpt from its man page:

If a file does not match any of the entries in the magic
file, it is examined to see if it seems to be a text file.
ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character
sets (such as those used on Macintosh and IBM PC systems),
UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC
character sets can be distinguished by the different ranges
and sequences of bytes that constitute printable text in
each set.
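
For illustration, a much simpler test than the one file(1) performs
might look like the sketch below. It is in Python rather than glark's
own language, the function name looks_like_text is hypothetical, and it
only recognizes ASCII and UTF-8: it assumes a sample of bytes read from
the start of the file and treats anything that is not well-formed UTF-8
(including ISO-8859-x text) as binary.

```python
import codecs


def looks_like_text(sample: bytes) -> bool:
    """Guess whether a byte sample is ASCII or UTF-8 encoded text."""
    # A NUL byte is a strong hint that the data is binary.
    if b"\x00" in sample:
        return False
    # An incremental decoder with final=False tolerates a multi-byte
    # sequence that was cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        text = decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    # Reject control characters other than common whitespace.
    return all(ch in "\t\n\r\f" or ord(ch) >= 0x20 for ch in text)
```

A file such as input.utf8.html would pass this test, since its bytes
decode cleanly as UTF-8, while data containing invalid byte sequences
would not.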

Jeff Pace - 2004-04-03

Logged In: YES
user_id=316860

The example files are working for me, with my limited
settings for Japanese. That is, glark is showing matches
at lines 4, 11, 20, 26, and 200. It seems that the main
problem would be the ASCII test.

My example earlier was incorrect; I meant to suggest the
option "--binary-files=text". However, the function to
detect text files will be updated for UTF-8.

Mike Linksvayer - 2004-04-03

Logged In: YES
user_id=15056

The "--binary-files=text" option works well for me. I
cannot explain my stupidity at not looking at usage. Thanks
for your help!
