utf8 support
Brought to you by:
jeugenepace
> glark 帰属―非営利―派生禁止 *.*
Binary file foo_jp.html matches
foo_jp.html is utf8 encoded. This is one case where
grep actually works better than glark -- grep displays
the matching line perfectly, assuming my terminal
handles utf8.
Glark is fantastic, thanks!
Logged In: YES
user_id=15056
Note that I didn't type a bunch of escapes on the command
line -- sf turned the Japanese characters into escapes.
Logged In: YES
user_id=316860
Thanks for the feedback. Could you please attach an input file and a file
containing the string you'd query for? I'm not having success reproducing this myself.
glark makes assumptions that one wants to search only text files, and
assumes ASCII. However, the option: "--binary=text" might result in the
behavior you're seeking.
If there's a more elegant way to determine this, I'd like to know. Perhaps the
regular expressions themselves could be examined for non-ASCII characters,
and if found, the ASCII check wouldn't be done.
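To make that idea concrete, here is a minimal sketch in Python rather than in glark's own code. The function name is hypothetical and nothing here reflects glark's actual internals; it only shows how a pattern could be examined for non-ASCII characters before deciding whether to run the ASCII check.

```python
def pattern_has_non_ascii(pattern: str) -> bool:
    """Return True if the search pattern contains any non-ASCII character.

    Hypothetical helper: a caller could skip its ASCII-only text check
    whenever this returns True, on the assumption that a non-ASCII
    pattern means non-ASCII input is expected.
    """
    return not pattern.isascii()  # str.isascii() requires Python 3.7+
```

With this, a pattern such as "帰属" would bypass the ASCII check, while "foo.*bar" would keep the current behavior.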
Logged In: YES
user_id=15056
glark --binary=text foo bar.txt
tells me option not understood: --binary=text
I've attached input.utf8.html, which contains a string
matched by query.utf8.txt, which I'll upload momentarily.
Logged In: YES
user_id=15056
upload file that matches text in input.utf8.html
Logged In: YES
user_id=15056
upload file that matches string in input.utf8.html
Logged In: YES
user_id=15056
By the way, file(1) correctly picks out utf-8 encoded text.
Excerpt from its man page:
If a file does not match any of the entries in the magic
file, it is examined to see if it seems to be a text file.
ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character
sets (such as those used on Macintosh and IBM PC systems),
UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC
character sets can be distinguished by the different ranges
and sequences of bytes that constitute printable text in
each set.
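A rough sketch of that kind of byte-level check, in Python, might look like the following. It is not how file(1) or glark is implemented. The function name and the three labels are assumptions made for illustration, and it covers only ASCII and UTF-8, so UTF-16 (which contains NUL bytes) is classed as binary here.

```python
def classify_sample(data: bytes) -> str:
    """Classify a byte sample as 'ascii', 'utf-8', or 'binary'.

    Simplified illustration: NUL bytes are treated as a sign of binary
    data, then the sample is test-decoded as ASCII and as UTF-8.
    """
    if b"\x00" in data:
        return "binary"
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "binary"
```

One caveat: if only the first few kilobytes of a file are sampled, the sample can end in the middle of a multi-byte sequence and be misread as binary, so a real implementation would need to tolerate a truncated final character.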
Logged In: YES
user_id=316860
The example files are working for me, with my limited
settings for Japanese. That is, glark is showing matches
at lines 4, 11, 20, 26, and 200. It seems that the main
problem would be the ASCII test.
My example earlier was incorrect; I meant to suggest the
option "--binary-files=text". However, the function to
detect text files will be updated for UTF-8.
Logged In: YES
user_id=15056
The "--binary-files=text" option works well for me. I
cannot explain my stupidity at not looking at usage. Thanks
for your help!