#1 utf8 support

open
nobody
None
5
2004-04-02
2004-04-02
No

> glark 帰属―非営利―派生禁止 *.*
Binary file foo_jp.html matches

foo_jp.html is utf8 encoded. This is one case where
grep actually works better than glark -- grep displays
the matching line perfectly, assuming my terminal
handles utf8.

Glark is fantastic, thanks!

Discussion

  • Mike Linksvayer

    Mike Linksvayer - 2004-04-02

    Logged In: YES
    user_id=15056

    Note that I didn't type a bunch of escapes on the command
    line -- sf turned the Japanese characters into escapes.

     
  • Jeff Pace

    Jeff Pace - 2004-04-03

    Logged In: YES
    user_id=316860

    Thanks for the feedback. Could you please attach an input file and a file
    containing the string you'd query for? I'm not having success reproducing
    this myself.

    glark makes assumptions that one wants to search only text files, and
    assumes ASCII. However, the option: "--binary=text" might result in the
    behavior you're seeking.

    If there's a more elegant way to determine this, I'd like to know. Perhaps the
    regular expressions themselves could be examined for non-ASCII characters,
    and if found, the ASCII check wouldn't be done.
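
The idea in the paragraph above (inspect the pattern, and skip the ASCII test if the pattern itself is non-ASCII) could be sketched roughly as follows. This is only an illustration in Python, not glark's actual code; the function names `pattern_has_non_ascii` and `should_apply_ascii_check` are hypothetical.

```python
def pattern_has_non_ascii(pattern: str) -> bool:
    """Return True if the search pattern contains any non-ASCII character."""
    return any(ord(ch) > 127 for ch in pattern)


def should_apply_ascii_check(pattern: str) -> bool:
    """Decide whether files should still be screened with an ASCII-only test.

    Assumption: if the user searches for non-ASCII text, they expect
    non-ASCII files to be searched, so the ASCII screen is skipped.
    """
    return not pattern_has_non_ascii(pattern)


if __name__ == "__main__":
    for query in ["foo.*bar", "帰属―非営利―派生禁止"]:
        print(query, should_apply_ascii_check(query))
```

One limitation of this approach: an ASCII-only pattern searched against a UTF-8 file would still hit the ASCII test, so it would probably complement rather than replace better text detection.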

     
  • Mike Linksvayer

    Mike Linksvayer - 2004-04-03

    Logged In: YES
    user_id=15056

    glark --binary=text foo bar.txt

    tells me option not understood: --binary=text

    I've attached input.utf8.html, which contains a string
    matched by query.utf8.txt, which I'll upload momentarily.

     
  • Mike Linksvayer

    Mike Linksvayer - 2004-04-03

    contains text that matches query.utf8.txt

     
  • Mike Linksvayer

    Mike Linksvayer - 2004-04-03

    Logged In: YES
    user_id=15056

    upload file that matches text in input.utf8.html

     
  • Mike Linksvayer

    Mike Linksvayer - 2004-04-03

    Logged In: YES
    user_id=15056

    upload file that matches string in input.utf8.html

     
  • Mike Linksvayer

    Mike Linksvayer - 2004-04-03

    matches string in input.utf8.html

     
  • Mike Linksvayer

    Mike Linksvayer - 2004-04-03

    Logged In: YES
    user_id=15056

    By the way, file(1) correctly picks out utf-8 encoded text.
    Excerpt from its man page:

    If a file does not match any of the entries in the magic
    file, it is examined to see if it seems to be a text file.
    ASCII, ISO-8859-x, non-ISO 8-bit extended-ASCII character
    sets (such as those used on Macintosh and IBM PC systems),
    UTF-8-encoded Unicode, UTF-16-encoded Unicode, and EBCDIC
    character sets can be distinguished by the different ranges
    and sequences of bytes that constitute printable text in
    each set.
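
A much simplified version of that kind of byte-sequence check, limited to UTF-8, might look like the sketch below. It is an assumption-laden illustration in Python, not what file(1) or glark actually does: the name `looks_like_utf8_text` is hypothetical, and treating any NUL byte as a sign of binary data is a common heuristic rather than a rule taken from the man page.

```python
import codecs


def looks_like_utf8_text(sample: bytes) -> bool:
    """Guess whether a byte sample is UTF-8 (or plain ASCII) text."""
    # Heuristic assumption: text files do not contain NUL bytes.
    if b"\x00" in sample:
        return False
    # An incremental decoder with final=False tolerates a multibyte
    # character that was cut off at the end of the sample.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8_text("帰属―非営利―派生禁止".encode("utf-8")))
    print(looks_like_utf8_text(b"\x89PNG\r\n\x1a\n"))
```

Because UTF-16 text normally contains NUL bytes, this sketch would classify it as binary; handling the other encodings listed in the excerpt would need additional checks.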

     
  • Jeff Pace

    Jeff Pace - 2004-04-03

    Logged In: YES
    user_id=316860

    The example files are working for me, with my limited
    settings for Japanese. That is, glark is showing matches
    at lines 4, 11, 20, 26, and 200. It seems that the main
    problem would be the ASCII test.

    My example earlier was incorrect; I meant to suggest the
    option "--binary-files=text". However, the function to
    detect text files will be updated for UTF-8.

     
  • Mike Linksvayer

    Mike Linksvayer - 2004-04-03

    Logged In: YES
    user_id=15056

    The "--binary-files=text" option works well for me. I
    cannot explain my stupidity at not looking at usage. Thanks
    for your help!

     
