Originally created by: kilve...@gmail.com
What steps will reproduce the problem?
1. A file containing PAT along with ä (0xE4, a umlaut), ö (0xF6, o umlaut), and other high-bit (non-ASCII) characters
2. cindex -reset .
3. csearch -i PAT
What is the expected output? What do you see instead?
Expected to find PAT. Instead no match.
What version of the product are you using? On what operating system?
codesearch-0.01-windows-amd64.zip
Please provide any additional information below.
I looked at the source (write.go) and it seems to expect that files are UTF-8 only (is this a Go convention?). However, it would be nice if csearch could be used with any source files, including those with high-bit (non-ASCII) characters, or if there were a command line option for this.
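A minimal sketch of the kind of validity check the indexer performs, using Go's unicode/utf8 package. This is not the actual write.go code, and the function name looksLikeValidUTF8 is mine; it only illustrates why a lone Latin-1 byte such as 0xE4 fails:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// looksLikeValidUTF8 reports whether data decodes cleanly as UTF-8.
// A Latin-1 byte such as 0xE4 ('ä') is not valid UTF-8 on its own,
// so a Latin-1 encoded file fails this check.
func looksLikeValidUTF8(data []byte) bool {
	for len(data) > 0 {
		r, size := utf8.DecodeRune(data)
		if r == utf8.RuneError && size == 1 {
			return false // invalid byte sequence
		}
		data = data[size:]
	}
	return true
}

func main() {
	fmt.Println(looksLikeValidUTF8([]byte("PAT plain ascii")))   // true
	fmt.Println(looksLikeValidUTF8([]byte{'P', 'A', 'T', 0xE4})) // false: lone Latin-1 0xE4
	fmt.Println(looksLikeValidUTF8([]byte("PAT ä")))             // true: ä here is UTF-8 (0xC3 0xA4)
}
```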
Originally posted by: kilve...@gmail.com
But I don't see a log message on the console about the file being skipped; the cindex run looks normal.
Originally posted by: manpreet...@gmail.com
I encountered all these issues you mentioned and was annoyed enough by them to implement the following changes for myself at https://github.com/junkblocker/codesearch
1) Do not stop at the first bad UTF-8 character encountered. Instead, allow a percentage of non-UTF-8 characters in the document; these are ignored but the rest of the document gets indexed. The option, which I call -maxinvalidutf8ratio, defaults to 0.1. This, combined with treating a document containing a 0x00 byte as binary, has been working great for me.
2) Allow a custom trigram limit (-maxtrigrams). The current hardcoded limit is 20000 trigrams, but I sadly have to work on code with one important source file beyond that.
3) Add a message, with the reason, for every document skipped during indexing.
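The ratio-based idea in point 1 can be sketched as follows. This is only my reading of the approach, not the fork's actual code; the function names and the exact counting method are assumptions, with the 0.1 default and the NUL-byte binary check taken from the description above:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// invalidUTF8Ratio returns the fraction of bytes that are not part of a
// valid UTF-8 sequence.
func invalidUTF8Ratio(data []byte) float64 {
	if len(data) == 0 {
		return 0
	}
	invalid := 0
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		if r == utf8.RuneError && size == 1 {
			invalid++
		}
		i += size
	}
	return float64(invalid) / float64(len(data))
}

// shouldIndex mirrors the described policy: reject files containing a
// NUL byte as binary, otherwise index unless the invalid-UTF-8 ratio
// exceeds the threshold (the fork's default is 0.1).
func shouldIndex(data []byte, maxRatio float64) bool {
	for _, b := range data {
		if b == 0x00 {
			return false // NUL byte: treat as binary
		}
	}
	return invalidUTF8Ratio(data) <= maxRatio
}

func main() {
	// Latin-1 source with a single stray 0xF6 ('ö'): well under the threshold.
	latin1 := []byte("int main() { /* by J\xF6rg */ return 0; }")
	fmt.Printf("ratio=%.3f index=%v\n", invalidUTF8Ratio(latin1), shouldIndex(latin1, 0.1))
}
```

A file like the original report's (mostly ASCII with a few umlauts) would pass this check and stay in the index.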
I would love to get those changes merged, or at least considered for an alternate implementation here in the official sources, but I am not sure how alive this project is.
Originally posted by: rsc@golang.org
The project is not super alive. Mostly the code just works and we
leave it alone. I think the UTF-8 heuristic works pretty well as does
the trigram size heuristic. It's possible to tune these forever, of
course. How many trigrams does your important file have?
I thought that the indexer already printed the files it skipped if
you run it in verbose mode, but maybe I am misremembering.
Originally posted by: manpreet...@gmail.com
All source files being UTF-8 is a pretty big assumption. A lot of files may be Latin-1 etc., which is the most common problem I encountered. A random European author's name with a diacritic, or some Cyrillic text, drops a whole file from the index, making codesearch something that can't be depended on at all. When I am changing code based on what codesearch finds in my codebase, I don't want to miss files for this reason. codesearch should not be less reliable than a regular grep.
The file I mentioned is around 30K trigrams. It was simple to just add a custom limit flag.
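For reference, the quantity being bounded here is roughly the number of distinct trigrams (3-byte windows) in a file. A simplified sketch of counting them — the real indexer in index/write.go keeps its own trigram set representation, so this is only illustrative, with the 20000 figure taken from the discussion above:

```go
package main

import "fmt"

// countTrigrams returns the number of distinct 3-byte sequences in data.
// The stock cindex limit discussed above is 20000 trigrams per file;
// a large generated or concatenated source file can easily exceed it.
func countTrigrams(data []byte) int {
	seen := make(map[[3]byte]bool)
	for i := 0; i+3 <= len(data); i++ {
		var t [3]byte
		copy(t[:], data[i:i+3])
		seen[t] = true
	}
	return len(seen)
}

func main() {
	// "abcabc" has windows abc, bca, cab, abc -> 3 distinct trigrams.
	fmt.Println(countTrigrams([]byte("abcabc"))) // 3
}
```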
The indexer omits the warning in a couple of places, mainly because of the assumptions it makes about the input data. The one example I recall off the top of my head is that it quietly ignores symlinked paths (for which I submitted another patch to optionally not ignore them).