HTML files with multi-byte encoding shown as garbage
Status: Pre-Alpha
Brought to you by:
yanshi
When indexing HTML files with a multi-byte encoding (such as Shift-JIS), the encoding declared inside the file is not honored. The content is also not interpreted using the "native encoding" of the operating system (which in this case happens to be Shift-JIS as well). Instead, Shift-JIS kanji characters are shown as garbage in Google Desktop Search.
Plain-text files in Shift-JIS encoding, by contrast, are indexed without any problem.
I do not know whether XML has the same problem as HTML. But since both seem to be processed by the same IFilter, it is possible that XML files are affected as well.
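
To illustrate the symptom, here is a minimal Python sketch of the difference between honoring the encoding declared in an HTML file and decoding the same bytes with an unrelated single-byte code page. It is not how Google Desktop Search or the IFilter works internally. The helper name `declared_charset`, the sample markup, and the choice of `cp1252` as the wrong fallback are all assumptions made only for this example.

```python
import re

def declared_charset(raw: bytes, default: str = "cp1252") -> str:
    """Return the charset named in the HTML markup, or `default` if none is found.

    Hypothetical helper for illustration only. The declaration itself is
    ASCII, so it can be located by scanning the undecoded bytes.
    """
    match = re.search(rb'charset\s*=\s*["\']?([\w-]+)', raw[:1024], re.IGNORECASE)
    return match.group(1).decode("ascii") if match else default

# Sample document (an assumption for this sketch): kanji stored as Shift-JIS bytes.
text = "漢字"
html = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=shift_jis"></head>'
    "<body>" + text + "</body></html>"
).encode("shift_jis")

# Expected behavior: decode with the encoding the file declares.
correct = html.decode(declared_charset(html))

# Reported symptom: the same bytes read as a single-byte code page turn into garbage.
garbled = html.decode("cp1252", errors="replace")

print(text in correct)   # kanji survive when the declared encoding is used
print(text in garbled)   # kanji are lost when the declaration is ignored
```

With the declared encoding honored, the kanji round-trip intact. With the declaration ignored, each two-byte character is split into two unrelated Latin characters, which matches the kind of garbage seen in the search results.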