HTML files with multi-byte encoding shown as garbage

Status: Pre-Alpha

Brought to you by: yanshi

#13 HTML files with multi-byte encoding shown as garbage

Milestone: v0.1.1

Status: open

Owner: Yan_sino

Labels: None

Priority: 5

Updated: 2011-03-09

Created: 2011-03-09

Creator: Yan_sino

Private: No

When indexing HTML files with a multi-byte encoding (such as Shift-JIS), the encoding declared inside the file is ignored. The content is also not interpreted using the "native encoding" of the operating system (which happens to be Shift-JIS as well). Instead, Shift-JIS kanji characters are shown as garbage in Google Desktop Search.
Plain text files in Shift-JIS encoding, by contrast, are indexed without any problem.
I do not know whether XML has the same problem as HTML. Since both seem to be processed by the same IFilter, it is possible that XML files are affected too.
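
For illustration, the behaviour expected in this report (use the encoding declared in the file, otherwise fall back to the native encoding) could be sketched as below. This is only a minimal sketch in Python, not the IFilter's actual code. The names `declared_charset` and `decode_html` are hypothetical, and the `shift_jis` fallback is an assumption standing in for the operating system's native encoding on the affected machine.

```python
import re

# Matches charset declarations such as
#   <meta charset="shift_jis">
#   <meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def declared_charset(raw: bytes, scan_limit: int = 4096):
    """Return the charset named in a <meta> tag near the start of the file, or None."""
    match = META_CHARSET.search(raw[:scan_limit])
    return match.group(1).decode("ascii") if match else None


def decode_html(raw: bytes, fallback: str = "shift_jis") -> str:
    """Decode HTML bytes with the declared charset, else with the fallback.

    `fallback` is an assumption: it stands in for the native encoding
    of the operating system.
    """
    charset = declared_charset(raw) or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect declaration: use the fallback and
        # replace any bytes that still cannot be decoded.
        return raw.decode(fallback, errors="replace")
```

With a file that declares `charset=Shift_JIS`, `decode_html` returns the kanji intact, whereas decoding the same bytes with a single-byte encoding such as Latin-1 produces the kind of garbage described above.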
