#256 Document encoded in GB18030 causes exception

closed-wont-fix
None
2
2010-04-08
2008-10-23
John
No

Whilst looking at the legacy files which have been converted into 'modern' formats, I came across several plaintext documents encoded in GB18030.

Xena spat the dummy when trying to normalise them. Console output below...

Not sure if this is a bug or a feature request, but it would be nice to be able to handle documents with a non-standard character set.

------------------------------------------------
FINEST: XIS file:/T:/Legacy%20-%20format%20converted/C379P1/PHASE_3/MEDIA/053/379P153A.03D guessed as type PlainText
22/10/2008 09:48:28 au.gov.naa.digipres.xena.kernel.guesser.GuesserManager getBestGuess
FINER: Exception thrown in guesser NonStandardPlainTextGuesser
au.gov.naa.digipres.xena.kernel.XenaException: java.io.UnsupportedEncodingException: GB18030
    at au.gov.naa.digipres.xena.plugin.plaintext.NonStandardPlainTextGuesser.guess(NonStandardPlainTextGuesser.java:98)
    at au.gov.naa.digipres.xena.kernel.guesser.GuesserManager.getBestGuess(GuesserManager.java:358)
    at au.gov.naa.digipres.xena.kernel.guesser.GuesserManager.mostLikelyType(GuesserManager.java:262)
    at au.gov.naa.digipres.xena.core.Xena.getMostLikelyType(Xena.java:258)
    at au.gov.naa.digipres.xena.core.Xena.getMostLikelyType(Xena.java:243)
    at au.gov.naa.digipres.xena.litegui.NormalisationThread.setTypes(NormalisationThread.java:419)
    at au.gov.naa.digipres.xena.litegui.NormalisationThread.normaliseStandard(NormalisationThread.java:196)
    at au.gov.naa.digipres.xena.litegui.NormalisationThread.run(NormalisationThread.java:144)
Caused by: java.io.UnsupportedEncodingException: GB18030
    at sun.nio.cs.StreamDecoder.forInputStreamReader(Unknown Source)
    at java.io.InputStreamReader.<init>(Unknown Source)
    at au.gov.naa.digipres.xena.plugin.plaintext.NonStandardPlainTextGuesser.guess(NonStandardPlainTextGuesser.java:91)
    ... 7 more

Discussion

  • John

    John - 2008-10-23

    File encoded in GB18030

     
  • Justin Waddell

    Justin Waddell - 2008-10-23

    I looked at the list of supported encodings in Java and GB18030 should be supported, so this looks like a bug. I'll look into it.

    However, the exception you listed here occurred during the guessing phase, and Xena should have normalised the file using Binary. Is this what happened, or did it guess the file as something random?
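
A quick way to confirm whether a particular JVM can actually decode GB18030 is to ask it directly, since the set of available charsets can differ between Java installations. The following is a minimal sketch, not code from Xena; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Xena: reports whether the running JVM
// supports a named charset.
public class CharsetSupportCheck {

    /** Returns true if this JVM has a charset registered under the given name. */
    public static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not legal in a charset name.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the charset from this ticket; pass another name to test it.
        String name = args.length > 0 ? args[0] : "GB18030";
        System.out.println(name + " supported: " + isSupported(name));
    }
}
```

If this reports `false` on the machine that produced the console output, the `UnsupportedEncodingException` would be explained by the Java installation rather than by the file itself.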

     
  • Justin Waddell

    Justin Waddell - 2008-10-23
    • assigned_to: nobody --> jwaddell
    • status: open --> open-accepted
     
  • Justin Waddell

    Justin Waddell - 2008-10-23

    On my machine Xena identifies the encoding as GB18030 and reads a set of characters using this encoding. Because some of the characters read are non-text characters, Xena determines that it is not a valid GB18030 file, and therefore not actually a plain text file, and normalises it with Binary. GB18030 is a Simplified Chinese encoding, which is most likely not the actual encoding. It is possible that the encoding detection library does not know about the encoding used in this file, so its "wild stab in the dark" of GB18030 is wrong. Any ideas on what the encoding actually is?
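
The check described above (decode with the guessed charset, then reject the file if non-text characters turn up) can be sketched roughly as follows. This is an illustration only, not Xena's actual guesser code; the class name, method name, and the choice of which control characters count as "text" are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical sketch, not Xena's implementation.
public class PlainTextCheck {

    /**
     * Returns true if the bytes decode without error in the given charset
     * and the decoded text contains no control characters other than
     * common whitespace (newline, carriage return, tab, form feed).
     */
    public static boolean looksLikePlainText(byte[] data, Charset charset) {
        CharBuffer chars;
        try {
            // REPORT makes the decoder throw instead of silently substituting
            // replacement characters for invalid byte sequences.
            chars = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
        } catch (CharacterCodingException e) {
            return false; // not valid in this encoding
        }
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            boolean commonWhitespace = c == '\n' || c == '\r' || c == '\t' || c == '\f';
            if (Character.isISOControl(c) && !commonWhitespace) {
                return false; // decodes, but does not look like text
            }
        }
        return true;
    }
}
```

A caller would fall back to a binary normaliser whenever this returns `false`, for example after reading a sample of the file and calling `PlainTextCheck.looksLikePlainText(sample, Charset.forName("GB18030"))`.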

     
  • John

    John - 2008-10-23

    From the console output, I assume that the file was guessed as Non-Standard plaintext. It was wrapped as binary. I only ran the file through Xena, not DPR.

    At a rough guess, I think that the file came from a VAX system. Given that the likelihood of someone actually requesting this stuff is very low, I am happy for this to be given a very low priority.

     
  • John

    John - 2008-10-23
    • priority: 5 --> 2
     
  • Justin Waddell

    Justin Waddell - 2009-11-06

    I don't think this file can be considered plaintext. I'm currently trialling a new charset detection library, and it identifies the file as UTF-32, but this is also incorrect, as decoding with UTF-32 produces output containing non-plaintext characters. The Linux 'file' command identifies the file as 'data' (i.e. binary).
    Printing out the contents of the file from a terminal results in mostly binary garbage.
    If we want to be able to preserve these files we will most likely need to produce a custom normaliser, which should be raised as a feature request.

     
  • Justin Waddell

    Justin Waddell - 2009-11-06
    • assigned_to: jwaddell --> mcarden
    • status: open-accepted --> open-wont-fix
     
  • Michael Carden

    Michael Carden - 2010-04-08
    • status: open-wont-fix --> closed-wont-fix
     
