#256 Document encoded in GB18030 causes exception

Status: closed-wont-fix

Owner: Michael Carden

Labels: None

Priority: 2

Updated: 2010-04-08

Created: 2008-10-23

Creator: John

Private: No

Whilst looking at the legacy files which have been converted into 'modern' formats, I came across several plaintext documents encoded in GB18030.

Xena spat the dummy when trying to normalise them. Console output below...

I'm not sure whether this is a bug or a feature request, but it would be nice to be able to handle documents with a non-standard character set.

------------------------------------------------
FINEST: XIS file:/T:/Legacy%20-%20format%20converted/C379P1/PHASE_3/MEDIA/053/379P153A.03D guessed as type PlainText
22/10/2008 09:48:28 au.gov.naa.digipres.xena.kernel.guesser.GuesserManager getBestGuess
FINER: Exception thrown in guesser NonStandardPlainTextGuesser
au.gov.naa.digipres.xena.kernel.XenaException: java.io.UnsupportedEncodingException: GB18030
at au.gov.naa.digipres.xena.plugin.plaintext.NonStandardPlainTextGuesser.guess(NonStandardPlainTextGuesser.java:98)
at au.gov.naa.digipres.xena.kernel.guesser.GuesserManager.getBestGuess(GuesserManager.java:358)
at au.gov.naa.digipres.xena.kernel.guesser.GuesserManager.mostLikelyType(GuesserManager.java:262)
at au.gov.naa.digipres.xena.core.Xena.getMostLikelyType(Xena.java:258)
at au.gov.naa.digipres.xena.core.Xena.getMostLikelyType(Xena.java:243)
at au.gov.naa.digipres.xena.litegui.NormalisationThread.setTypes(NormalisationThread.java:419)
at au.gov.naa.digipres.xena.litegui.NormalisationThread.normaliseStandard(NormalisationThread.java:196)
at au.gov.naa.digipres.xena.litegui.NormalisationThread.run(NormalisationThread.java:144)
Caused by: java.io.UnsupportedEncodingException: GB18030
at sun.nio.cs.StreamDecoder.forInputStreamReader(Unknown Source)
at java.io.InputStreamReader.<init>(Unknown Source)
at au.gov.naa.digipres.xena.plugin.plaintext.NonStandardPlainTextGuesser.guess(NonStandardPlainTextGuesser.java:91)
... 7 more

Discussion

John - 2008-10-23

File encoded in GB18030

379P153A.040

Justin Waddell - 2008-10-23

I looked at the list of supported encodings in Java and GB18030 should be supported, so this looks like a bug. I'll look into it.
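
Whether a given JVM actually provides an encoding can be checked at runtime. The following is a minimal sketch, not code from Xena: the class and method names are hypothetical, and it relies only on the standard `java.nio.charset.Charset.isSupported` call. The result may differ between Java installations, so it is worth running on the machine that produced the exception.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper for illustration only; not part of Xena.
class CharsetAvailability {

    // Returns true if the running JVM can decode the named charset.
    // A syntactically illegal name is treated as "not available".
    static boolean isAvailable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The answer for GB18030 depends on the installed Java runtime.
        for (String name : new String[] {"GB18030", "UTF-8"}) {
            System.out.println(name + " available: " + isAvailable(name));
        }
    }
}
```

If `GB18030` is reported as unavailable on the affected machine, that would point at the Java installation rather than at the guesser itself.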

However, the exception you listed here occurred during the guessing phase, and Xena should have normalised the file using Binary. Is this what happened, or did it guess the file as something random?

Justin Waddell - 2008-10-23

assigned_to: nobody --> jwaddell

status: open --> open-accepted

Justin Waddell - 2008-10-23

On my machine Xena identifies the encoding as GB18030 and reads a set of characters using this encoding. Because some of the characters read are non-text characters, Xena determines that it is not a valid GB18030 file, and therefore not actually a plain text file, and normalises it with Binary. GB18030 is actually a Simplified Chinese encoding, which is most likely not the actual encoding of this file. It is possible that the encoding detection library does not know about the encoding used in this file, and thus its "wild stab in the dark" of GB18030 is wrong. Any ideas on what the encoding actually is?
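
The kind of check described above (decode with the guessed encoding, then reject the file if non-text characters turn up) can be sketched roughly as follows. This is an illustration under stated assumptions, not Xena's actual guesser code: the class name, the method name, and the rule "any control character other than tab, CR or LF means not plain text" are all hypothetical choices.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical sketch for illustration only; not Xena's implementation.
class PlainTextCheck {

    // Decodes the bytes strictly with the named charset and reports whether
    // the result looks like plain text. Charset.forName throws an unchecked
    // UnsupportedCharsetException if the JVM does not provide the charset.
    static boolean looksLikePlainText(byte[] data, String charsetName) {
        CharBuffer chars;
        try {
            chars = Charset.forName(charsetName)
                    .newDecoder()
                    // Fail on invalid byte sequences instead of substituting.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
        } catch (CharacterCodingException e) {
            return false; // bytes are not valid in this encoding
        }
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            // Assumed rule: control characters other than tab, CR and LF
            // indicate binary content.
            if (Character.isISOControl(c) && c != '\n' && c != '\r' && c != '\t') {
                return false;
            }
        }
        return true;
    }
}
```

A file that fails a check like this for the detected encoding would fall through to binary handling, which matches the behaviour reported above.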

John - 2008-10-23

From the console output, I assume that the file was guessed as Non-Standard plaintext. It was wrapped as binary. I only ran the file through Xena, not DPR.

At a rough guess, I think that the file came from a VAX system. Given that the likelihood of someone actually requesting this stuff is very low, I am happy for this to be given a very low priority.

John - 2008-10-23

priority: 5 --> 2

Justin Waddell - 2009-11-06

I don't think this file can be considered plaintext. I'm currently trialling a new charset detection library, and it identifies the file as UTF-32, but this is also incorrect, as decoding with this encoding produces output with non-plaintext characters. The Linux 'file' command identifies the file as 'data' (i.e. binary).
Printing out the contents of the file from a terminal results in mostly binary garbage.
If we want to be able to preserve these files we will most likely need to produce a custom normaliser, which should be a feature request.

Justin Waddell - 2009-11-06

assigned_to: jwaddell --> mcarden

status: open-accepted --> open-wont-fix

Michael Carden - 2010-04-08

status: open-wont-fix --> closed-wont-fix