Xena: digital preservation application

Tracker: Bugs

2 Document encoded in GB18030 causes exception - ID: 2188076
Last Update: Settings changed ( jwaddell )

Whilst looking at the legacy files which have been converted into 'modern'
formats, I came across several plaintext documents encoded in GB18030.

Xena spat the dummy when trying to normalise them. Console output below...

Not sure if this is a bug or a feature request, but it would be nice to be
able to handle documents with a non-standard character set.

------------------------------------------------
FINEST: XIS file:/T:/Legacy%20-%20format%20converted/C379P1/PHASE_3/MEDIA/053/379P153A.03D guessed as type PlainText
22/10/2008 09:48:28 au.gov.naa.digipres.xena.kernel.guesser.GuesserManager getBestGuess
FINER: Exception thrown in guesser NonStandardPlainTextGuesser
au.gov.naa.digipres.xena.kernel.XenaException: java.io.UnsupportedEncodingException: GB18030
    at au.gov.naa.digipres.xena.plugin.plaintext.NonStandardPlainTextGuesser.guess(NonStandardPlainTextGuesser.java:98)
    at au.gov.naa.digipres.xena.kernel.guesser.GuesserManager.getBestGuess(GuesserManager.java:358)
    at au.gov.naa.digipres.xena.kernel.guesser.GuesserManager.mostLikelyType(GuesserManager.java:262)
    at au.gov.naa.digipres.xena.core.Xena.getMostLikelyType(Xena.java:258)
    at au.gov.naa.digipres.xena.core.Xena.getMostLikelyType(Xena.java:243)
    at au.gov.naa.digipres.xena.litegui.NormalisationThread.setTypes(NormalisationThread.java:419)
    at au.gov.naa.digipres.xena.litegui.NormalisationThread.normaliseStandard(NormalisationThread.java:196)
    at au.gov.naa.digipres.xena.litegui.NormalisationThread.run(NormalisationThread.java:144)
Caused by: java.io.UnsupportedEncodingException: GB18030
    at sun.nio.cs.StreamDecoder.forInputStreamReader(Unknown Source)
    at java.io.InputStreamReader.<init>(Unknown Source)
    at au.gov.naa.digipres.xena.plugin.plaintext.NonStandardPlainTextGuesser.guess(NonStandardPlainTextGuesser.java:91)
    ... 7 more


Submitted: John ( vombatus ) - 2008-10-23 00:41
Priority: 2
Status: Open
Resolution: Wont Fix
Assigned: Michael Carden
Category: None
Group: None
Visibility: Public


Comments ( 4 )




Date: 2009-11-06 03:05
Sender: jwaddell

I don't think this file can be considered plaintext. I'm currently
trialling a new charset detection library, and it identifies the file as
UTF-32, but this is also incorrect, as decoding it with UTF-32 produces
output containing non-plaintext characters. The Linux 'file' command
identifies the file as 'data' (i.e. binary), and printing its contents in a
terminal results in mostly binary garbage.
If we want to be able to preserve these files, we will most likely need to
produce a custom normaliser, which should be a feature request.


Date: 2008-10-23 02:32
Sender: vombatus

From the console output, I assume that the file was guessed as Non-Standard
plaintext. It was wrapped as binary. I only ran the file through Xena, not
DPR.

At a rough guess, I think that the file came from a VAX system. Given that
the likelihood of someone actually requesting this stuff is very low, I am
happy for this to be given a very low priority.


Date: 2008-10-23 01:25
Sender: jwaddell

On my machine Xena identifies the encoding as GB18030 and reads a set of
characters using this encoding. Because some of the characters read are
non-text characters, Xena determines that it is not a valid GB18030 file and
therefore not actually a plain text file (and normalises it with Binary).
GB18030 is a Simplified Chinese encoding, which is most likely not the
actual encoding. It is possible that the encoding detection library does not
know about the encoding used in this file and thus its "wild stab in the
dark" of GB18030 is wrong. Any ideas on what the encoding actually is?
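
The check described above (decode a sample with the guessed encoding, then reject the file as plain text if non-text characters turn up) could be sketched roughly as follows. This is only an illustration and not Xena's actual code: the class name PlainTextCheck, the method looksLikeText, and the rule that any control character other than common whitespace counts as "non-text" are all assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xena.
final class PlainTextCheck {

    /** True if the sample decodes cleanly in the charset and holds no non-text control characters. */
    static boolean looksLikeText(byte[] sample, Charset charset) {
        CharBuffer decoded;
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            decoded = charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(sample));
        } catch (CharacterCodingException e) {
            return false; // the bytes are not valid in this encoding
        }
        for (int i = 0; i < decoded.length(); i++) {
            char c = decoded.charAt(i);
            boolean whitespace = c == '\n' || c == '\r' || c == '\t' || c == '\f';
            // Assumed rule: control characters other than common whitespace are "non-text".
            if (Character.isISOControl(c) && !whitespace) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "plain text\n".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {0x48, 0x00, 0x01, 0x49};
        System.out.println("text sample:   " + looksLikeText(text, StandardCharsets.UTF_8));
        System.out.println("binary sample: " + looksLikeText(binary, StandardCharsets.UTF_8));
    }
}
```

A file that fails such a check for the guessed encoding would then be handed to the Binary normaliser, as described in the comment.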


Date: 2008-10-23 01:15
Sender: jwaddell

I looked at the list of supported encodings in Java and GB18030 should be
supported, so this looks like a bug. I'll look into it.
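
Documented support does not guarantee that a particular installed runtime provides a charset, since only a small set of standard charsets is required of every Java implementation. A quick way to check the runtime in question is sketched below. The class name CharsetAvailability and the helper charsetOrFallback are made up for this example, and ISO-8859-1 is an arbitrary fallback choice.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xena.
final class CharsetAvailability {

    /** Returns the named charset if this JVM provides it, otherwise the given fallback. */
    static Charset charsetOrFallback(String name, Charset fallback) {
        // Charset.isSupported reports what this runtime actually ships.
        return Charset.isSupported(name) ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        System.out.println("GB18030 supported: " + Charset.isSupported("GB18030"));
        System.out.println("Using: " + charsetOrFallback("GB18030", StandardCharsets.ISO_8859_1));
    }
}
```

Running this on the machine that produced the console output would show whether the UnsupportedEncodingException comes from the runtime itself rather than from Xena.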

However the exception you listed here occurred during the guessing phase,
and Xena should have normalised the file using Binary... is this what
happened, or did it guess the file as something random?
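
For reference, the expected behaviour (an exception in one guesser is logged and skipped, and the file falls back to Binary if nothing else claims it) might look something like the sketch below. This is not Xena's GuesserManager: the Guesser interface, the integer score, and the "Binary" default are simplifications assumed for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Xena's GuesserManager.
final class GuessWithFallback {

    /** Assumed guesser contract: returns a confidence score, or throws on failure. */
    interface Guesser {
        String typeName();
        int guess(byte[] data) throws Exception;
    }

    static final String BINARY = "Binary";

    /** Returns the highest-scoring type, or Binary if no guesser succeeds with a positive score. */
    static String bestGuess(List<Guesser> guessers, byte[] data) {
        String best = BINARY;
        int bestScore = 0;
        for (Guesser g : guessers) {
            try {
                int score = g.guess(data);
                if (score > bestScore) {
                    bestScore = score;
                    best = g.typeName();
                }
            } catch (Exception e) {
                // A failing guesser is logged and skipped; it does not abort the whole guess.
                System.err.println("Exception thrown in guesser " + g.typeName() + ": " + e);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Guesser failing = new Guesser() {
            public String typeName() { return "NonStandardPlainText"; }
            public int guess(byte[] data) throws Exception {
                throw new UnsupportedEncodingException("GB18030");
            }
        };
        System.out.println(bestGuess(Arrays.asList(failing), new byte[] {0x00, 0x01}));
    }
}
```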






Attached File ( 1 )

Filename      Description
379P153A.040  File encoded in GB18030

Changes ( 6 )

Field          Old Value             Date              By
assigned_to    jwaddell              2009-11-06 03:05  jwaddell
resolution_id  Accepted              2009-11-06 03:05  jwaddell
priority       5                     2008-10-23 02:32  vombatus
resolution_id  None                  2008-10-23 01:15  jwaddell
assigned_to    nobody                2008-10-23 01:15  jwaddell
File Added     298491: 379P153A.040  2008-10-23 00:41  vombatus