DBpedia - Wikipedia Data Extraction / Tracker / #70 Invalid N-Triple Output

#70 Invalid N-Triple Output

Milestone: Serialization

Status: open-accepted

Owner: Christopher Sahnwaldt

Labels: Bug (92)

Priority: 5

Updated: 2012-03-18

Created: 2011-07-04

Creator: Alex

Private: No

Hi,

The extraction framework generates invalid N-Triples output files. It escapes non-ASCII characters in literals but fails to do so in IRIs, so any .nt file that contains an IRI with non-ASCII characters is invalid. This is fine for Virtuoso, which doesn't seem to validate N-Triples files during import, but other triple stores such as AllegroGraph, and RDF frameworks and tools such as Raptor and Python's RDFLib, check rigorously and fail when non-ASCII characters are present in N-Triples files.
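
To illustrate the kind of escaping meant here, the following is a minimal Python sketch, not the extraction framework's actual code. It assumes the ASCII-only N-Triples grammar, in which characters outside ASCII are written as \uXXXX or \UXXXXXXXX. The names `escape_non_ascii` and `iri_term` are hypothetical helpers made up for this example.

```python
def escape_non_ascii(value: str) -> str:
    """Replace every non-ASCII character with a \\uXXXX or \\UXXXXXXXX escape."""
    out = []
    for ch in value:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII is copied through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Basic Multilingual Plane: four hex digits.
            out.append("\\u{:04X}".format(cp))
        else:
            # Supplementary planes: eight hex digits.
            out.append("\\U{:08X}".format(cp))
    return "".join(out)


def iri_term(iri: str) -> str:
    """Wrap an IRI in angle brackets after escaping (hypothetical helper)."""
    return "<" + escape_non_ascii(iri) + ">"


if __name__ == "__main__":
    print(iri_term("http://dbpedia.org/resource/K\u00f6ln"))
```

A strict parser may still reject other characters that are not allowed in IRIs (spaces, angle brackets, quotes); this sketch only addresses the non-ASCII case described above.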

Looking through the code, namely org.dbpedia.extraction.destinations.Quad, I've seen that you're aware of the problem. Could you provide an estimate for a bug fix, or some tips on how I could patch it myself?

Kind Regards,
Alexandru

Discussion

Alex - 2011-07-06

OK, I did it myself, but it doesn't help: most tools still report syntax errors. I solved the problem by changing the output to the TriX format, which supports UTF-8 natively.

Alex - 2011-07-06

status: open --> closed

Christopher Sahnwaldt - 2012-03-17

milestone: --> 2702415

assigned_to: nobody --> jcsahnwaldt

labels: 973128 --> Bug

status: closed --> open-accepted

Christopher Sahnwaldt - 2012-03-17

Probably fixed. I'll check.

Christopher Sahnwaldt - 2012-03-18

milestone: 2702415 --> Serialization
