Non ASCII characters in string literals are incorrectly encoded in the result

Status: Beta

Brought to you by: andreas_schultz, bizer, ppetrovski, robertisele

#5 Non ASCII characters in string literals are incorrectly encoded in the result

Milestone: v1.0_(example)

Status: open

Owner: nobody

Labels: None

Priority: 5

Updated: 2014-08-18

Created: 2014-04-01

Creator: Vuk Mijovic

Private: No

This was detected on data containing Serbian letters. For example, the string "Adica, turističko društvo" was encoded as "Adica, turisti\u00C4\u008Dko dru\u00C5\u00A1tvo" in the result. The attachment contains the data we used:

1. input.ttl: input file
2. mappings.ttl: mappings
3. vocabulary.txt: target vocabulary
4. output.ttl: output that was produced

1 Attachments

bug.zip

Discussion

Andreas Schultz - 2014-04-03

It looks like you used the N-Triples output, which always escapes non-ASCII characters. You can try the Turtle output instead (the TurtleOutput class).

And I think by "incorrectly" you probably meant "unnecessarily", since the output file would still be parsed correctly.
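
To illustrate what escaping of non-ASCII characters is expected to look like in N-Triples-style output, here is a minimal sketch in Java. It is not R2R's code: the class and method names are hypothetical, and it assumes that every character outside printable ASCII is written as \uXXXX (or \UXXXXXXXX above U+FFFF), computed from the Unicode code point rather than from the bytes of any particular encoding.

```java
class NTriplesEscapeSketch {

    // Hypothetical helper, not part of R2R: escapes a literal's lexical form
    // by code point. Printable ASCII is copied through, backslash and double
    // quote get a backslash prefix, everything else becomes a \u or \U escape.
    static String escape(String literal) {
        StringBuilder out = new StringBuilder();
        literal.codePoints().forEach(cp -> {
            if (cp == '\\') {
                out.append("\\\\");
            } else if (cp == '"') {
                out.append("\\\"");
            } else if (cp >= 0x20 && cp <= 0x7E) {
                out.appendCodePoint(cp);
            } else if (cp <= 0xFFFF) {
                out.append(String.format("\\u%04X", cp));
            } else {
                out.append(String.format("\\U%08X", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The input is written with Java Unicode escapes so that the example
        // does not depend on the encoding of this source file.
        String input = "Adica, turisti\u010Dko dru\u0161tvo";
        System.out.println(escape(input));
        // prints: Adica, turisti\u010Dko dru\u0161tvo
    }
}
```

Under this assumption each of the two Serbian letters yields exactly one escape, so an output with two escapes per letter points to a problem before the escaping step.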

Vuk Mijovic - 2014-04-04

OK, I will try the Turtle output, but this still seems to be a bug. We loaded the output into Virtuoso, and it decoded "Adica, turisti\u00C4\u008Dko dru\u00C5\u00A1tvo" as "Adica, turistiÄko druÅ¡tvo" rather than the expected "Adica, turističko društvo". http://www.branah.com/unicode-converter decodes the string produced by r2r exactly as Virtuoso does, so Virtuoso does not appear to be decoding it wrongly.

According to both Virtuoso and http://www.branah.com/unicode-converter, the string should have been encoded as "Adica, turisti\u010dko dru\u0161tvo" in order to be decoded as "Adica, turističko društvo". (Wikipedia also agrees that the code points for č and š are U+010D and U+0161.)
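
The reported escapes match the UTF-8 bytes of the two letters (C4 8D for č, C5 A1 for š), each written as a separate character. One possible explanation, which is an assumption and not something confirmed in this ticket, is that the UTF-8 input was read with a single-byte charset such as ISO-8859-1 before being escaped. The Java sketch below checks that hypothesis by reversing it; the class and method names are hypothetical and it does not use any R2R code.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubleEncodingCheck {

    // Matches a literal backslash, 'u', and four hex digits.
    private static final Pattern U_ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces each \uXXXX escape with the character it names.
    static String unescape(String escaped) {
        Matcher m = U_ESCAPE.matcher(escaped);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(escaped, last, m.start());
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(escaped, last, escaped.length());
        return out.toString();
    }

    // Assumed failure mode: UTF-8 bytes were decoded as ISO-8859-1.
    // To undo it, map each char back to one byte and decode those as UTF-8.
    static String repair(String misdecoded) {
        byte[] bytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The string from the reported output, with literal \u escapes.
        String reported = "Adica, turisti\\u00C4\\u008Dko dru\\u00C5\\u00A1tvo";
        String repaired = repair(unescape(reported));
        // Compare against the expected text (U+010D and U+0161).
        System.out.println(repaired.equals("Adica, turisti\u010Dko dru\u0161tvo"));
        // prints: true
    }
}
```

If the check holds, the escaping step itself behaved consistently and the characters were already wrong when they reached it, for example because a reader used the platform default charset instead of UTF-8.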

Andreas Schultz - 2014-04-04

Ah, I see. I just tested your mapping file against your input and got the attached output, which seems to be encoded correctly. I used the current trunk version. Which version are you using?

example1_output.nt

Vuk Mijovic - 2014-04-07

This is 0.2.3. It was built from source.

Andreas Schultz - 2014-04-30

Ok, I also tested with the 0.2.3 version, which gives me the same (correct) result as in my previous post. The code to run your mappings is attached (mapping and data files are renamed).

Can you post the code you used to execute the mappings?

ExampleNew.java
