From: Rob V. <rv...@do...> - 2014-02-25 10:23:14
Tom,

Yes, the original NTriples and NQuads specifications only allow ASCII. This was by design, to make those formats canonical (since with UTF-8 you can potentially encode complex characters in multiple ways) and to facilitate reliable data exchange between systems that didn't necessarily support non-ASCII data.

Btw, the reader only enforces ASCII encoding if you pass a filename (i.e. when it deals with opening the file stream itself). If you pass in a pre-opened StreamReader that uses a different encoding (e.g. UTF-8) it may still parse successfully, though the exact behaviour is hard to predict. It will issue a warning about the incorrect encoding (via the Warning event) and it may error out on some native UTF-8 data, since the tokeniser is not written to expect native UTF-8.

The RDF 1.1 working group has published proposed recommendations which standardise NQuads and NTriples, and part of that standardisation is to change the encoding to UTF-8, but I haven't had a chance to update dotNetRDF to support the updated specs yet. Since this is a breaking change to both the spec and current API behaviour, the existing tokenisers and parsers would need to be modified so that they can support either the old or the new specification. Ideally we would take an approach similar to how we updated Turtle support: implement the new specifications, have the parsers default to the new spec mode, and have the writers support the new spec but default to producing old-spec output. This is Postel's law in action, if you're wondering why it is done this way.

There are issues filed for these upgrades but I haven't had time to implement them yet. I was considering trying to get these into the next release anyway, and I have some time to start on this at the end of the week, unless you want to attempt it yourself.
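As an aside on the encoding point above, here is a minimal Python sketch of two things: how the same visible text can end up as different UTF-8 byte sequences, and how the original ASCII-only NTriples syntax writes non-ASCII characters as \u and \U escapes. This is illustration only and not dotNetRDF code. `escape_ntriples_ascii` is a hypothetical helper name, and the sketch assumes the escaping rules of the original NTriples grammar (printable ASCII written as-is, everything else escaped).

```python
import unicodedata


def escape_ntriples_ascii(text: str) -> str:
    """Escape a literal's lexical form so that the result is pure ASCII.

    Hypothetical helper for illustration; assumes the original
    (pre RDF 1.1) NTriples escaping rules.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "\\":
            out.append("\\\\")
        elif ch == '"':
            out.append('\\"')
        elif ch == "\n":
            out.append("\\n")
        elif ch == "\r":
            out.append("\\r")
        elif ch == "\t":
            out.append("\\t")
        elif 0x20 <= cp <= 0x7E:
            # Printable ASCII is written unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            # Other control characters and BMP characters: \uXXXX
            out.append("\\u%04X" % cp)
        else:
            # Characters outside the BMP: \UXXXXXXXX
            out.append("\\U%08X" % cp)
    return "".join(out)


composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# Same text on screen, different bytes on the wire.
print(composed.encode("utf-8"))
print(decomposed.encode("utf-8"))

# Unicode normalisation maps the decomposed form onto the composed one.
print(unicodedata.normalize("NFC", decomposed) == composed)

# ASCII-only serialisation of the composed form.
print(escape_ntriples_ascii(composed))
```

The two `encode` calls print different byte strings for what a reader would see as the same word, which is the kind of ambiguity mentioned above. The escaped form contains only ASCII characters, so it survives systems that cannot handle anything else.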
See CORE-356 (http://dotnetrdf.org/tracker/Issues/IssueDetail.aspx?id=356) for NQuads and CORE-355 (http://dotnetrdf.org/tracker/Issues/IssueDetail.aspx?id=355) for NTriples, both of which include links to the updated specifications; see the comments for the most up-to-date spec links.

Hope this clarifies things,

Cheers,

Rob

On 25/02/2014 10:06, "Tomasz Pluskiewicz" <tom...@gm...> wrote:

>Hi Rob
>
>A colleague of mine has just discovered that the NQuadsParser reads
>files with ASCII encoding while all the others use UTF-8.
>
>I understand that this is as described in the specification but why is
>that exactly?
>
>And what do you think about adding an option to the parsers so that
>alternative encodings can be used for reading dataset files?
>
>Cheers,
>Tom