RE: [Sparql4j-devel] RE: XML parsing
From: Seaborne, A. <and...@hp...> - 2006-01-06 16:34:05
-------- Original Message --------
> From: Samppa Saarela <>
> Date: 5 January 2006 09:50

. . . .

> > An important value of SPARQl4j (for me) is that an application, which is
> > not very RDF aware, is able to get information out of a remote RDF
> > repository by SPARQL query.
>
> However, relying on an external RDF API (or InputStream) to handle certain
> kinds of queries makes it highly RDF dependent and thus ultimately requires
> that the user is not only aware of RDF but also aware of the
> (configuration-dependent) RDF toolkit.
>
> The design I'm proposing aims
>
> 1) at providing as-natural-as-it-gets JDBC (row/column-based) behaviour
> in all cases,
>
> 2) at not blocking more RDF/SPARQL-aware use cases / applications,
>
> 3) at making the decision between the desired approaches explicit (i.e. an
> application that is aware of the result forms may access the results
> directly by being aware of the SPARQL-specific JDBC API extensions),
>
> 4) at providing the user all the necessary information and control (e.g.
> the ability to define accept preferences when executing a query and access
> to the actual content type of the result) needed to process the results
> directly, and
>
> 5) at providing factory-based / RDF-toolkit-dependent getObject() accessors
> (only) as convenience accessors for RDF-aware applications.
>
> > JDBC is not a good match to getting RDF graphs (as we are finding!) and
> > choosing one processing model (streams of something) over another makes
> > assumptions about the nature and approach of the toolkit.
>
> That's why I'd try to avoid any direct/required dependency on the
> factory, providing it only (and possibly optionally) for convenience to
> more RDF-aware users. In my opinion the triple-per-row* layout is the
> as-good-as-it-gets alternative for row-based (i.e. JDBC-style) handling
> of RDF. It actually resembles one of RDF's serialization forms, namely
> N-Triples.
> Also the W3C's RDF Validator provides a tabular form of the parsed
> graph, which I have found quite useful.
>
> *) triple-per-row:
>
> // JDBC-specific access exposed via ResultSetMetaData
> String subject = rs.getString("subject");
> int subjectType = rs.getInt("subject$type"); // URI | BLANK_NODE
> String predicate = rs.getString("predicate");
> String object = rs.getString("object");
> int objectType = rs.getInt("object$type");
> // URI | BLANK_NODE | PLAIN_LITERAL | TYPED_LITERAL
> String lang = rs.getString("object$lang");
> String datatype = rs.getString("object$datatype");

That looks just like

    SELECT ?subject ?predicate ?object { ... pattern ... }

+ the accessors for type and lang.

> // RDF-toolkit-specific convenience/hidden access
> Resource subject = (Resource)rs.getObject("subject");
> Predicate predicate = (Predicate)rs.getObject("predicate");
> RDFNode object = (RDFNode)rs.getObject("object");
>
> Note that the JDBC-specific access can easily be used to provide
> configuration-independent (i.e. factory-independent) access to any
> RDF-toolkit-specific objects. It's also a far more robust way of
> accessing the toolkit-specific resources than the factory approach,
> which would end up in ClassCastExceptions if the configuration changed.
> Actually the factory approach could easily be replaced with simple (and
> robust) ResultSet wrappers / handlers.
>
> > If the application wants to do listStatements AND wants triples in the
> > local toolkit format, then the value of a JDBC interface (which is
> > row-oriented) is pretty slim. So I do not see a high value for
> > SPARQl4j as a general connection over the SPARQL protocol in the
> > initial releases.
>
> Triple-per-row, on the other hand, offers something useful even for a
> not so RDF/SPARQL/toolkit-aware application.
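[Editor's note] The triple-per-row layout and the "ResultSet wrappers / handlers" remark above can be sketched as below. This is a minimal illustration, not sparql4j code: the `TripleHandler` interface, the type-code constants and all names are hypothetical, and in a real driver the dispatch would sit inside a loop over `java.sql.ResultSet.next()` reading the columns shown in the email.

```java
public class TripleRowSketch {

    // Type codes corresponding to the subject$type / object$type columns
    // proposed in the email (values are illustrative).
    static final int URI = 1;
    static final int BLANK_NODE = 2;
    static final int PLAIN_LITERAL = 3;
    static final int TYPED_LITERAL = 4;

    /** Application-supplied callback: one call per result row, plain
     *  strings plus type codes, no toolkit factory involved. */
    interface TripleHandler {
        void triple(String subject, int subjectType, String predicate,
                    String object, int objectType, String lang, String datatype);
    }

    /** Render one row in N-Triples-like syntax - the serialization the
     *  triple-per-row layout resembles. */
    static String toNTriples(String s, int sType, String p,
                             String o, int oType, String lang, String datatype) {
        String subj = (sType == BLANK_NODE) ? "_:" + s : "<" + s + ">";
        String obj;
        if (oType == URI) {
            obj = "<" + o + ">";
        } else if (oType == BLANK_NODE) {
            obj = "_:" + o;
        } else {
            obj = "\"" + o + "\"";
            if (lang != null && !lang.isEmpty()) obj += "@" + lang;
            if (oType == TYPED_LITERAL && datatype != null) obj += "^^<" + datatype + ">";
        }
        return subj + " <" + p + "> " + obj + " .";
    }

    public static void main(String[] args) {
        // In the real driver this call would be made once per ResultSet row.
        TripleHandler printer = (s, st, p, o, ot, lang, dt) ->
                System.out.println(toNTriples(s, st, p, o, ot, lang, dt));
        printer.triple("http://example.org/a", URI, "http://example.org/name",
                       "Alice", PLAIN_LITERAL, "en", null);
    }
}
```

Because the handler sees only strings and type codes, swapping the RDF toolkit never produces ClassCastExceptions; a toolkit-specific wrapper can still be layered on top where convenient.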
>
> The value of providing only SPARQL result-form parsing is also pretty
> slim - I have actually already implemented it (just not yet committed
> it into CVS). As parsing RDF/XML (and/or N3) is much more difficult,
> providing just that in a toolkit-independent way would actually provide
> extra value.
>
> BTW isn't it a bit contrary to the open-world view of the Semantic Web,
> when you argued in your previous email that a model isn't usable unless
> all of its statements are known?

No, not at all. Entailment is defined on models. Open world says there
may be other statements out there in other models. An application
displaying the results of a query wants to know when it has seen all the
results for its query on its chosen data source. e.g. listing people by
name makes an ordering assumption. When you see the "K"'s, there are no
more "J"'s in this result.

> > It would seem likely that every RDF toolkit will have a built-in
> > SPARQL client, so if the application is doing RDF processing, it is
> > much better to use that than trying to fit around the JDBC
> > row-oriented paradigm. It's pretty easy to write (that part of ARQ is
> > quite small - some rather tedious HTTP connection bashing).
>
> So what's the point of sparql4j then?

SELECT queries for applications that wish to access RDF information. For
example, an RDF repository as the core of a 3-tier web site. No need for
the business logic to have an RDF toolkit if it just wants to get
people's names.

If an application is going to do RDF processing on graphs, it would not
want to use JDBC's row-by-row view. Why not just get the constructed
graph in whatever way its toolkit wants? Because there is a paradigm
shift, the value of JDBC for moving graphs around seems limited to me.

But SELECT queries, to get information out of RDF graphs, are the most
important kind of query. Feed these into JDBC environments in the JDBC
paradigm and we may even be able to reuse JDBC client tools.
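[Editor's note] On the SELECT result-form parsing discussed above: the fixed-schema SPARQL Query Results XML Format can be parsed with nothing but the JDK's DOM support. The sketch below is an editor's illustration of that point, not the implementation Samppa mentions; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SelectResultsSketch {
    // Namespace of the DAWG SPARQL Query Results XML Format.
    static final String NS = "http://www.w3.org/2005/sparql-results#";

    /** Parse a SPARQL Query Results XML document into one
     *  variable-to-value map per result row. */
    static List<Map<String, String>> parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> rows = new ArrayList<>();
            NodeList results = doc.getElementsByTagNameNS(NS, "result");
            for (int i = 0; i < results.getLength(); i++) {
                Map<String, String> row = new LinkedHashMap<>();
                NodeList bindings = ((Element) results.item(i))
                        .getElementsByTagNameNS(NS, "binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element b = (Element) bindings.item(j);
                    // Each <binding name="var"> wraps a <uri>, <bnode>
                    // or <literal> element; take its text value.
                    row.put(b.getAttribute("name"), b.getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new RuntimeException("bad results document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sparql xmlns='" + NS + "'>"
                + "<head><variable name='name'/></head>"
                + "<results><result><binding name='name'>"
                + "<literal>Alice</literal></binding></result></results>"
                + "</sparql>";
        System.out.println(parse(xml));
    }
}
```

A row map per result is exactly the shape a JDBC ResultSet row needs, which is why this much is cheap to provide in a toolkit-independent way (it deliberately drops the uri/bnode/literal distinction; a real driver would surface that, e.g. via the `$type` columns discussed earlier).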
>
> > Also, given triples don't come back in any particular order from
> > CONSTRUCT, I find it hard to see many processing models that can
> > achieve streamability. Maybe you could sketch your use case in a
> > little more detail? It's that bit I'm puzzled by.
>
> Firstly, there's a difference between building an application-specific
> (domain) model from a stream (of triples) and first building a generic
> RDF model and only then building the actual target model.
>
> Secondly, even though in the general case the order of triples isn't
> guaranteed, it's quite common to group the statements by subject.

I disagree - you're relying on the server having processed the entire
model to make the statements come out nicely. Jena uses hashes all over
the place - things do not come out in any sort of order, nor is it
consistent.

> In case the construct matches are streamed directly, one could assume
> that the triples of a single template match would be somehow grouped.
> Hardly in any case is the order of returned triples fully random.
>
> The simplest and most obvious use case is to visualize the triples
> returned in a tabular form directly. Many GUI/WUI table widgets provide
> sorting of the rows by columns.
>
> > [[And there aren't any told bnodes in general (but in ARQ you can get
> > them by setting the right config options :-) Not sure who will
> > support told bNodes. 3Store maybe.]]
>
> There also isn't a way of requiring that a certain type of resource
> should be a URI...
>
> I used to build (when using Jena) RDQL queries programmatically using
> the query API directly, since that way one can (or at least could) also
> use bnodes in queries, thus avoiding failing queries in case bnodes
> were used. Of course I was told that I shouldn't do this and should use
> model/resource accessors instead... however, when using an RDB model
> with fast path, I was able to achieve magnitudes better performance
> this way.
> If I recall right, I even used (at least at some point)
> ResultBinding#getTriples() to process the results.

But it would not have worked with Joseki :-) bNodes are not preserved
across the network. You might like to see the ARQ configuration options
I just put in so that bnode ids are passed transparently across the
network. It's not the default mode (it makes the XML results look ugly).

> > The key is that it minimises the requirements on the client. If we
> > assume there is a complete RDF system in the client, why force
> > ourselves through the JDBC paradigm when we could just as easily have
> > a SPARQL-protocol-specific API? The value of SPARQL4j to me is to
> > connect to applications that don't want a full RDF processing system
> > but do want to get some information out of an RDF repository.
>
> Exactly(!) and in my opinion this should also apply to the
> CONSTRUCT/DESCRIBE queries. Such an application could do hardly
> anything with a byte stream of RDF/XML, not to mention N3.
>
> > > A graph may not be, but triples are also usable as such.
> > >
> > > Also I find the stream-based access to the results quite usable
> > > regardless of the result form - at least if it's XML and not N3
> > > (e.g. XSLT).
> >
> > That is XSLT on the XML results?
>
> Yes.

So it's a SELECT query.

> > If you mean RDF/XML, the instability of the encoding is why DAWG had
> > to do a fixed-schema XML results format.
>
> All the more reason why we should also provide RDF parsing :-)
> Succeeding to do this in a toolkit-independent way might also be useful
> to anyone building toolkits...

Then the objective of the project is now providing another API to RDF
(wrapping all toolkits is just like a new toolkit that uses the others
as its implementation).
I have a toolkit, and a SPARQL-protocol interface that (I think) is
easier to work with than warping to the JDBC paradigm and warping back
again - it's a cognitive burden on the app writer. I wanted an approach
of doing something clear-cut, minimal and distinct, with a clear value.
But this now seems to be growing into a general-purpose RDF framework.
Can we find some limits please?

	Andy

> > > Perhaps we should discuss and document what kind of use cases we
> > > want to support with sparql4j?
> >
> > Cool - good idea.
>
> Let's start a separate thread for this and copy-paste the results into
> the document :-)
>
> -Samppa
>
> --
> Samppa Saarela <samppa.saarela at profium.com>
> Profium, Lars Sonckin kaari 12, 02600 Espoo, Finland
> Tel. +358 (0)9 855 98 000  Fax. +358 (0)9 855 98 002
> Mob. +358 (0)41 515 1412
> Internet: http://www.profium.com