Re: [Sparql4j-devel] RE: XML parsing
From: Samppa S. <sam...@pr...> - 2006-01-05 09:52:03
>ResourceFactory (it creates them relative to a hidden model), but all
>operations, if passed a resource from one model, convert it to a
>resource in the model where the operation is to be performed
>automatically. Resources need to know where they come from so
>resource.getProperty() works.
>
>Otherwise, an application can quite easily work at the Jena Graph
>level. It is stable and more pure (less convenience) and triples are,
>well, triples (3-tuples).
>
>Which per-model caches are you referring to? (and all Jena's caches are
>just that - caches - things work if they get bypassed).

EnhGraph.enhNodes... Things work, but using the cache produces less
garbage, and creating resources directly in the target model is better
than having multiple caches (one internal to the factory and one in the
target model). Anyway, these are very subtle differences and the
important thing is that things work.

>Caching partial query results is very hard unless you can deduce one
>query is a sub-query of another.

True - and that's not what I meant. What I was really questioning is
whether it is really enough to have the factory defined at driver
level, or whether we need the ability to set/override the factory at
statement level, AND that the best alternative for this depends on the
toolkit used.

>>But there's no standard way in jdbc for the user to access this
>>information. If the user is provided with access to an InputStream of
>>the result, he needs to get access to the content type also.
>
>The driver would access the information and so know how to parse the
>incoming RDF graph. In fact, it needs a factory interface
>
>interface GraphFactory
>{
>    Object parse(InputStream, String httpContentTypeAndCharset) ;
>}
>
>then a CONSTRUCT returns a 1-row, 1-col result set: getObject() is the
>graph. getCharacterStream() or getBinaryStream() would give more direct
>access if needed but (see below) I don't see these are common.

I see. getCharacterStream and getBinaryStream are definitely better
alternatives than clob/blob. This approach may have the drawback that
the sequence in which the column-accessor methods are called makes the
result set behave differently: e.g. if getBinaryStream(1) is called
first, getObject(1) is no longer available, and vice versa (i.e. unless
the binary stream is cached). Also, this kind of dual-type column
cannot be defined via ResultSetMetaData, can it? If the same approach
were applied to SELECT/ASK results (i.e. accessing getBinaryStream(1)
on the first row would return the whole result as a stream, instead of
rows), the result would be even more confusing. A sketch of this
behaviour follows below.
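To make the drawback concrete, here's a minimal sketch of how a driver
might implement such a 1-row/1-col CONSTRUCT result set around the
GraphFactory interface quoted above. The class name, fields and the
consumed-flag policy are purely illustrative assumptions on my part,
not an agreed SPARQL4j design:

    import java.io.InputStream;
    import java.sql.SQLException;

    class ConstructResultSetSketch {

        // Mirrors the GraphFactory interface quoted above,
        // with parameter names added.
        interface GraphFactory {
            Object parse(InputStream in, String httpContentTypeAndCharset);
        }

        private final InputStream response;  // HTTP entity body
        private final String contentType;    // e.g. "application/rdf+xml; charset=utf-8"
        private final GraphFactory factory;  // configured at driver (or statement?) level
        private boolean consumed = false;

        ConstructResultSetSketch(InputStream response, String contentType,
                                 GraphFactory factory) {
            this.response = response;
            this.contentType = contentType;
            this.factory = factory;
        }

        // Column 1 as a toolkit-specific graph, parsed by the factory.
        public Object getObject(int column) throws SQLException {
            if (consumed) {
                // getBinaryStream(1) was called first: the stream is
                // gone unless the driver buffered it.
                throw new SQLException("result already consumed as a stream");
            }
            consumed = true;
            return factory.parse(response, contentType);
        }

        // Column 1 as the raw serialization; afterwards getObject(1) fails.
        public InputStream getBinaryStream(int column) throws SQLException {
            if (consumed) {
                throw new SQLException("result already consumed");
            }
            consumed = true;
            return response;
        }
    }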
>>>Could you give the use case you have in mind here? (why is it more
>>>convenient to have a stream of triples?)
>>
>>I frequently use the Model.listStatements variants - and have used
>>them in every RDF-based application I've ever made using Jena or SIR
>>;-) I wouldn't like the performance penalty nor the increased memory
>>requirements of having to read the results first into a model just for
>>iterating over them. One could also argue that every (reading) RDF
>>operation ultimately involves a stream/iteration of triples. Sure,
>>there are convenience accessors filtering the objects of the
>>statements, and select-type queries returning bindings, but these
>>operations in turn rely on statement iterations. [When building a
>>generic program that doesn't have full control of all input, the
>>select-query access is strictly speaking not usable if "told bnodes"
>>are not supported.]
>
>We need to go back to use cases and the role of SPARQL4j.

Yes :-)

>An important value of SPARQL4j (for me) is that an application which is
>not very RDF aware is able to get information out of a remote RDF
>repository by SPARQL query.

However, relying on an external RDF API (or InputStream) to handle
certain kinds of queries makes it highly RDF dependent, and thus
ultimately requires that the user is not only aware of RDF but also
aware of the (configuration-dependent) RDF toolkit. The design I'm
proposing aims at 1) providing as-natural-as-it-gets JDBC (row/column
based) behaviour in all cases, 2) not blocking more RDF/SPARQL-aware
use cases / applications, 3) making the decision between the desired
approaches explicit (i.e. an application that is aware of the result
forms may access the results directly by being aware of the
SPARQL-specific JDBC API extensions), 4) providing the user all the
necessary information and control (e.g. the ability to define accept
preferences when executing a query, and access to the actual content
type of the result) needed to process the results directly, and 5)
providing factory-based / RDF-toolkit-dependent getObject() accessors
(only) as convenience accessors for RDF-aware applications.

>JDBC is not a good match to getting RDF graphs (as we are finding!) and
>choosing one processing model (streams of something) over another makes
>assumptions about the nature and approach of the toolkit.

That's why I'd try to avoid any direct/required dependency on the
factory, providing it only (and possibly optionally) for convenience to
more RDF-aware users. In my opinion, triple-per-row* is the
as-good-as-it-gets alternative for row-based (i.e. JDBC-style) handling
of RDF. It actually resembles one of RDF's serialization forms, namely
N-TRIPLES. Also, the W3C's RDF Validator provides a tabular form of the
parsed graph, which I have found quite useful.

*) triple-per-row:

    // JDBC-specific access exposed via ResultSetMetaData
    String subject = rs.getString("subject");
    int subjectType = rs.getInt("subject$type");   // URI | BLANK_NODE
    String predicate = rs.getString("predicate");
    String object = rs.getString("object");
    int objectType = rs.getInt("object$type");     // URI | BLANK_NODE |
                                                   // PLAIN_LITERAL | TYPED_LITERAL
    String lang = rs.getString("object$lang");
    String datatype = rs.getString("object$datatype");

    // RDF-toolkit-specific convenience/hidden access (an alternative
    // to the plain accessors above)
    Resource subjectRes = (Resource) rs.getObject("subject");
    Predicate predicateRes = (Predicate) rs.getObject("predicate");
    RDFNode objectNode = (RDFNode) rs.getObject("object");

Note that the JDBC-specific access can easily be used to provide
configuration- (i.e. factory-) independent access to any
RDF-toolkit-specific objects. It's also a far more robust way of
accessing the toolkit-specific resources than the factory approach,
which would end up in ClassCastExceptions if the configuration changed.
Actually, the factory approach could easily be replaced with simple
(and robust) ResultSet wrappers / handlers, along the lines sketched
below.
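For example, here's a minimal sketch of such a wrapper for Jena,
assuming the triple-per-row column layout above. The class name and the
node-type constants are illustrative assumptions; only the Jena calls
themselves are existing API:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import com.hp.hpl.jena.rdf.model.*;

    class JenaTripleRowWrapper {
        // Assumed node-type codes of the "$type" columns (illustrative).
        static final int URI = 1, BLANK_NODE = 2,
                         PLAIN_LITERAL = 3, TYPED_LITERAL = 4;

        private final Model model = ModelFactory.createDefaultModel();

        // Builds a Jena Statement from the current row, using only the
        // plain JDBC columns - no factory, no ClassCastExceptions.
        Statement toStatement(ResultSet rs) throws SQLException {
            Resource s = (rs.getInt("subject$type") == BLANK_NODE)
                    ? model.createResource(new AnonId(rs.getString("subject")))
                    : model.createResource(rs.getString("subject"));
            Property p = model.createProperty(rs.getString("predicate"));
            RDFNode o;
            switch (rs.getInt("object$type")) {
                case URI:
                    o = model.createResource(rs.getString("object"));
                    break;
                case BLANK_NODE:
                    o = model.createResource(new AnonId(rs.getString("object")));
                    break;
                case TYPED_LITERAL:
                    o = model.createTypedLiteral(rs.getString("object"),
                                                 rs.getString("object$datatype"));
                    break;
                default: // PLAIN_LITERAL
                    String lang = rs.getString("object$lang");
                    o = model.createLiteral(rs.getString("object"),
                                            lang == null ? "" : lang);
            }
            return model.createStatement(s, p, o);
        }
    }

The same pattern would work for any other toolkit - the wrapper, not
the driver configuration, decides what the rows are turned into.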
>If the application wants to do listStatements AND wants triples in the
>local toolkit format, then the value of a JDBC interface (which is
>row-oriented) is pretty slim. So I do not see a high value for SPARQL4j
>as a general connection over the SPARQL protocol in the initial
>releases.

Triple-per-row, on the other hand, offers something useful even for a
not so RDF/SPARQL/toolkit-aware application. The value of providing
only SPARQL result form parsing is also pretty slim - I have actually
already implemented it (just not yet committed it into CVS). As parsing
RDF/XML (and/or N3) is much more difficult, providing just that in a
toolkit-independent way would actually provide extra value. BTW, isn't
it a bit contrary to the open-world view of the Semantic Web when you
argued in your previous email that a model isn't usable unless all of
its statements are known?

>It would seem likely that every RDF toolkit will have a built-in SPARQL
>client, so if the application is doing RDF processing, it is much
>better to use that than trying to fit around the JDBC row-oriented
>paradigm. It's pretty easy to write (that part of ARQ is quite small -
>some rather tedious HTTP connection bashing).

So what's the point of sparql4j then?

>Also, given triples don't come back in any particular order from
>CONSTRUCT, I find it hard to see many processing models that can
>achieve streamability. Maybe you could sketch your use case in a little
>more detail? It's that bit I'm puzzled by.

Firstly, there's a difference between building an application-specific
(domain) model directly from a stream (of triples) and building a
generic RDF model first and only then building the actual target model.
Secondly, even though in the general case the order of triples isn't
guaranteed, it's quite common to group the statements by subject. In
case the construct matches are streamed directly, one could assume that
the triples of a single template match would be somehow grouped; hardly
in any case is the order of the returned triples fully random. The
simplest and most obvious use case is to visualize the returned triples
directly in tabular form. Many GUI/WUI table widgets provide sorting of
the rows by columns. A sketch of the streaming case follows below.
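As a concrete (and purely illustrative) sketch of the first, streaming
use case: assuming the triple-per-row columns above and rows that
arrive grouped by subject, a domain model can be built in a single pass
without materializing an RDF model first. All names here are made up
for the example:

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.HashMap;
    import java.util.Map;

    class DomainModelBuilder {
        // One property map per subject, filled directly from the
        // streamed rows - no intermediate RDF model.
        static Map<String, Map<String, String>> build(ResultSet rs)
                throws SQLException {
            Map<String, Map<String, String>> objects =
                    new HashMap<String, Map<String, String>>();
            Map<String, String> current = null;
            String currentSubject = null;
            while (rs.next()) {
                String subject = rs.getString("subject");
                if (!subject.equals(currentSubject)) { // a new group starts
                    currentSubject = subject;
                    current = new HashMap<String, String>();
                    objects.put(subject, current);
                }
                current.put(rs.getString("predicate"),
                            rs.getString("object"));
            }
            return objects;
        }
    }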
>[[And there aren't any told bnodes in general (but in ARQ you can get
>them by setting the right config options :-) Not sure who will support
>told bNodes. 3Store maybe.]]

There also isn't a way of requiring that a certain type of resource
should be a URI... I used to build (when using Jena) RDQL queries
programmatically using the query API directly, since that way one can
(or at least could) also use the bnodes in queries, thus avoiding
failing queries in case bnodes were used. Of course I was told that I
shouldn't do this and should use the model/resource accessors
instead... however, when using an RDB model with fast path, I was able
to achieve magnitudes better performance this way. If I recall right, I
even used (at least at some point) ResultBinding#getTriples() to
process the results.

>The key is that it minimises the requirements on the client. If we
>assume there is a complete RDF system in the client, why force
>ourselves through the JDBC paradigm when we could just as easily have a
>SPARQL-protocol-specific API? The value of SPARQL4j to me is to connect
>to applications that don't want a full RDF processing system but do
>want to get some information out of an RDF repository.

Exactly(!), and in my opinion this should apply also to the
CONSTRUCT/DESCRIBE queries. Such an application could do hardly
anything with a byte stream of RDF/XML, not to mention N3.

>>A graph may not be, but triples are also usable as such.
>>
>>Also, I find the stream-based access to the results quite usable
>>regardless of the result form - at least if it's XML and not N3 (e.g.
>>XSLT).
>
>That is XSLT on the XML results?

Yes.

>If you mean RDF/XML, the instability of the encoding is why DAWG had to
>do a fixed-schema XML results format.

All the more reason why we should also provide RDF parsing :-)
Succeeding in doing this in a toolkit-independent way might also be
useful to anyone building toolkits...

>>Perhaps we should discuss and document what kind of use cases we want
>>to support with sparql4j?
>
>Cool - good idea.

Let's start a separate thread for this and copy-paste the results into
the document :-)

-Samppa

--
Samppa Saarela <samppa.saarela at profium.com>
Profium, Lars Sonckin kaari 12, 02600 Espoo, Finland
Tel. +358 (0)9 855 98 000
Fax. +358 (0)9 855 98 002
Mob. +358 (0)41 515 1412
Internet: http://www.profium.com