Re: [Sparql4j-devel] RE: XML parsing
From: Samppa S. <sam...@pr...> - 2006-01-05 09:52:03
>ResourceFactory (it creates them relative to a hidden model, but all
>operations, if passed a resource from one model, convert it to a
>resource in the model where the operation is to be performed
>automatically).
>Resources need to know where they come from so
> resource.getProperty()
>works.
>
>Otherwise, an application can quite easily work at the Jena Graph level.
>It is stable and more pure (less convenience) and triples are, well,
>triples (3-tuples).
>
>Which per-model caches are you referring to? (and all Jena's caches are
>just that - caches - things work if they get bypassed).
>
>
EnhGraph.enhNodes...
Things work, but using the cache produces less garbage, and creating
resources directly in the target model is better than having multiple
caches (one internal to the factory and one in the target model). Anyway,
these are very subtle differences and the important thing is that things
work.
>Caching partial query results is very hard unless you can deduce one
>query is a sub-query of another.
>
>
True - and that's not what I meant. What I was really questioning is
whether it is enough to have the factory defined at driver level, or
whether we need the ability to set/override the factory at statement
level, AND whether the best alternative for this depends on the toolkit
used.
>>But there's no standard way in jdbc for user to access this
>>information.
>>
>>If the user is provided with an access to InputStream of the result, he
>>needs to get access to the content type also.
>
>The driver would access the information and so know how to parse the
>incoming RDF graph. In fact, it needs a factory interface
>
>interface GraphFactory
>{
>    Object parse(InputStream in, String httpContentTypeAndCharset) ;
>}
>
>then a CONSTRUCT returns a 1-row, 1-col result set : getObject() is the
>graph. getCharacterStream() or getBinaryStream() would give more direct
>access if needed, but (see below) I don't see these as common.
>
>
I see. getCharacterStream and getBinaryStream are definitely better
alternatives than clob/blob. This approach may have the drawback that
the order in which the column-accessor methods are called makes the
result set behave differently, e.g. if getBinaryStream(1) is called
first, getObject(1) is no longer available and vice versa (unless the
binary stream is cached). Also, this kind of dual-type column cannot be
defined via ResultSetMetaData, can it?
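To make the conflict concrete, here is a minimal consumer-side sketch.
It assumes the 1-row/1-col CONSTRUCT result set proposed above; the
class and the query are illustrative only, not committed code:

import java.io.InputStream;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

public class DualAccessSketch {
    static void readConstruct(Connection con) throws Exception {
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
                "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }");
        if (rs.next()) {
            // Toolkit-specific object produced by the configured factory.
            Object graph = rs.getObject(1);
            // Raw serialized RDF: per JDBC, each column should be read
            // only once, so this second access to column 1 may fail
            // (or force the driver to cache the whole stream).
            InputStream in = rs.getBinaryStream(1);
        }
    }
}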
If this same approach were applied to select/ask results (i.e. accessing
getBinaryStream(1) on the first row would return the whole result as a
stream, instead of rows), the result would be even more confusing.
>>>Could you give the use case you have in mind here? (why is it more
>>>convenient to have a stream of triples?)
>>
>>I frequently use Model.listStatements variants - and have used them in
>>every RDF based application I've ever made using Jena or SIR ;-) I
>>wouldn't like the performance penalty nor the increased memory
>>requirements of having to read the results first into a model just for
>>iterating over them. One could also argue that every (reading) RDF
>>operation ultimately involves a stream/iteration of triples. Sure,
>>there are convenience accessors filtering the objects of the
>>statements, or select-type queries returning bindings, but these
>>operations in turn rely on statement iterations. [When building a
>>generic program that doesn't have full control of all input, the
>>select-query access is strictly speaking not usable if "told bnodes"
>>are not supported.]
>
>We need to go back to use cases and the role of SPARQL4j.
>
>
>
Yes :-)
>An important value of SPARQL4j (for me) is that an application which is
>not very RDF aware is able to get information out of a remote RDF
>repository by SPARQL query.
>
>
However, relying on an external RDF API (or InputStream) to handle
certain kinds of queries makes it highly RDF dependent and thus
ultimately requires that the user is not only aware of RDF but also of
the (configuration dependent) RDF toolkit.
The design I'm proposing aims
1) to provide as-natural-as-it-gets jdbc (row/column based) behaviour
in all cases,
2) not to block more RDF/Sparql aware use cases / applications,
3) to make the decision between the desired approaches explicit (i.e. an
application that is aware of the result forms may access the results
directly by being aware of the Sparql-specific jdbc api extensions; see
the sketch after this list),
4) to provide the user all the necessary information and control (e.g.
the ability to define accept preferences when executing a query and
access to the actual content type of the result) needed to process the
results directly, and
5) to provide factory-based / RDF toolkit dependent getObject() accessors
(only) as convenience accessors for RDF aware applications.
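A rough sketch of what 3) and 4) could look like - all names and
signatures below are purely hypothetical, not an agreed sparql4j API:

import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical Sparql-specific extension interfaces.
interface SparqlStatement extends Statement {
    // 4) control: accept preferences sent with the query
    void setAccept(String mediaTypes) throws SQLException;
}

interface SparqlResultSet extends ResultSet {
    // 4) information: the content type actually returned
    String getContentType() throws SQLException;
}

class ExtensionSketch {
    static void run(Statement stmt) throws Exception {
        // 3) the application opts in explicitly by casting to the
        // extension interface; plain jdbc callers never see these methods
        SparqlStatement sstmt = (SparqlStatement) stmt;
        sstmt.setAccept("application/rdf+xml, text/rdf+n3;q=0.5");
        ResultSet rs = sstmt.executeQuery(
                "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }");
        String contentType = ((SparqlResultSet) rs).getContentType();
        System.out.println("Result serialized as: " + contentType);
    }
}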
>JDBC is not a good match to getting RDF graphs (as we are finding!) and
>choosing one processing model (streams of something) over another makes
>assumptions about the nature and approach of the toolkit.
>
>
That's why I'd try to avoid any direct/required dependency on the
factory, providing it only (and possibly optionally) for convenience to
more RDF aware users. In my opinion the triple-per-row* is the
as-good-as-it-gets alternative for row based (i.e. jdbc style) handling
of RDF. It actually resembles one of RDF's serialization forms, namely
NTRIPLES. The W3C's RDF Validator also provides a tabular form of the
parsed graph, which I have found quite useful.
*) triple-per-row:

// JDBC-specific access, exposed via ResultSetMetaData
String subject = rs.getString("subject");
int subjectType = rs.getInt("subject$type");  // URI | BLANK_NODE
String predicate = rs.getString("predicate");
String object = rs.getString("object");
int objectType = rs.getInt("object$type");    // URI | BLANK_NODE |
                                              // PLAIN_LITERAL | TYPED_LITERAL
String lang = rs.getString("object$lang");
String datatype = rs.getString("object$datatype");

// RDF toolkit specific convenience/hidden access (Jena types shown)
Resource s = (Resource) rs.getObject("subject");
Property p = (Property) rs.getObject("predicate");
RDFNode o = (RDFNode) rs.getObject("object");
Note that the JDBC-specific access can easily be used to provide a
configuration (i.e. factory) independent access to any RDF toolkit's
specific objects. It's also a far more robust way of accessing the
toolkit specific resources than the factory approach, which would end up
in ClassCastExceptions if the configuration changed. Actually, the
factory approach could easily be replaced with simple (and robust)
ResultSet wrappers / handlers.
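For illustration, a minimal sketch of such a handler, assuming the
triple-per-row column layout above and Jena as the toolkit (the node
type constants are hypothetical placeholders):

import java.sql.ResultSet;
import java.sql.SQLException;

import com.hp.hpl.jena.rdf.model.AnonId;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.rdf.model.Resource;

// Maps the toolkit-independent triple-per-row columns onto Jena objects
// without any factory configured into the driver.
public class JenaRowHandler {
    static final int URI = 1, BLANK_NODE = 2,
            PLAIN_LITERAL = 3, TYPED_LITERAL = 4;  // hypothetical values

    private final Model model = ModelFactory.createDefaultModel();

    public Resource subject(ResultSet rs) throws SQLException {
        return rs.getInt("subject$type") == BLANK_NODE
                ? model.createResource(new AnonId(rs.getString("subject")))
                : model.createResource(rs.getString("subject"));
    }

    public Property predicate(ResultSet rs) throws SQLException {
        return model.createProperty(rs.getString("predicate"));
    }

    public RDFNode object(ResultSet rs) throws SQLException {
        String lex = rs.getString("object");
        switch (rs.getInt("object$type")) {
            case URI:           return model.createResource(lex);
            case BLANK_NODE:    return model.createResource(new AnonId(lex));
            case TYPED_LITERAL: return model.createTypedLiteral(lex,
                                        rs.getString("object$datatype"));
            default:            return model.createLiteral(lex,
                                        rs.getString("object$lang"));
        }
    }
}

Because the handler works on the plain jdbc columns, swapping the
toolkit only means swapping the handler - no driver configuration, no
ClassCastExceptions.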
>If the application wants to do listStatements AND wants triples in the
>local toolkit format, then the value of a JDBC interface (which is
>row-oriented) is pretty slim. So I do not see a high value for SPARQL4j
>as a general connection over the SPARQL protocol in the initial
>releases.
>
>
Triple-per-row, on the other hand, offers something useful even for a
not so RDF/Sparql/toolkit aware application.
The value of providing only sparql result form parsing is also pretty
slim - I have actually already implemented it (just not yet committed it
into CVS). As parsing RDF/XML (and/or N3) is much more difficult,
providing just that in a toolkit independent way would actually add
extra value.
BTW isn't it a bit contrary to the open-world view of the Semantic Web
when you argued in your previous email that a model isn't usable unless
all of its statements are known?
>It would seem likely that every RDF toolkit will have a built-in SPARQL
>client, so if the application is doing RDF processing, it is much better
>to use that than trying to fit around the JDBC row-oriented paradigm.
>It's pretty easy to write (that part of ARQ is quite small - some rather
>tedious HTTP connection bashing).
>
>
So what's the point of sparql4j then?
>Also, given triples don't come back in any particular order from
>CONSTRUCT, I find it hard to see many processing models that can
>achieve streamability. Maybe you could sketch your use case in a little
>more detail? It's that bit I'm puzzled by.
>
>
Firstly, there's a difference between building an application specific
(domain) model from a stream (of triples) and first building a generic
RDF model and only then building the actual target model.
Secondly, even though in the general case the order of triples isn't
guaranteed, it's quite common to group the statements by subject. In
case the construct matches are streamed directly, one could assume that
the triples of a single template match would be somehow grouped. The
order of the returned triples is hardly ever fully random.
The simplest and most obvious use case is to visualize the returned
triples directly in a tabular form. Many GUI/WUI table widgets provide
sorting of the rows by columns.
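For example, a minimal sketch feeding a Swing table straight from a
triple-per-row result set (only the column layout sketched earlier is
assumed):

import java.sql.ResultSet;
import javax.swing.JTable;
import javax.swing.table.DefaultTableModel;

public class TripleTableSketch {
    // Builds a table widget row by row, without materializing a model.
    public static JTable build(ResultSet rs) throws Exception {
        DefaultTableModel m = new DefaultTableModel(
                new Object[] { "subject", "predicate", "object" }, 0);
        while (rs.next()) {
            m.addRow(new Object[] {
                    rs.getString("subject"),
                    rs.getString("predicate"),
                    rs.getString("object") });
        }
        return new JTable(m);
    }
}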
>[[And there aren't any told bnodes in general (but in ARQ you can get
>them by setting the right config options :-) Not sure who will support
>told bNodes. 3Store maybe.]]
>
>
There also isn't a way of requiring that a certain type of resource
should be a URI...
I used to build (when using Jena) RDQL queries programmatically using
the query API directly, since that way one can (or at least could) also
use bnodes in queries, thus avoiding failing queries in case bnodes were
used. Of course I was told that I shouldn't do this and should use the
model/resource accessors instead... however, when using an RDB model
with fast path, I was able to achieve magnitudes better performance this
way. If I recall right, I even used (at least at some point)
ResultBinding#getTriples() to process the results.
>The key is that it minimises the requirements on the client. If we
>assume there is a complete RDF system in the client, why force ourselves
>through the JDBC paradigm when we could just as easily have a
>SPARQL-protocol specific API? The value of SPARQL4j to me is to connect
>to applications that don't want a full RDF processing system but do want
>to get some information out of an RDF repository.
>
>
Exactly(!) - and in my opinion this should also apply to the
CONSTRUCT/DESCRIBE queries. Such an application could do hardly anything
with a byte stream of RDF/XML, not to mention N3.
>>A graph may not be, but triples are also usable as such.
>>
>>Also I find the stream based access to the results quite usable
>>regardless of the result form - at least if it's XML and not N3 (e.g.
>>XSLT).
>>
>>
>
>That is XSLT on the XML results?
>
>
Yes.
>If you mean RDF/XML, the instability of the encoding is why DAWG had to
>do a fixed schema XML results format.
>
>
All the more reason why we should also provide RDF parsing :-)
Succeeding in doing this in a toolkit independent way might also be
useful to anyone building toolkits...
>>Perhaps we should discuss and document what kind of use cases we want
>>to support with sparql4j?
>
>Cool - good idea.
>
>
Let's start a separate thread for this and copy-paste results into the
document :-)
-Samppa
--
Samppa Saarela <samppa.saarela at profium.com>
Profium, Lars Sonckin kaari 12, 02600 Espoo, Finland
Tel. +358 (0)9 855 98 000 Fax. +358 (0)9 855 98 002 Mob. +358 (0)41 515 1412
Internet: http://www.profium.com