RE: [Sparql4j-devel] RE: XML parsing
From: Seaborne, A. <and...@hp...> - 2006-01-06 16:34:05
-------- Original Message --------
> From: Samppa Saarela <>
> Date: 5 January 2006 09:50

. . . .

> > An important value of SPARQl4j (for me) is that an application, which is
> > not very RDF aware, is able to get information out of a remote RDF
> > repository by SPARQL query.
>
> However, relying on an external RDF API (or InputStream) to handle certain
> kinds of queries makes it highly RDF dependent and thus ultimately requires
> that the user is not only aware of RDF but also aware of the
> (configuration-dependent) RDF toolkit.
>
> The design I'm proposing aims
>
> 1) at providing as-natural-as-it-gets JDBC (row/column-based) behaviour
> in all cases,
>
> 2) at not blocking more RDF/SPARQL-aware use cases / applications,
>
> 3) at making the decision between the desired approaches explicit (i.e. an
> application that is aware of the result forms may access the results
> directly by being aware of the SPARQL-specific JDBC API extensions),
>
> 4) at providing the user all the necessary information and control (e.g.
> the ability to define accept preferences when executing a query and access
> to the actual content type of the result) needed to process the results
> directly, and
>
> 5) at providing factory-based / RDF-toolkit-dependent getObject() accessors
> (only) as convenience accessors for RDF-aware applications.
>
> > JDBC is not a good match to getting RDF graphs (as we are finding!) and
> > choosing one processing model (streams of something) over another makes
> > assumptions about the nature and approach of the toolkit.
>
> That's why I'd try to avoid any direct/required dependency on the
> factory, providing it only (and possibly optionally) for convenience to
> more RDF-aware users. In my opinion the triple-per-row* layout is the
> as-good-as-it-gets alternative for row-based (i.e. JDBC-style) handling
> of RDF. It actually resembles one of RDF's serialization forms, namely
> N-Triples.
> Also the W3C's RDF Validator provides a tabular form of the parsed
> graph, which I have found quite useful.
>
> *) triple-per-row:
>
> // JDBC-specific access exposed via ResultSetMetaData
> String subject = rs.getString("subject");
> int subjectType = rs.getInt("subject$type"); // URI | BLANK_NODE
> String predicate = rs.getString("predicate");
> String object = rs.getString("object");
> int objectType = rs.getInt("object$type");
> // URI | BLANK_NODE | PLAIN_LITERAL | TYPED_LITERAL
> String lang = rs.getString("object$lang");
> String datatype = rs.getString("object$datatype");

That looks just like

    SELECT ?subject ?predicate ?object { ... pattern ... }

+ the accessors for type and lang.

> // RDF-toolkit-specific convenience/hidden access
> Resource subject = (Resource)rs.getObject("subject");
> Predicate predicate = (Predicate)rs.getObject("predicate");
> RDFNode object = (RDFNode)rs.getObject("object");
>
> Note that the JDBC-specific access can easily be used to provide
> configuration-independent (i.e. factory-independent) access to any
> RDF-toolkit-specific objects. It's also a far more robust way of
> accessing the toolkit-specific resources than the factory approach,
> which would end up in ClassCastExceptions if the configuration changed.
> Actually the factory approach could easily be replaced with simple (and
> robust) ResultSet wrappers / handlers.
>
> > If the application wants to do listStatements AND wants triples in the
> > local toolkit format, then the value of a JDBC interface (which is
> > row-oriented) is pretty slim. So I do not see a high value for
> > SPARQl4j as a general connection over the SPARQL protocol in the
> > initial releases.
>
> Triple-per-row, on the other hand, offers something useful even for a
> not so RDF/SPARQL/toolkit-aware application.
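[Editor's note] The triple-per-row layout and the "ResultSet wrappers / handlers" remark above can be sketched as below. This is a minimal illustration, not sparql4j code: the `TripleHandler` interface, the type-code constants and all names are hypothetical, and in a real driver the dispatch would sit inside a loop over `java.sql.ResultSet.next()` reading the columns shown in the email.

```java
public class TripleRowSketch {

    // Type codes corresponding to the subject$type / object$type columns
    // proposed in the email (values are illustrative).
    static final int URI = 1;
    static final int BLANK_NODE = 2;
    static final int PLAIN_LITERAL = 3;
    static final int TYPED_LITERAL = 4;

    /** Application-supplied callback: one call per result row, plain
     *  strings plus type codes, no toolkit factory involved. */
    interface TripleHandler {
        void triple(String subject, int subjectType, String predicate,
                    String object, int objectType, String lang, String datatype);
    }

    /** Render one row in N-Triples-like syntax - the serialization the
     *  triple-per-row layout resembles. */
    static String toNTriples(String s, int sType, String p,
                             String o, int oType, String lang, String datatype) {
        String subj = (sType == BLANK_NODE) ? "_:" + s : "<" + s + ">";
        String obj;
        if (oType == URI) {
            obj = "<" + o + ">";
        } else if (oType == BLANK_NODE) {
            obj = "_:" + o;
        } else {
            obj = "\"" + o + "\"";
            if (lang != null && !lang.isEmpty()) obj += "@" + lang;
            if (oType == TYPED_LITERAL && datatype != null) obj += "^^<" + datatype + ">";
        }
        return subj + " <" + p + "> " + obj + " .";
    }

    public static void main(String[] args) {
        // In the real driver this call would be made once per ResultSet row.
        TripleHandler printer = (s, st, p, o, ot, lang, dt) ->
                System.out.println(toNTriples(s, st, p, o, ot, lang, dt));
        printer.triple("http://example.org/a", URI, "http://example.org/name",
                       "Alice", PLAIN_LITERAL, "en", null);
    }
}
```

Because the handler sees only strings and type codes, swapping the RDF toolkit never produces ClassCastExceptions; a toolkit-specific wrapper can still be layered on top where convenient.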
>
> The value of providing only SPARQL result-form parsing is also pretty
> slim - I have actually already implemented it (just not yet committed
> it into CVS). As parsing RDF/XML (and/or N3) is much more difficult,
> providing just that in a toolkit-independent way would actually provide
> extra value.
>
> BTW isn't it a bit contrary to the open-world view of the Semantic Web,
> when you argued in your previous email that a model isn't usable unless
> all of its statements are known?

No, not at all. Entailment is defined on models. Open world says there
may be other statements out there in other models. An application
displaying the results of a query wants to know when it has seen all the
results for its query on its chosen data source. e.g. listing people by
name makes an ordering assumption. When you see the "K"'s, there are no
more "J"'s in this result.

> > It would seem likely that every RDF toolkit will have a built-in
> > SPARQL client, so if the application is doing RDF processing, it is
> > much better to use that than trying to fit around the JDBC
> > row-oriented paradigm. It's pretty easy to write (that part of ARQ is
> > quite small - some rather tedious HTTP connection bashing).
>
> So what's the point of sparql4j then?

SELECT queries for applications that wish to access RDF information. For
example, an RDF repository as the core of a 3-tier web site. No need for
the business logic to have an RDF toolkit if it just wants to get
people's names.

If an application is going to do RDF processing on graphs, it would not
want to use JDBC's row-by-row view. Why not just get the constructed
graph in whatever way its toolkit wants? Because there is a paradigm
shift, the value of JDBC for moving graphs around seems limited to me.

But SELECT queries, to get information out of RDF graphs, are the most
important kind of query. Feed these into JDBC environments in the JDBC
paradigm and we may even be able to reuse JDBC client tools.
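[Editor's note] On the SELECT result-form parsing discussed above: the fixed-schema SPARQL Query Results XML Format can be parsed with nothing but the JDK's DOM support. The sketch below is an editor's illustration of that point, not the implementation Samppa mentions; the class and method names are made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SelectResultsSketch {
    // Namespace of the DAWG SPARQL Query Results XML Format.
    static final String NS = "http://www.w3.org/2005/sparql-results#";

    /** Parse a SPARQL Query Results XML document into one
     *  variable-to-value map per result row. */
    static List<Map<String, String>> parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            Document doc = f.newDocumentBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Map<String, String>> rows = new ArrayList<>();
            NodeList results = doc.getElementsByTagNameNS(NS, "result");
            for (int i = 0; i < results.getLength(); i++) {
                Map<String, String> row = new LinkedHashMap<>();
                NodeList bindings = ((Element) results.item(i))
                        .getElementsByTagNameNS(NS, "binding");
                for (int j = 0; j < bindings.getLength(); j++) {
                    Element b = (Element) bindings.item(j);
                    // Each <binding name="var"> wraps a <uri>, <bnode>
                    // or <literal> element; take its text value.
                    row.put(b.getAttribute("name"), b.getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new RuntimeException("bad results document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<sparql xmlns='" + NS + "'>"
                + "<head><variable name='name'/></head>"
                + "<results><result><binding name='name'>"
                + "<literal>Alice</literal></binding></result></results>"
                + "</sparql>";
        System.out.println(parse(xml));
    }
}
```

A row map per result is exactly the shape a JDBC ResultSet row needs, which is why this much is cheap to provide in a toolkit-independent way (it deliberately drops the uri/bnode/literal distinction; a real driver would surface that, e.g. via the `$type` columns discussed earlier).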
>
> > Also, given triples don't come back in any particular order from
> > CONSTRUCT, I find it hard to see many processing models that can
> > achieve streamability. Maybe you could sketch your use case in a
> > little more detail? It's that bit I'm puzzled by.
>
> Firstly, there's a difference between building an application-specific
> (domain) model from a stream (of triples) and first building a generic
> RDF model and only then building the actual target model.
>
> Secondly, even though in the general case the order of triples isn't
> guaranteed, it's quite common to group the statements by subject.

I disagree - you're relying on the server having processed the entire
model to make the statements come out nicely. Jena uses hashes all over
the place - things do not come out in any sort of order, nor is it
consistent.

> In case the construct matches are streamed directly, one could assume
> that the triples of a single template match would be somehow grouped.
> Hardly in any case is the order of returned triples fully random.
>
> The simplest and most obvious use case is to visualize the triples
> returned in a tabular form directly. Many GUI/WUI table widgets provide
> sorting of the rows by columns.
>
> > [[And there aren't any told bnodes in general (but in ARQ you can get
> > them by setting the right config options :-) Not sure who will
> > support told bNodes. 3Store maybe.]]
>
> There also isn't a way of requiring that a certain type of resource
> should be a URI...
>
> I used to build (when using Jena) RDQL queries programmatically using
> the query API directly, since that way one can (or at least could) also
> use bnodes in queries, thus avoiding failing queries in case bnodes
> were used. Of course I was told that I shouldn't do this and should use
> model/resource accessors instead... however, when using an RDB model
> with fast path, I was able to achieve magnitudes better performance
> this way.
> If I recall right, I even used (at least at some point)
> ResultBinding#getTriples() to process the results.

But it would not have worked with Joseki :-) bNodes are not preserved
across the network. You might like to see the ARQ configuration options
I just put in so that bnode ids are passed transparently across the
network. It's not the default mode (it makes the XML results look ugly).

> > The key is that it minimises the requirements on the client. If we
> > assume there is a complete RDF system in the client, why force
> > ourselves through the JDBC paradigm when we could just as easily have
> > a SPARQL-protocol-specific API? The value of SPARQL4j to me is to
> > connect to applications that don't want a full RDF processing system
> > but do want to get some information out of an RDF repository.
>
> Exactly(!) and in my opinion this should also apply to the
> CONSTRUCT/DESCRIBE queries. Such an application could do hardly
> anything with a byte stream of RDF/XML, not to mention N3.
>
> > > A graph may not be, but triples are also usable as such.
> > >
> > > Also I find the stream-based access to the results quite usable
> > > regardless of the result form - at least if it's XML and not N3
> > > (e.g. XSLT).
> >
> > That is XSLT on the XML results?
>
> Yes.

So it's a SELECT query.

> > If you mean RDF/XML, the instability of the encoding is why DAWG had
> > to do a fixed-schema XML results format.
>
> All the more reason why we should also provide RDF parsing :-)
> Succeeding to do this in a toolkit-independent way might also be useful
> to anyone building toolkits...

Then the objective of the project is now providing another API to RDF
(wrapping all toolkits is just like a new toolkit that uses the others
as its implementation).
I have a toolkit, and a SPARQL-protocol interface that (I think) is
easier to work with than warping to the JDBC paradigm and warping back
again - it's a cognitive burden on the app writer. I wanted an approach
of doing something clear-cut, minimal and distinct, with a clear value.
But this now seems to be growing into a general-purpose RDF framework.
Can we find some limits please?

	Andy

> > > Perhaps we should discuss and document what kind of use cases we
> > > want to support with sparql4j?
> >
> > Cool - good idea.
>
> Let's start a separate thread for this and copy-paste the results into
> the document :-)
>
> -Samppa
>
> --
> Samppa Saarela <samppa.saarela at profium.com>
> Profium, Lars Sonckin kaari 12, 02600 Espoo, Finland
> Tel. +358 (0)9 855 98 000  Fax. +358 (0)9 855 98 002
> Mob. +358 (0)41 515 1412
> Internet: http://www.profium.com