Philoinformatics: Sparql on Riak

Graph Data Stores
There are generally two different goals people have in mind when using graph data stores:

1. Graph-walking from node to node, by following edges in the graph. People usually have social networks or The Social Web in mind.
2. Dynamic schema or schemaless triplestores. People usually have mass fact databases (aka "knowledgebases") or The Semantic Web in mind.

But under the covers, these two concerns generally overlap as graph data stores.

Riak is a very interesting, open source, homogeneously clustered, document data store. But Riak also supports "links" between documents, which makes it a graph data store as well. These links were designed for the goal of graph-walking. An interesting feature is that this graph-walking is reduced to a series of map-reduce operations so queries are fulfilled in parallel across the cluster.
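To make the "graph-walking as a series of phases" idea concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Riak client API: the names `Link`, `STORE`, `link_phase`, and `walk` are all made up for illustration. Each step is a (bucket, tag, keep) triple, which is roughly the shape of a Riak link-walking step as I understand it, and each phase does a "map" over its inputs (which a cluster could run in parallel) followed by a "reduce" that merges the results.

```python
from typing import NamedTuple, Optional


class Link(NamedTuple):
    bucket: str
    key: str
    tag: str


# Toy stand-in for a cluster: (bucket, key) -> outgoing links of that document.
STORE = {
    ("people", "alice"): [
        Link("people", "bob", "friend"),
        Link("people", "carol", "colleague"),
    ],
    ("people", "bob"): [Link("people", "dave", "friend")],
    ("people", "carol"): [Link("people", "dave", "friend")],
    ("people", "dave"): [],
}


def link_phase(inputs, bucket: Optional[str] = None, tag: Optional[str] = None):
    """One phase of the walk; None for bucket or tag acts as a wildcard."""
    # "Map": every input document independently emits its matching link targets.
    mapped = [
        (link.bucket, link.key)
        for doc in inputs
        for link in STORE.get(doc, [])
        if (bucket is None or link.bucket == bucket)
        and (tag is None or link.tag == tag)
    ]
    # "Reduce": merge the emitted targets and drop duplicates, keeping order.
    return list(dict.fromkeys(mapped))


def walk(start, steps):
    """Run one phase per (bucket, tag, keep) step; return results of kept phases."""
    current = [start]
    kept = []
    for bucket, tag, keep in steps:
        current = link_phase(current, bucket, tag)
        if keep:
            kept.append(current)
    return kept


# Friends of friends of alice, keeping only the final phase.
friends_of_friends = walk(
    ("people", "alice"),
    [("people", "friend", False), ("people", "friend", True)],
)

# Follow any link first, then "friend" links, keeping both phases.
two_hops = walk(
    ("people", "alice"),
    [(None, None, True), ("people", "friend", True)],
)
```

Here `friends_of_friends` ends up holding just dave, while `two_hops` holds bob and carol from the first phase and a single, de-duplicated dave from the second. In a real cluster the map step would run on whichever nodes hold the input documents, which is where the parallelism comes from.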

SPARQL on Riak
SPARQL is a query language designed for triplestores (the 2nd goal), but I see no reason why you couldn't use it for graph data stores in general, at least in theory. So if you can reduce SPARQL queries down to Riak queries, then you automatically get your SPARQL queries reduced down to map-reduce operations. Riak even supports something similar to SPARQL property paths, where you can keep intermediate results while following links, so it might not be too difficult to reduce most types of SPARQL queries. One concern I have (after the main concern of whether the reduction is possible at all) is whether Riak can handle billions of tiny "documents", which would essentially just be URIs unless you wanted to store an associated document with each URI.
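As a rough illustration of what such a reduction could look like, the sketch below handles only the easiest case: a chain-shaped pattern that starts from a known URI, such as `SELECT ?y ?z WHERE { ex:alice foaf:knows ?y . ?y foaf:knows ?z }`. It assumes each URI is a tiny document, each triple is a link from subject to object, and the predicate is the link tag. Everything here (`build_link_store`, `chain_to_steps`, `evaluate`, the `ex:` data) is hypothetical and runs against a plain dictionary rather than Riak; it says nothing about whether joins, filters, or optional patterns can be reduced the same way.

```python
from collections import defaultdict

# Sample triples (subject, predicate, object); the data is made up.
TRIPLES = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:knows", "ex:dave"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
]


def build_link_store(triples):
    """Map each subject URI to its (tag, target) links; predicate = link tag."""
    store = defaultdict(list)
    for s, p, o in triples:
        store[s].append((p, o))
        store.setdefault(o, [])  # every URI gets a (possibly empty) document
    return dict(store)


def chain_to_steps(start, patterns):
    """Turn a chain of triple patterns into (tag, variable) link-walk steps.

    Only accepts patterns where each subject is the previous object and each
    object is a variable; anything else is outside this sketch.
    """
    steps = []
    current = start
    for s, p, o in patterns:
        if s != current:
            raise ValueError(f"pattern subject {s!r} does not continue the chain")
        if not o.startswith("?"):
            raise ValueError(f"object {o!r} must be a variable in this sketch")
        steps.append((p, o))
        current = o
    return steps


def evaluate(store, start, steps):
    """Follow the steps, carrying variable bindings along as intermediate results."""
    rows = [({}, start)]  # (bindings so far, current node)
    for tag, var in steps:
        rows = [
            ({**bindings, var: target}, target)
            for bindings, node in rows
            for link_tag, target in store.get(node, [])
            if link_tag == tag
        ]
    return [bindings for bindings, _ in rows]


store = build_link_store(TRIPLES)
patterns = [("ex:alice", "foaf:knows", "?y"), ("?y", "foaf:knows", "?z")]
steps = chain_to_steps("ex:alice", patterns)
solutions = evaluate(store, "ex:alice", steps)
```

Each step of `evaluate` corresponds to one link-following phase, and carrying the bindings forward plays the role of keeping intermediate results. A pattern that reuses a variable, or that has no bound starting point, would need a real join, which this sketch does not attempt.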

Infinitely Scalable Triplestore
One goal I would like to see achieved is an infinitely scalable triplestore. Or if "infinite" is too strong a word, let's say a triplestore that can handle an order of magnitude more triples than the biggest triplestores out there. This SPARQL on Riak proposal might actually be able to pull this off. Queries might be unbearably slow, but they should complete reliably even if they take hours, days, or months. Creating some sort of major plug-in for Apache Hive that handles SPARQL-like queries (Hive currently supports SQL-like queries) might be a better way to build an infinitely scalable distributed triplestore, but doing this would be much more difficult.

Wednesday, December 8, 2010
