A blog about ideas relating to philoinformatics (or at least that have something to do with computer science or philosophy)

Saturday, December 11, 2010

Philoinformatics and Categories of Informatics

How does philoinformatics relate to general informatics?
I will once again answer this using MS Paint. I think informatics (information science) can be usefully looked at as a kind of fan-like spectrum running from general to specific informatics. At the pointy handle end, you have fundamental informatics. As you move out toward the edges of the fan, you pass through general domain informatics, and at the edges on the right you have specific informatics fields such as bioinformatics, socioinformatics, and philoinformatics. Those are the grey and beige slices.

You can think of any very specific topic as being placed on the edge of the fan under the title of "Informatics of X" or "X informatics" or, if you're lucky enough to have a nice prefix representing the topic, even "Xoinformatics"!

Domain Informatics
Just taking a look at the Wikipedia page on informatics begs for some subcategories to help organize the discipline. Problems that are solved by the same mechanism in multiple specific informatics disciplines are more appropriately placed deeper (to the left) in this fan picture, in the direction of general fundamental informatics. This is the realm of domain informatics. Domain informatics is arguably the most interesting area of informatics: fundamental informatics is quite stable and almost completely content-neutral, while most advances in a specific informatics domain can usually be generalized only to a certain point, under certain conditions. I'd put things like information entropy, communication channel theory, cloud computing, and generic encryption issues in the 'fundamental informatics' category. In order to explain where philoinformatics lies in this picture, I'm going to try to identify and categorize domain informatics.

Qualitative vs Quantitative
I think a major distinction to make when categorizing the growing number of domain informatics fields is between qualitative and quantitative content. All disciplines, of course, need to deal with both quantitative and qualitative data, but some disciplines (like physics) have quantitative measurement at their cores, while other disciplines (like history) have qualitative reports and observations at their cores.

A Third Q?
Abstract disciplines like philosophy, law, math, economics, and computation, which are "removed", in a sense, from direct empirical observation, are interesting cases. They all seem to allow for more rigid models than qualitative observations do, but they are generally not amenable to numerical models in the way quantitative measurements are. Unfortunately, I can't think of an appropriately catchy word that starts with a 'Q' to add to the Quantity/Quality (false) dichotomy. But I think we can roughly partition all of domain informatics into Feature, Model, and Measurement Informatics. These are the yellow, blue, and red parts of my beautiful map of informatics above.

Categorizing Philoinformatics

Content in philosophy is published in chunks at the paper and book level. Some of these papers can get heavy on symbols, but generally we're talking about free text, and almost never does a paper get heavy on numbers. The philoinformatics needed to handle this traditional form of philosophy is largely encompassed by general Publishing 2.0 initiatives, which are part of domain informatics. Registries of philosophers, registries of papers, and construing bibliographies as dereferenceable (aka "followable") URIs are not unique to philosophical publications. This initiative involves simple feature informatics (by which I mean 'simple features', not 'a simple task'). It's also a task that is extrinsic to philosophy in the sense that it is neutral to the content.

The more radical goal of philoinformatics that I mentioned in my philoinformatics manifesto draft involves cracking into the content itself, whether by extracting it from traditional publications or by inventing new types of publication. Much of this content will involve trying to serialize identified ideas, concepts, and definitions that would otherwise be available only as unstructured free-form text in regular publications. As important as this is, even these items are somewhat general in that they are going to be used in all kinds of publications. But I should stress that these are the kinds of things that are currently rarely captured in a formal, machine-readable form, and capturing them would be a major enhancement to the entire domain.


So what content is unique to philosophy, or at least almost unique? The motivation for distinguishing philoinformatics (or any subdiscipline of general informatics) is that some quality of the content makes it somewhat unique. The content that is almost unique to philoinformatics is the handling of thought experiments, the free use of 'Xian' where X is any philosopher's name, and possibly the modeling of widespread but uniquely philosophical notions like internalist/externalist, foundationalist/coherentist, objective/subjective, absolute/relative, and contingent/necessary. With a proper foundation of terms and links combining these items with publications and endorsement and rejection statements, we could start computing over philosophical notions to find general properties of philosophical positions: hidden inconsistencies, distance from evidence, robust multidirectional support, and other relations that could potentially be defined in terms of this foundational data. Basically, traction, and then real progress, may finally be possible.
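To make this a bit more concrete, here is a minimal sketch in Python of what such foundational data might look like. All of the names, statements, and the "mutually exclusive" axiom are invented for illustration; the point is only that once endorsements are captured as structured data, checks like "hidden inconsistency" become computable.

```python
# Toy foundational data (all names hypothetical): statements linking a
# philosopher to a position they endorse or reject.
statements = [
    ("descartes", "endorses", "foundationalism"),
    ("descartes", "endorses", "internalism"),
    ("quine",     "rejects",  "foundationalism"),
    ("quine",     "endorses", "coherentism"),
    ("someone",   "endorses", "foundationalism"),
    ("someone",   "endorses", "coherentism"),
]

# Hypothetical background axiom: these positions exclude each other.
mutually_exclusive = {("foundationalism", "coherentism")}

def inconsistencies(statements):
    """Find anyone who endorses two positions marked as exclusive."""
    endorsed = {}
    for person, verb, position in statements:
        if verb == "endorses":
            endorsed.setdefault(person, set()).add(position)
    for person, positions in endorsed.items():
        for a, b in mutually_exclusive:
            if a in positions and b in positions:
                yield person, a, b

print(list(inconsistencies(statements)))
# [('someone', 'foundationalism', 'coherentism')]
```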

Wednesday, December 8, 2010

Sparql on Riak

Graph Data Stores
There are generally two different goals people have in mind when using graph data stores.
  1. Graph-walking from node to node, by following edges in the graph. People usually have social networks or The Social Web in mind.
  2. Dynamic schema or schemaless triplestores. People usually have mass fact databases (aka "knowledgebases") or The Semantic Web in mind.
But under the covers, these two concerns generally overlap as graph data stores.

Riak is a very interesting open-source, homogeneously clustered document data store. But Riak also supports "links" between documents, which makes it a graph data store as well. These links were designed with the first goal, graph-walking, in mind. An interesting feature is that this graph-walking is reduced to a series of map-reduce operations, so queries are fulfilled in parallel across the cluster.
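To illustrate the idea (this is a toy Python simulation, not the actual Riak API), here is link-walking phrased as map phases: each phase takes a set of document keys and follows every matching link. Since each key's links can be followed independently, the map step can run in parallel across a cluster.

```python
# Toy document store: each document carries tagged links to other keys.
documents = {
    "alice": {"body": "...", "links": [("knows", "bob"), ("knows", "carol")]},
    "bob":   {"body": "...", "links": [("knows", "dave")]},
    "carol": {"body": "...", "links": []},
    "dave":  {"body": "...", "links": []},
}

def map_phase(keys, tag):
    """One link-walking step: follow every link with the given tag.
    Each key can be processed independently, hence 'map' phase."""
    out = set()
    for key in keys:
        for link_tag, target in documents[key]["links"]:
            if link_tag == tag:
                out.add(target)
    return out

# Walk two hops of "knows" links starting from alice.
friends = map_phase({"alice"}, "knows")
friends_of_friends = map_phase(friends, "knows")
print(friends, friends_of_friends)  # {'bob', 'carol'} {'dave'}
```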

SPARQL on Riak
SPARQL is a query language designed for triplestores (the second goal), but I see no reason why you couldn't use it for graph data stores in general, at least in theory. So if you can reduce SPARQL queries down to Riak queries, then you automatically get your SPARQL queries reduced down to map-reduce operations. Riak even supports something similar to SPARQL property paths, where you can keep intermediate results while following links, so it might not be too difficult to reduce most types of SPARQL queries. One concern I have (after the main concern of whether the reduction is possible at all) is whether Riak can handle billions of tiny "documents", which would essentially just be URIs unless you wanted to store an associated document with each URI.
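As a rough sketch of what that reduction might look like for one simple query shape, here is a SPARQL-style basic graph pattern (shown only in a comment; nothing here parses real SPARQL) translated into a chain of the link-walking phases from the sketch above.

```python
# Continuing the toy model above. Illustrative query:
#
#   SELECT ?fof WHERE {
#     :alice  :knows ?friend .
#     ?friend :knows ?fof .
#   }
#
# Each triple pattern whose subject is already bound becomes one phase;
# the boolean marks whether that phase's bindings are kept as results.
phases = [("knows", False),   # :alice :knows ?friend  (intermediate)
          ("knows", True)]    # ?friend :knows ?fof    (final results)

def run_query(start_keys, phases):
    keys, results = set(start_keys), set()
    for tag, keep in phases:
        keys = map_phase(keys, tag)  # map_phase from the sketch above
        if keep:
            results |= keys
    return results

print(run_query({"alice"}, phases))  # {'dave'}
```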

Infinitely Scalable Triplestore
One goal I would like to see achieved is an infinitely scalable triplestore. Or, if "infinite" is too strong a word, let's say a triplestore that can handle an order of magnitude more triples than the biggest triplestores out there. This SPARQL-on-Riak proposal might actually be able to pull that off. Queries might be unbearably slow, but they should complete reliably even if they take hours, days, or months. Creating some sort of major plug-in for Apache Hive that handles SPARQL-like queries (Hive currently supports SQL-like queries) might be the more ideal way to build an infinitely scalable distributed triplestore, but doing so would be much more difficult.

Tuesday, December 7, 2010

Subjective Consistency

I've been researching all kinds of data stores (well, actually relational, key-value, and document data stores) and I've become aware of an interesting constraint on distributed data stores known as Brewer's CAP theorem. The idea is that you can't have Consistency, Availability, and Partition tolerance simultaneously in any distributed data store. It looks like it's difficult to get complete consistency even on a single node (see: isolation levels), and it's thought to be impossible to get at network scale (because of the CAP theorem). This is where "eventual consistency" usually comes in, relaxing consistency in exchange for availability and partition tolerance.

Interaction-Centric Consistency
Hopefully I can frame my idea properly now that I've confused you with some terminology. My initial thought was: what kind of guarantees can a data store offer if a single user or application talks to the same node in the network? We could call this a "data session" or an "interaction". It's a kind of network-transaction idea, looser than a data transaction. Anyways, I wonder if you could guarantee a stronger level of consistency by using your distributed network in this way. There might be a way to offer an apparent, or subjective, temporary consistency. Ultimately, the idea is that if we make use of the access patterns we expect from our users, then we may not need strict distributed consistency in the first place for a good number of applications.
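Here's a toy illustration in Python, with an entirely invented API (not any real data store), of the basic mechanism: pin each session to one node, so that within a session, reads observe that session's own writes even while replication between nodes lags behind.

```python
import hashlib

class Node:
    """One node in the cluster; replication to peers happens lazily."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value  # peers will catch up eventually

    def get(self, key):
        return self.data.get(key)

class Cluster:
    def __init__(self, n):
        self.nodes = [Node() for _ in range(n)]

    def node_for(self, session_id):
        """Always route a given session to the same node."""
        digest = hashlib.sha1(session_id.encode()).digest()
        return self.nodes[digest[0] % len(self.nodes)]

cluster = Cluster(3)
node = cluster.node_for("session-42")
node.put("draft", "v1")
assert node.get("draft") == "v1"  # read-your-writes within the session
```

Other sessions routed to other nodes may briefly see stale data, but the interacting user never does, which is the "subjective" part of subjective consistency.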

Wednesday, December 1, 2010

Bridging Ontologies: The Key to Scalable Ontology Design

It's been years since I've created an ontology (in the computing/informatics sense), but I'm going to give some advice on creating them anyways. When creating an ontology, it can be helpful to connect it up to other related ontologies. In fact, I think this is a requirement for building the Semantic Web (taking 'ontology' in a broad sense). You may want to ground your ontology (i.e. connect it to more generic or foundational ontologies, towards upper ontologies) or connect it to well-known ontologies, increasing the potential usefulness and adoption of your ontology. Whatever the reason, there are benefits in doing so if you want the data your ontology schematizes to be more easily and automatically reusable. The potential downside is that you are forcing your users to endorse the ontology you're connecting to. So how exactly should one connect their ontology into the ontology ecosystem?

Most ontologies out there seem to me to be part of a stack of ontologies built by a single group of people. The ontologies tend to build directly on top of each other, meaning "lower" ontologies directly reference "upper" ontologies. Since the ontologies are developed by a single organization, it seems to make sense to connect to them directly, because the organization (arguably) knows exactly what it is attempting to represent, or what it means. The fact that organizations tend to keep their ontologies rather isolated may be caused by a fear of committing to ontologies they didn't create.

The way ontologies are (or at least should be) developed allows for changes and updates. To accommodate this, one should develop ontologies with versioning. That way, someone using your ontology won't ever have it change on them, and the developers can still maintain and change the ontology by introducing new versions. It's as simple as adding a version number to your ontology's URL (for example, http://example.org/my-ontology/1.0/ versus .../2.0/).

But this brings up a problem we face when directly referencing other ontologies. Imagine you have an ontology X that makes reference to another ontology Y, and that a newer version of Y becomes available. To keep X up to date, you plan to update a term in X to reference the essentially identical term in the newer version of Y. So you update X, even though its meaning basically hasn't changed. The role an ontology fulfills is to describe a certain subject or topic, and this intrinsic meaning has not changed; yet you still need to change your ontology. Under these conditions, no matter how much consensus forms around the accuracy of your ontology, you will never know when it is stable. In fact, this leads to a cascade of updates and changes required of downstream ontologies that reference your ontology, and so on. This is not a distributed, web-scale ontology design pattern. We need a way to decouple our ontologies.

So, is there a design pattern we can use to avoid these dangers and burdens of connecting to other ontologies? Can we do better than simply identifying good stable ontologies and directly referencing only those ontologies in our own ontology? Yes!

Introducing: Content vs Bridging Ontologies

The key to scalable ontology design is what I call Bridging Ontologies. You write your intended ontology without referencing other ontologies, and then create a separate ontology that is mainly made up of owl:sameAs and rdfs:subClassOf relationships between your terms and the target ontology's terms. I call these Content Ontologies and Bridging Ontologies, respectively. You only need to update your Bridging Ontology when either the source or target ontology changes. The nature of a Bridging Ontology makes it useless for anyone to reference in their own ontologies, which stops any potential cascade of changes throughout the web of ontologies. Of course, users would still need to use the Bridging Ontologies and would likely need to collapse the owl:sameAs relationships into single terms for most visualizing, processing, or reasoning purposes.
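As a small sketch of the pattern using the rdflib library in Python (the ontology namespaces are hypothetical), the Bridging Ontology is nothing but a graph of cross-ontology mappings, kept completely separate from the Content Ontology's own terms:

```python
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDFS

# Hypothetical namespaces: my versioned Content Ontology and a target.
MINE   = Namespace("http://example.org/my-ontology/1.0/")
TARGET = Namespace("http://example.org/target-ontology/2.3/")

# The Bridging Ontology: only mappings, no content of its own.
bridge = Graph()
bridge.add((MINE.Person, OWL.sameAs, TARGET.Agent))
bridge.add((MINE.Philosopher, RDFS.subClassOf, TARGET.Agent))

print(bridge.serialize(format="turtle"))
```

When TARGET releases version 2.4, only this little graph needs a new version; MINE itself never changes.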

I'll go out on a limb here and say that every ontology anyone creates should be isolated in this manner. The vision then becomes a web of ontologies: small Content Ontology nodes that satisfy specific "semantic roles", with Bridging Ontology edges defined between them. Since you don't need to adopt all of the Bridging Ontologies built for a Content Ontology, it is much easier to reach consensus on the Content Ontologies and then pick and choose your Bridging Ontologies, choosing to commit (or not) to exactly how that content fits into the big picture. This allows for decoupled semantics rather than traditional, inflexible semantics.