A blog about ideas relating to philoinformatics (or at least that have something to do with computer science or philosophy)

Thursday, July 3, 2008

Getting rough data into the semantic web

It's nice when regular relational databases, Excel spreadsheets, and other sources that people have created can be attached to the semantic web. But there are problems with the way most data is recorded, especially data about time. In short, the problem is that people round values. This may seem like a trivial problem, but I don't think it is if you want to be able to use the rough data that makes up most of the world's data. Here are the issues:

1) Datatype granularity. Datatypes allow for a wide range of possible values, but not all values. In some situations we may need to know whether the value written is an exact match for the value that was intended. For instance, 1/3 can't be represented exactly as a double, and 3pm today can't be represented as a date without dropping the 3pm part.
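As a minimal sketch of both examples in Python (the variable names are only illustrative), `Fraction` can expose the exact value a double really stores, and converting a `datetime` to a `date` shows the time of day being dropped:

```python
from datetime import date, datetime
from fractions import Fraction

# 1/3 stored as a double is only the nearest representable binary fraction.
third_as_double = 1 / 3
exact_stored = Fraction(third_as_double)  # exact value held by the double
is_exact = exact_stored == Fraction(1, 3)

# "3pm today" (using the post's date) squeezed into a date-only datatype.
moment = datetime(2008, 7, 3, 15, 0)
as_date = moment.date()  # the 3pm part is gone
round_trip = datetime.combine(as_date, datetime.min.time())
time_lost = round_trip != moment

print(is_exact, time_lost)
```

In both cases the stored value alone gives no hint that it differs from the intended one.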

2) Granularity used vs datatype granularity. This is a much bigger problem than (1). People often record things at a granularity much coarser than the datatype allows. For instance, you may be recording distance to 2 decimal places and storing it as a float. This needs to be taken into account wherever values are compared. We don't want to say that two things have the same height just because they are almost the same height. Also, people round times to the minute, 5 minutes, 10 minutes, 15 minutes, the hour, and in many other ways. Are they rounding down? Are they rounding to the nearest? Are they using some other recording method?
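One way to make this concrete is to treat a recorded value as the range of true values that could have produced it. The sketch below assumes only two recording methods ("nearest" and "down") and half-open ranges; `Interval` and `possible_values` are hypothetical names, not part of any existing library.

```python
from datetime import datetime, timedelta
from fractions import Fraction
from typing import Any, NamedTuple


class Interval(NamedTuple):
    """Half-open range [low, high) of true values behind a recorded value."""
    low: Any
    high: Any

    def overlaps(self, other):
        return self.low < other.high and other.low < self.high


def possible_values(recorded, step, method):
    """Range of true values consistent with a recorded value."""
    if method == "nearest":
        return Interval(recorded - step / 2, recorded + step / 2)
    if method == "down":
        return Interval(recorded, recorded + step)
    raise ValueError(f"unknown recording method: {method}")


# Heights recorded to 2 decimal places, assumed rounded to the nearest 0.01.
step = Fraction("0.01")
a = possible_values(Fraction("1.80"), step, "nearest")
b = possible_values(Fraction("1.80"), step, "nearest")
c = possible_values(Fraction("1.82"), step, "nearest")
maybe_same = a.overlaps(b)      # the heights *could* be equal, not "are equal"
surely_differ = not a.overlaps(c)

# A time recorded as 3:00pm at 15-minute granularity means different things
# depending on the recording method.
t = datetime(2008, 7, 3, 15, 0)
quarter = timedelta(minutes=15)
rounded_down = possible_values(t, quarter, "down")
rounded_nearest = possible_values(t, quarter, "nearest")
```

Under these assumptions, two identical recorded heights only justify "possibly the same height", while non-overlapping ranges justify "definitely different".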

An ontology of commonly used recording methods, used to describe how the data was recorded, would allow better-justified inferences to be made about the data. This would allow a higher level of trust (in the second sense, explained here) when using the data, which is going to be of great importance as the semantic web grows.
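To give a rough idea of what such descriptions might look like, here is a sketch using plain subject-predicate-object tuples. Every term with the `rec:` or `ex:` prefix is invented for illustration; no such vocabulary is assumed to exist.

```python
# Hypothetical vocabulary: all "rec:" and "ex:" terms are made up for this sketch.
triples = [
    ("ex:arrivalTime", "rec:granularity", "PT15M"),  # ISO 8601 duration: 15 minutes
    ("ex:arrivalTime", "rec:roundingMethod", "rec:RoundDown"),
    ("ex:height", "rec:granularity", "0.01"),
    ("ex:height", "rec:roundingMethod", "rec:RoundToNearest"),
]


def describe(prop, triples):
    """Collect the recording-method statements made about one property."""
    return {p: o for s, p, o in triples if s == prop}


arrival = describe("ex:arrivalTime", triples)
height = describe("ex:height", triples)
```

A consumer of the data could consult these statements before deciding whether two recorded values may be treated as equal.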
