A blog about ideas relating to philoinformatics (or at least that have something to do with computer science or philosophy)

Tuesday, August 5, 2008

Large and Small Controlled Vocabularies

I've been thinking about something to build using a project called Entity Describer (ED) that my co-workers have been working on. I thought, and still think, I should make a recommender, because ED should contain better tag data than basically any other tag set: its tags are drawn as a controlled vocabulary from Freebase. A controlled vocabulary gives you spelling correction and possibly disambiguation, but because the tags come from Freebase, ED also has a very large user-generated vocabulary and connections between the tags (a.k.a. semantics). But I realized that the tag data generated by ED doesn't contain information about what users like and dislike (as of July 2008), though you could guess that most users tag things they like much more often than things they dislike.

Fortunately, ED will soon allow you to choose which vocabulary you would like to use if Freebase topics aren't right for your tagging needs. So I thought I would create a few vocabularies, including:
  • Thumbs up and down
  • Agree/Disagree (more concrete, but less usable than thumbs)
  • Consistent, Valid, Sound, Straw Man, etc. (for arguments, and maybe good for philosophy papers)
  • Stars, from a minimum of 0 or 1 up to a maximum of 4, 5, or 10 (this scheme seems popular)
  • "Effects on mentality" like frightening, enlightening, and annoying (kind of like TED talk ratings)
  • Emotions (similar to effects on mentality)
    or, my personal favourite,
  • "like this", "want people to see this", "want to see more like this", "agree with this", and the inverses (a small set of reasons I've used thumbs up or thumbs down with the Firefox StumbleUpon add-on)
I could create these vocabularies, but why not add them to Freebase and let people use them from there? This got me thinking about something more general.

One great reason to use Freebase for tags is that there are so many of them. But when you try to use the tag data to build (say) a recommender, the user may not have found their ideal tags, so what a person actually felt about something can differ slightly from what they expressed by tagging it. That doesn't happen with a sufficiently small vocabulary: you know the user found the best description available, but they may not have been able to express themselves with such a limited vocabulary. So either way, big vocabulary or small, the tags may not reflect what the user had in mind.

There are definitely ways to alleviate this problem, and the best size depends entirely on exactly how and what your users are tagging. I think the small vocabularies above would all work better outside of Freebase, but tags describing the subject of a website work better within Freebase. What do you think? How significant is this problem? When is it most significant? Do you think a vocabulary of content types (e.g. Blog, Video, News, Article, Game) is better inside or outside of Freebase?

Friday, July 25, 2008

Putting my old philosophy papers online

I'm going to gather all the old philosophy papers I've written over the years (there aren't very many) and post them along with my current opinions of them. I don't know how useful or interesting this will be, but it might be fun.

Wednesday, July 16, 2008

Types of "trust" needed in semantic web

Anyone (or anything) can potentially publish RDF data. Deciding whether you should use their data raises many issues of trust. There are (at least) two important categories of trust when it comes to the semantic web.

1) Personal Trust
Personal trust amounts to trusting that the creator of the data has the right motives. Spam is a good example. This kind of trust will become more important as the semantic web grows, but isn't a major problem yet.

2) Reliability Trust
Being able to trust data found on the semantic web requires knowing how reliable the data is. Was it created by experts? Is there a peer review mechanism? Was it created by automatic natural language processing? Currently this kind of trust is also not too important, but I believe it will become very important much faster than personal trust. We already know how to solve the reliability trust issues: with more metadata. We will need data about how the data was generated. (Note that implementing this creates more options for lying about data, and so doesn't really help with personal trust issues.)
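
To give a rough idea of the kind of metadata I mean, the record attached to a chunk of published data might carry fields like the ones below. The names are just my own guesses for illustration, not any existing vocabulary, and a real version would be an RDF vocabulary rather than a Java class:

    // A rough sketch of reliability metadata for a published dataset.
    // All field names are made up for illustration.
    public class ProvenanceRecord {
        String sourceDataset;     // where the data came from
        String generationMethod;  // e.g. "expert curation", "NLP extraction", "user tagging"
        boolean peerReviewed;     // is there a review mechanism behind it?
        String generatedBy;       // the person, organization, or program responsible
        java.util.Date generatedOn;
    }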

Thursday, July 3, 2008

Getting rough data into the semantic web

It's nice when regular relational databases, Excel spreadsheets, and other sources that people have created can be attached to the semantic web. But there are problems with the way most data is recorded, especially data about time. In short, the problem is that people round values. This may seem like a trivial, unimportant problem, but I don't think it is if you want to be able to use the rough data that makes up most of the world's data. Here are the issues:

1) Datatype granularity. Datatypes allow for a wide range of possible values, but not all values. In some situations we may need to know whether the data value written down is an exact match to the value that was intended. For instance, 1/3 can't be represented fully as a double, and 3pm today can't be represented as a date without dropping the 3pm part.

2) Granularity used vs datatype granularity. This is a much bigger problem than (1). People often record things at a granularity that is much coarser than the datatype's granularity. For instance, you may be recording distance to 2 decimal places and storing it as a float. This needs to be taken into account wherever values are compared: we don't want to say that two things have the same height just because they are almost the same height. Also, people round times to the minute, to 5 minutes, to 10 minutes, to 15 minutes, to the hour, and in many other ways. Are they rounding down? Rounding to the nearest? Using some other recording method?
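
Here is a minimal sketch of what taking recorded granularity into account could look like when comparing values. The class and method names are made up for illustration, and it assumes values were rounded to the nearest multiple of their granularity, which, as noted above, is exactly the kind of thing you often don't know:

    // A value together with the granularity it was recorded at.
    // E.g. a height recorded to 2 decimal places has granularity 0.01.
    public class RecordedValue {
        private final double value;
        private final double granularity;

        public RecordedValue(double value, double granularity) {
            this.value = value;
            this.granularity = granularity;
        }

        // Assuming each value was rounded to the nearest multiple of its
        // granularity, the true value lies within +/- granularity/2 of the
        // recorded one. Two recorded values might be equal if those ranges overlap.
        public boolean possiblyEqual(RecordedValue other) {
            double tolerance = (this.granularity + other.granularity) / 2.0;
            return Math.abs(this.value - other.value) <= tolerance;
        }
    }

So a height recorded as 1.80m (to the centimetre) and one recorded as 1.8m (to 10cm) would count as possibly equal, while 1.80m and 1.84m, both recorded to the centimetre, would not.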

An ontology of commonly used recording methods, used to describe the data recording process, would allow more justified inferences about the data. It would also allow a higher level of trust (in the second sense, explained here) when using the data, which is going to be of great importance as the semantic web grows.

Friday, June 13, 2008

Pellet + OWLAPI + SWRL

I've found over the last couple of months that there are a LOT of different ways to mess up when trying to use SWRL, Pellet, and OWLAPI. I'll try to list everything I can remember getting stuck on, or potentially getting stuck on. Feel free to add your own in the comments.

Java Heap Overflow errors:
  1. Remember that you can give Java more memory in Eclipse for a single project by going to Run -> Open Run Dialog -> Arguments and adding -Xmx1024M to your VM arguments (or some other value instead of 1024).

  2. A great solution is to break your data into separate named graphs when you know no important information crosses between the graphs. For me, this fixed everything (especially in FaCT++, where it fixed speed issues as well). I had data about patients that I knew were not the same patient, so I reason about each patient one at a time (see the sketch after this list).

  3. Still not enough? I'm not sure what to say yet, but I have some admittedly half-baked guesses: try KAON2, because I've heard it scales better (though it might only scale for speed); try a database backend with an RDF-to-SQL mapper (such as D2R Server); or try OWL-Lite (if it's expressive enough for you). None of these may do anything to fix your problem, but they are what I'm going to try next if I hit it again. Please tell me if these options would or wouldn't help.

  4. If you're using SWRL rules, try the Rete algorithm (if you're using a version of Pellet pre-1.5.2). Turn the Rete algorithm on with this line: PelletOptions.USE_CONTINUOUS_RULES = true;
    I've heard there are some specific cases where this algorithm is slower, but it usually isn't.

  5. This probably won't happen to you because you are smarter than me. But make sure that when you are writing "facts" between individuals, you don't refer to classes. Write it like this:
    axioms.add( factory.getOWLObjectPropertyAssertionAxiom( personInd, hasWeightOP, weightInd ) );

    and not like this:
    OWLObjectValueRestriction hasWeightTheWeight = factory.getOWLObjectValueRestriction( hasWeightOP, weightInd );
    axioms.add( factory.getOWLClassAssertionAxiom( personInd, hasWeightTheWeight ) );
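
Here is roughly what the per-patient loop from point 2 above looks like. loadOntologyForPatient, createReasoner, and processResults are placeholders standing in for whatever OWLAPI/Pellet setup you already have, so treat this as a sketch of the pattern rather than working code:

    // Sketch: reason about each patient's data in isolation instead of
    // loading everything into one huge ontology. The helper methods are
    // placeholders for your own loading, reasoner creation, and querying code.
    void reasonPerPatient(List<String> patientIds) {
        for (String patientId : patientIds) {
            OWLOntology patientOntology = loadOntologyForPatient(patientId); // only this patient's axioms
            OWLReasoner reasoner = createReasoner(patientOntology);

            processResults(patientId, reasoner); // query just this one patient

            reasoner.dispose(); // release memory (however your reasoner does it) before the next patient
        }
    }

Memory use is then bounded by the largest single patient instead of the whole data set, which is why this fixed my heap problems.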

SWRL rules just won't work:

  1. Is your SWRL rule safe? I saw a few different definitions of "SWRL rule safety", but the one that seems to be right is: make sure all individuals that are involved are explicitly added! For example, if you have a rule like this:
    Person(?x) ^ drives(?x,?y) ^ Car(?y) -> Driver(?x)
    and you have an instance of Person which is also an instance of "drives some Car", but you don't have the instance of Car that he drives explicitly added, the rule will not return the person instance as a Driver instance (see the sketch after this list). For more, check this part of the ProtegeWiki: SWRLLanguageFAQ.

  2. Simplify rules as much as possible: don't use negation (NOT) and don't use disjunction (OR). I've never had a problem with this myself (or even tried it), but I seem to remember people running into trouble with it... somewhere.
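
For the driver rule in point 1, the fix is just to assert the car individual and the drives fact explicitly, in the same style as the weight example above. The variable names (myCarInd, carClass, drivesOP) are made up for this example:

    // Explicitly add the car individual and the drives relationship, so the
    // rule Person(?x) ^ drives(?x,?y) ^ Car(?y) -> Driver(?x) can actually fire.
    axioms.add( factory.getOWLClassAssertionAxiom( myCarInd, carClass ) );
    axioms.add( factory.getOWLObjectPropertyAssertionAxiom( personInd, drivesOP, myCarInd ) );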

Reasoning takes too long!
  1. I'm still working on this... but reducing expressivity can't hurt. It sounds like OWL-Lite would make a huge difference for speed, but you have to sacrifice so much that, personally, it never seems worth it unless you want structured tags and don't care about answering queries.
  2. Try FaCT++. It was orders of magnitude faster when I used the iterative, per-patient reasoning approach explained above.