A blog about ideas relating to philoinformatics (or at least that have something to do with computer science or philosophy)

Tuesday, August 5, 2008

Large and Small Controlled Vocabularies

I've been thinking about what to build with Entity Describer (ED), a project my co-workers have been working on. I thought, and still think, that I should make a recommender, because ED should contain better tag data than almost any other tag set: its tags are drawn from Freebase as a controlled vocabulary. A controlled vocabulary gives you spelling correction and possibly disambiguation, and because the tags come from Freebase, ED also gets a very large user-generated vocabulary and connections between the tags (aka semantics). But I realized that the tag data generated by ED doesn't contain information about what users like and dislike (as of July 2008), though you could guess that most users tag things they like much more often than things they dislike.
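To make that guess concrete, here is a minimal sketch of treating every tagging event as a weak "like" signal. The `(user, item, tag)` tuple format and the function name `implicit_preferences` are my own assumptions for illustration; I'm not describing how ED actually stores its data.

```python
from collections import defaultdict


def implicit_preferences(tag_events):
    """Count how many tags each user applied to each item.

    tag_events is an iterable of (user, item, tag) tuples. This is a
    hypothetical format, not ED's real export. A higher count is read as
    a (very rough) sign that the user liked the item.
    """
    scores = defaultdict(lambda: defaultdict(int))
    for user, item, _tag in tag_events:
        scores[user][item] += 1
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(items) for user, items in scores.items()}


events = [
    ("ann", "page1", "Philosophy"),
    ("ann", "page1", "Logic"),
    ("ann", "page2", "Spam"),
    ("bob", "page1", "Logic"),
]
preferences = implicit_preferences(events)
```

The "Spam" tag shows the weakness of this guess: ann's tag on page2 counts as a like even though it probably isn't one, which is exactly the information an explicit like/dislike vocabulary would add.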

Fortunately, ED will soon allow you to choose which vocabulary you would like to use if Freebase topics aren't right for your tagging needs. So I thought I would create a few vocabularies including:
  • Thumbs up and down
  • Agree/Disagree (more concrete, but less usable than thumbs)
  • Consistent, Valid, Sound, Straw Man, etc. (for arguments, and maybe good for philosophy papers)
  • Stars, on a scale that starts at 0 or 1 and ends at 4, 5, or 10 (these seem popular)
  • "Effects on mentality" like frightening, enlightening, and annoying (kind of like TED talk ratings)
  • Emotions (similar to effects on mentality)
    or my personal favourite
  • "like this", "want people to see this", "want to see more like this", "agree with this", and the inverses (a small set of reasons I've used thumbs up or thumbs down with the Firefox StumbleUpon add-on)
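As a rough sketch, the last vocabulary in the list above could be written down as a small closed set of terms, with a check that rejects anything outside it. The class `ControlledVocabulary`, the `inverse` helper, and the "do not ..." wording for the inverses are all assumptions of mine for illustration; none of this is ED's or Freebase's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlledVocabulary:
    """A closed set of allowed tags (hypothetical, for illustration only)."""

    name: str
    terms: frozenset

    def validate(self, tag):
        """Return the normalized tag, or raise ValueError if it is not allowed."""
        normalized = tag.strip().lower()
        if normalized not in self.terms:
            raise ValueError(f"{tag!r} is not in the {self.name!r} vocabulary")
        return normalized


POSITIVE_REASONS = (
    "like this",
    "want people to see this",
    "want to see more like this",
    "agree with this",
)


def inverse(term):
    # Assumed wording for "the inverses"; the real labels could differ.
    return "do not " + term


REASONS = ControlledVocabulary(
    name="reasons",
    terms=frozenset(POSITIVE_REASONS) | frozenset(inverse(t) for t in POSITIVE_REASONS),
)
```

With only eight terms, `REASONS.validate("  Like This ")` normalizes the input to a known tag, and anything else is rejected outright rather than stored as a new, possibly misspelled tag.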
I could create these vocabularies, but why not add them to Freebase and let people use them from there? This got me thinking about something more general.

One great reason to use Freebase for tags is that there are so many of them. But when you use the tag data to make (say) a recommender, a large vocabulary has a cost: the user may not have found the ideal tags, so what they actually felt about something differs slightly from what they expressed by tagging it. This doesn't happen with a sufficiently small vocabulary, where you know the user found the best description available. But then they may not have been able to express themselves within the limited vocabulary. So either way, big or small, the tags may not reflect what the user has in mind.

There are definitely ways to alleviate this problem, and the best size depends entirely on how and what your users are tagging. I think the small vocabularies above would all work better outside of Freebase, but tags describing the subject of a website work better within Freebase. What do you think? How significant is this problem? When is it most significant? Do you think a vocabulary of content types (e.g. Blog, Video, News, Article, Game, etc.) is better inside or outside of Freebase?