Monday, October 19, 2009

How researchers and software engineers can make use of data in Wikipedia

Wikipedia is not just an online encyclopedia, where anyone can quickly and free of charge look up definitions written by other people. It is also a very powerful resource for building tools that analyze written language. Just search for Wikipedia on Google Scholar and you will find over 140,000 research papers! This post is about practical applications of the data stored in Wikipedia.

At workshops on Wikipedia-related research held at prestigious AI conferences, e.g. WikiAI at IJCAI and People's Web meets NLP at ACL, I have learned about pretty amazing things one can implement using Wikipedia data, from computing semantic relatedness between words at a level comparable to humans to converting folksonomies of tags into ontologies. In general, the organizers distinguish between how Wikipedia can help AI solve language-related tasks and how AI can help Wikipedia improve its quality and fight vandalism. Let's concentrate on the former.
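The human-comparable relatedness results mentioned above come from Wikipedia's link structure. A minimal sketch of one such approach, the link-based measure described by the Wikipedia Miner authors (all link sets and the article count below are made up for illustration):

```python
import math

def wlm_relatedness(links_a, links_b, total_articles):
    """Wikipedia Link-based Measure: relatedness of two articles from their
    shared incoming links (an adaptation of the Normalized Google Distance
    to Wikipedia's link graph)."""
    a, b = set(links_a), set(links_b)
    shared = a & b
    if not shared:
        return 0.0
    distance = (math.log(max(len(a), len(b))) - math.log(len(shared))) / \
               (math.log(total_articles) - math.log(min(len(a), len(b))))
    return max(0.0, 1.0 - distance)

# Toy incoming-link sets (article IDs are invented).
vodka = {1, 2, 3, 4, 5, 6}
wine  = {4, 5, 6, 7, 8, 9}
water = {6, 10, 11, 12, 13, 14}

N = 1000  # pretend total number of Wikipedia articles
print(wlm_relatedness(vodka, wine, N) > wlm_relatedness(vodka, water, N))  # True
```

Because vodka and wine share more incoming links than vodka and water, the measure ranks them as more related, which is exactly what the Compare demo shows.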

A huge barrier here is that Wikipedia is huge (and keeps growing!). There are tools available for processing the Wikipedia dumps, but the best one I have found so far is the open-source Java library Wikipedia Miner. Not just because it was developed at the University of Waikato, where I studied. The reasons are its easy installation, an intuitive object-oriented model for accessing Wikipedia data, additional tools for computing similarity between any two English phrases and for wikifying any text (i.e. linking its phrases to concepts explained in Wikipedia), and even ready-to-use web services. Check out the online demos:
  • Search. If you search for a word like palm, it lists all its possible meanings, starting with the most likely one (Arecaceae, 65% likelihood), followed by other meanings like Palm (PDA) and the palm of the hand. Clicking on a meaning shows the words used to refer to it, e.g. redirects like Palmtree and anchors like palm wood, as well as translations (language links) and broader terms (categories).

  • Compare. Here you can calculate, for example, that vodka is more related to wine than to water.

  • Wikify. This feature helps find out which concepts are discussed in any text or on any website. It is particularly practical for texts with many named entities, e.g. news articles, but not only for those. Here is this blog wikified with Wikipedia Miner (the links are added in blue, at the highest possible density).
Note: all this information can be easily accessed from your own application using the Wikipedia Miner library. I've done it in my topic indexing tool Maui and can tell you that it's relatively easy.
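To give an idea of what sits behind the Search likelihoods and the Wikify demo: the probability of each meaning of a phrase can be estimated from how often that phrase is used as anchor text linking to each article across Wikipedia. Here is a toy sketch of that idea (all counts and the tiny sense inventory are made up; the real Wikipedia Miner also disambiguates using surrounding context):

```python
# Toy sense inventory: how often each anchor text links to each article
# across Wikipedia (counts invented for illustration).
senses = {
    "palm": {"Arecaceae": 6500, "Palm (PDA)": 2100, "Hand": 1400},
    "java": {"Java (programming language)": 5200, "Java": 3800},
}

def commonness(anchor):
    """Probability of each sense of an anchor: link count / total links."""
    counts = senses[anchor]
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}

def wikify(text):
    """Baseline wikification: link each known anchor phrase in the text
    to its most common sense. (Context-based disambiguation and
    link-probability filtering are omitted from this sketch.)"""
    found = {}
    lowered = text.lower()
    for anchor in senses:
        if anchor in lowered:
            probs = commonness(anchor)
            found[anchor] = max(probs, key=probs.get)
    return found

print(commonness("palm")["Arecaceae"])  # 0.65, as in the Search demo example
print(wikify("I typed this on my Palm while sitting under a palm tree"))
```

Even this crude baseline links "palm" to Arecaceae, the most common sense; the hard cases are exactly those where the most common sense is the wrong one.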

Many other people are already using Wikipedia Miner (100 downloads in the last 6 months). It has also been referenced as a research tool in various published projects, including adding structure to search queries, finding similar videos, extending ontologies, creating language learning tools and many more.

Sunday, October 18, 2009

Data sets for keyphrase extraction and topic indexing

A few months ago I made available several collections that researchers can use to develop and evaluate their systems for tasks related to topic indexing. But I didn't blog about it! In the meantime people found the data anyway and downloaded it. Here is a summary of the collections, in order of their popularity.

The most popular collection is Wiki-20. It contains 20 computer science documents, each with its main topics assigned independently by 15 teams of graduate students. So each document comes with 15 sets of Wikipedia articles that represent its main topics, according to the students.
This data set was downloaded 19 times.

There is also a corpus for keyphrase extraction and tagging with 180 documents indexed by human taggers. It was harvested from the data on CiteULike and includes only high-quality tags. This data set, CiteULike-180, was downloaded 15 times.

There are two corpora for keyphrase indexing, i.e. assignment of terms from controlled vocabularies:
  1. FAO-780 with 780 documents, each indexed by just one indexer (9 downloads).
  2. FAO-30 with 30 documents, each indexed independently by 6 professional indexers (so far 8 downloads).
This data was provided by the Food and Agriculture Organization of the United Nations. Agrovoc serves as the vocabulary, but any other vocabulary in SKOS format can be used in the same way.
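Collections indexed by several people, like FAO-30, are handy for measuring inter-indexer consistency, which gives an upper bound on how well an automatic indexer can be expected to agree with any one human. A quick sketch using Rolling's measure (2 × shared terms / total terms assigned by a pair), with hypothetical Agrovoc term sets:

```python
from itertools import combinations

def rolling_consistency(set_a, set_b):
    """Rolling's inter-indexer consistency: 2*shared / (|a| + |b|)."""
    shared = len(set_a & set_b)
    return 2 * shared / (len(set_a) + len(set_b))

# Hypothetical Agrovoc term sets assigned to one document by three indexers.
indexers = [
    {"FISHERIES", "AQUACULTURE", "FISH CULTURE"},
    {"FISHERIES", "AQUACULTURE", "PONDS"},
    {"FISHERIES", "FISH CULTURE", "PONDS", "WATER"},
]

# Average consistency over all pairs of indexers.
pairs = list(combinations(indexers, 2))
avg = sum(rolling_consistency(a, b) for a, b in pairs) / len(pairs)
print(round(avg, 2))  # 0.6
```

With 6 indexers per document, FAO-30 yields 15 such pairs per document, which makes the average consistency estimate much more stable than a single-indexer collection could.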

These statistics suggest that traditional term assignment is no longer a popular research topic. Tagging and, particularly, indexing with terms from Wikipedia are researched more actively. It is great to see that Wikipedia gets the attention it deserves and that my idea of using it as a controlled vocabulary was picked up by others.

More information is also available on the wiki page explaining the multiply indexed data sets and on a page listing other resources for topic indexing.