Monday, October 19, 2009

How researchers and software engineers can make use of data in Wikipedia

Wikipedia is not just an online encyclopedia where people can look up, free of charge, definitions written by other people. It is also a very powerful resource for developing useful tools for analyzing written language. Just search for Wikipedia on Google Scholar and you will find over 140,000 research papers! This post is about practical applications of the data stored in Wikipedia.

At workshops on Wikipedia-related research held at prestigious AI conferences, e.g. WikiAI at IJCAI and People's Web meets NLP at ACL, I have learned about some pretty amazing things one can implement using Wikipedia data, from computing semantic relatedness between words at a level comparable to humans to converting folksonomies of tags into ontologies. In general, the organizers distinguish between how Wikipedia can help AI solve language-related tasks and how AI can help Wikipedia improve its quality and fight vandalism.
Let's concentrate on the former.

A major barrier here is Wikipedia's sheer size (and it keeps growing!). There are several tools available for processing the Wikipedia dumps, but the best one I have found so far is the open source Java library Wikipedia Miner. That is not just because it was developed at the University of Waikato, where I studied. The reasons are its easy installation, an intuitive object-oriented model for accessing Wikipedia data, additional tools for computing similarity between any two English phrases and for wikifying any text (i.e. linking its phrases to concepts explained in Wikipedia), and even ready-made web services. Check out the online demos:
  • Search. If you search for a word like palm, it will list all the possible meanings, starting with the most likely one (Arecaceae - 65% likelihood) and followed by the others, like Palm (PDA) and the palm of the hand. Clicking on a meaning shows the words used to refer to it, e.g. redirects like Palmtree and anchors like palm wood, as well as translations (language links) and broader terms (categories).

  • Compare. Here you can compute how related two terms are and see, for example, that vodka is more related to wine than to water.

  • Wikify. This feature helps you find out which concepts are discussed in any text or on any website. It is particularly practical for texts with many named entities, e.g. news articles, but works for other texts as well. Here is this blog wikified with Wikipedia Miner (the links are added in blue and at the highest possible density).
Note: all this information can be easily accessed from your application when using the Wikipedia Miner library. I've done it in my topic indexing tool Maui and can tell you that it's relatively easy.
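
To give a rough idea of what could sit behind the Search and Compare demos, here is a minimal Python sketch. It is not the Wikipedia Miner API (the library is Java), and the function names, the toy counts and the toy link sets are all made up for illustration. It assumes that the likelihood of a meaning is estimated from how often a label links to each article, and that relatedness is computed from the overlap of incoming links, in the style of the link-based measure published by Milne and Witten.

```python
import math


def sense_likelihoods(anchor_counts):
    """Rank the candidate meanings of a label by relative link frequency.

    anchor_counts maps an article title to the number of times the label
    was used as link text pointing at that article (hypothetical input).
    """
    total = sum(anchor_counts.values())
    if total == 0:
        return []
    ranked = [(title, count / total) for title, count in anchor_counts.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def link_relatedness(links_a, links_b, total_articles):
    """Relatedness in [0, 1] from the sets of articles linking to a and to b.

    Assumption: a normalized-distance formula over incoming links; more
    shared incoming links means a higher score.
    """
    common = links_a & links_b
    if not common:
        return 0.0
    larger = max(len(links_a), len(links_b))
    smaller = min(len(links_a), len(links_b))
    denominator = math.log(total_articles) - math.log(smaller)
    if denominator <= 0:
        return 0.0  # degenerate case: one article is linked from everywhere
    distance = (math.log(larger) - math.log(len(common))) / denominator
    return max(0.0, 1.0 - distance)


# Toy data, invented to mirror the examples in the post.
palm_counts = {"Arecaceae": 65, "Palm (PDA)": 20, "Hand": 15}
vodka_links = {1, 2, 3, 4, 5, 6}
wine_links = {3, 4, 5, 6, 7, 8, 9, 10}
water_links = {6, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

print(sense_likelihoods(palm_counts))
print(link_relatedness(vodka_links, wine_links, 1000))
print(link_relatedness(vodka_links, water_links, 1000))
```

With these toy sets, vodka shares four incoming links with wine and only one with water, so the first relatedness score comes out higher than the second. Real link counts from a Wikipedia dump would of course be far larger.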

Many other people are already using Wikipedia Miner (100 downloads in the last 6 months). It has also been referenced as a research tool in various published projects, including adding structure to search queries, finding similar videos, extending ontologies, creating language learning tools and many more.

8 comments:

  1. Since Maui is language-independent, can I infer that Wikipedia Miner is also language-independent?

  2. Theoretically, yes! I know that people are using Wikipedia Miner with other languages as well. The approach is definitely language-independent. There are some inconsistencies in how categories are represented across Wikipedia languages, but Dave Milne (the creator) has been working on these.

    I am not sure how much work is involved in making it work on Wikipedia dumps in other languages, because I haven't tried it out yet. You'd need to know some Perl and hack the download scripts (I think).

  3. The approach is definitely language-independent. There are some inconsistencies in how categories are represented across Wikipedia languages.
    __________________
    Kim

  4. Good point, Kim. Have you done any work on this, or could you point us to work by others? Thanks, Alyona

  5. Any chance a version of Maui will come out which supports Wikipedia-miner 1.2?

  6. This comment has been removed by the author.

  7. No, unfortunately this won't be possible from my side. However, Maui is an open-source project. Anybody can contribute! See

    maui-indexer.googlecode.com
