Friday, June 26, 2009

Document keywords from Wikipedia

Maui can be used to identify the main topics in a document and express them as well-formed titles of Wikipedia articles.

To create a training model, I used a collection of 20 computer science papers indexed by graduate students in an indexing experiment for my AAAI paper "Topic indexing with Wikipedia".

Then I applied this model to generate ten keywords and keyphrases for the Introduction chapter of my PhD Thesis. I am very pleased with the results:


The best thing about keyphrase extraction with Wikipedia is that it links each term to a concept: the Wikipedia article explaining that term. Thus, the resulting keyphrase sets will be consistent: they will all refer to "Dissertation" and not "PhD thesis" or "Doctoral thesis", regardless of which term is actually used in the document to express this concept.
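To make the consistency point concrete, here is a minimal sketch of mapping different surface forms to one concept. The lookup table and the function names are hypothetical and written by hand for illustration; this is not how Maui itself resolves terms to Wikipedia articles.

```python
# Hypothetical, hand-written table: phrase as it appears in a document
# -> title of the Wikipedia article describing the concept.
SURFACE_TO_ARTICLE = {
    "phd thesis": "Dissertation",
    "doctoral thesis": "Dissertation",
    "dissertation": "Dissertation",
}


def to_concept(phrase):
    """Return the article title for a phrase, or None if it is unknown."""
    return SURFACE_TO_ARTICLE.get(phrase.strip().lower())


def consistent_keyphrases(phrases):
    """Map phrases to article titles, dropping unknowns and duplicates."""
    concepts = []
    for phrase in phrases:
        concept = to_concept(phrase)
        if concept is not None and concept not in concepts:
            concepts.append(concept)
    return concepts


if __name__ == "__main__":
    print(consistent_keyphrases(["PhD thesis", "Doctoral thesis"]))
```

Whichever wording the document uses, the keyphrase set ends up with the single title "Dissertation".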

Each topic can be linked directly to its Wikipedia article by adding "http://en.wikipedia.org/wiki/" in front of the term, as I did in the above list.
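The linking step can be sketched in a few lines of Python. The function name `wikipedia_url` is made up for this example, and the handling of spaces (replaced by underscores) and other special characters (percent-encoded) is my assumption about what a multi-word title needs in order to form a valid URL; for a single-word topic it reduces to plain prefixing.

```python
from urllib.parse import quote

WIKIPEDIA_PREFIX = "http://en.wikipedia.org/wiki/"


def wikipedia_url(topic):
    """Build an article URL by putting the prefix in front of the topic title."""
    # Assumption: spaces in a title are written as underscores in the URL.
    title = topic.strip().replace(" ", "_")
    # Percent-encode anything else that is unsafe in a URL path,
    # leaving parentheses readable.
    return WIKIPEDIA_PREFIX + quote(title, safe="_()")


if __name__ == "__main__":
    # "Computer science" is only an illustrative multi-word topic.
    for topic in ["Dissertation", "Computer science"]:
        print(topic, "->", wikipedia_url(topic))
```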

2 comments:

  1. Interesting stuff. Is the Wiki20 dataset the one being mentioned here? I guess so, and I will check the README inside it, if there is one. BTW, the links in this post that point to Waikato pages seem to be broken.

  2. Yes, that's correct. Thanks for pointing out the broken links. Unfortunately, the content is gone now...
