To train a model, I used a collection of 20 computer science papers that graduate students had indexed in an experiment for my AAAI paper "Topic indexing with Wikipedia".
I then applied this model to generate ten keywords and keyphrases for the Introduction chapter of my PhD thesis. I am very pleased with the results:
- Index (search engine)
- Natural language
- Index (information technology)
- Machine learning
- Training set
- Information retrieval
The best thing about keyphrase extraction with Wikipedia is that it links each term to a concept, namely the Wikipedia article that explains the term. As a result, the keyphrase sets are consistent: they will all refer to "Dissertation" and not to "PhD thesis" or "Doctoral thesis", regardless of which term the document actually uses to express the concept.
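To illustrate the consistency argument, here is a minimal Python sketch. The lookup table `SURFACE_TO_CONCEPT` and the function `to_concept` are hypothetical names invented for this example; the table is a hand-written toy and not how the extraction tool actually resolves terms against Wikipedia.

```python
from typing import Optional

# Toy lookup table (an assumption for illustration only): several surface
# forms that all point to the same Wikipedia article title.
SURFACE_TO_CONCEPT = {
    "dissertation": "Dissertation",
    "phd thesis": "Dissertation",
    "doctoral thesis": "Dissertation",
}


def to_concept(phrase: str) -> Optional[str]:
    """Return the article title for a phrase, or None if it is not in the table."""
    return SURFACE_TO_CONCEPT.get(phrase.strip().lower())


if __name__ == "__main__":
    for phrase in ["PhD thesis", "Doctoral thesis", "Dissertation"]:
        print(phrase, "->", to_concept(phrase))
```

Whichever wording a document uses, the keyphrase that ends up in the index is the same article title.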
Each topic can be linked directly to its Wikipedia article by adding "http://en.wikipedia.org/wiki/" in front of the term, as I did in the list above.
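The linking step can be sketched in a few lines of Python. The function name `wikipedia_url` is my own for this example. I also assume that spaces in a term are written as underscores in the URL and that any other unusual characters should be percent-encoded; the prefix is the one given above.

```python
from urllib.parse import quote

WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def wikipedia_url(term: str) -> str:
    """Build an article URL by putting the Wikipedia prefix in front of a term."""
    # Assumption: spaces become underscores in article URLs.
    title = term.strip().replace(" ", "_")
    # Keep parentheses readable; percent-encode anything else that is unsafe.
    return WIKI_PREFIX + quote(title, safe="()")


if __name__ == "__main__":
    for term in ["Index (search engine)", "Machine learning", "Training set"]:
        print(wikipedia_url(term))
```

This only builds the address; it does not check that an article with that title exists.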