Sunday, October 18, 2009

Data sets for keyphrase extraction and topic indexing

A few months ago I made available several collections that researchers can use to develop and evaluate their systems for tasks related to topic indexing. But I didn't blog about it! In the meantime, people still found the data and downloaded it. Here is a summary of the collections in order of their popularity.

The most popular collection is Wiki-20. It contains 20 computer science documents, each with main topics assigned independently by 15 teams of graduate students. So, each document has 15 sets of Wikipedia articles that represent the main topics in it, according to the students.
This data set was downloaded 19 times.
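
As a rough illustration of how such multiply indexed data could be used for evaluation, here is a minimal sketch that compares one candidate topic set against several human topic sets for the same document. The Dice-style overlap measure, the function names, and the example topics are all my assumptions for illustration; they are not taken from the Wiki-20 files, and other agreement measures would work just as well.

```python
def consistency(a, b):
    """Dice-style overlap between two topic sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as fully consistent
    return 2 * len(a & b) / (len(a) + len(b))


def mean_consistency(candidate, human_sets):
    """Average overlap of one candidate topic set with each human topic set."""
    return sum(consistency(candidate, h) for h in human_sets) / len(human_sets)


# Hypothetical topic sets for a single document (not real Wiki-20 data).
human_sets = [
    {"Machine learning", "Information retrieval"},
    {"Machine learning", "Data mining"},
    {"Information retrieval", "Search engine"},
]
system_topics = {"Machine learning", "Information retrieval"}

score = mean_consistency(system_topics, human_sets)
print(round(score, 3))
```

With 15 teams per document, the same function could also be applied to each team's set against the other 14 to get a human baseline to compare a system against.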

There is also a corpus of 180 documents indexed by human taggers, which can be used for keyphrase extraction and tagging. It was harvested from the data on CiteULike and includes only high-quality tags. This data set (CiteULike-180) was downloaded 15 times.

There are two corpora for keyphrase indexing, i.e. assignment of terms from controlled vocabularies:
  1. FAO-780 with 780 documents indexed by just one indexer (9 downloads).
  2. FAO-30 with 30 documents indexed independently by 6 professional indexers each (so far 8 downloads).
This data was provided by the Food and Agriculture Organization of the United Nations. Agrovoc serves as the vocabulary, but any other vocabulary in SKOS format can be used in a similar way.

These statistics suggest that traditional term assignment is no longer a popular research topic. Tagging and, in particular, indexing with terms from Wikipedia are researched more actively. It is great to see that Wikipedia gets the attention it deserves and that my idea of using it as a controlled vocabulary was picked up by others.

More information is also available on the wiki page explaining the multiply indexed data sets and on a page listing other resources for topic indexing.
