The most popular collection is Wiki-20. It contains 20 computer science documents, each with main topics assigned independently by 15 teams of graduate students. So, each document has 15 sets of Wikipedia articles that represent the main topics in it, according to the students.
This data set was downloaded 19 times.
There is also a corpus that can be used for keyphrase extraction and tagging, with 180 documents indexed by human taggers. It has been harvested from the data on CiteULike and includes only high-quality tags. This data set (CiteULike-180) was downloaded 15 times.
There are two corpora for keyphrase indexing, i.e. assignment of terms from controlled vocabularies:
- FAO-780 with 780 documents indexed by just one indexer (9 downloads).
- FAO-30 with 30 documents indexed independently by 6 professional indexers each (so far 8 downloads).
These statistics indicate that traditional term assignment is no longer a popular research topic. Tagging and, in particular, indexing with terms from Wikipedia are researched more actively. Great to see that Wikipedia gets the attention it deserves and that my idea of using it as a controlled vocabulary was picked up by others.
More information is also available on the wiki page explaining the multiply indexed data sets and on a page listing other resources for topic indexing.