Monday, July 13, 2009

Useful web resources related to automatic topic indexing

Tools (in alphabetical order):
for keyword and keyphrase extraction, tagging, autotagging, terminology extraction, term assignment, text classification, text categorization, topical metadata extraction and topic indexing with Wikipedia
  • Bibclassify - A module in CDS Invenio (CERN’s document server software) for automatic assignment of terms from SKOS vocabularies, developed on the High Energy Physics vocabulary. Developed in a collaboration between CERN and DESY. There is also a hacking guide.

  • Extractor - Commercial software for keyword extraction in different languages. There is also a demo. Developed at the National Research Council of Canada.

  • Keyphrase extraction algorithm Kea. Can be used for both automatic keyphrase extraction and term assignment with controlled vocabularies. Developed at the University of Waikato.

  • Multi-purpose topic indexing algorithm Maui. Suitable for automatic term assignment, subject indexing, keyword extraction, keyphrase extraction, indexing with Wikipedia, autotagging and terminology extraction. Developed at the University of Waikato. Maui is also available on SourceForge.

  • TerMine - a term extraction tool developed at the National Centre for Text Mining.

  • Topia term extractor - Part-of-speech and frequency-based term extraction tool implemented in Python. Here is a term extraction demo based on this tool implemented by

  • Orchestr8 Keyword Extraction - An API-based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages.

  • Wikifier – An online demo of detecting Wikipedia articles in text developed at the Language and Information Technologies research group at the University of North Texas.

  • Wikipedia Miner – An API for accessing Wikipedia data, which also provides a tool for mapping any document to a set of relevant Wikipedia articles, similar to indexing with Wikipedia. Developed at the University of Waikato. Demo 1 and demo 2.

  • SEO keyword extraction - An online keyword and keyphrase extraction tool for search engine optimization

  • Scorpion – OCLC’s tool for automatic classification of documents.

  • – A demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as names of people and locations.

  • Yahoo term extraction - Web-service based content analysis via term extraction, includes a demo.
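
For readers who want a feel for what the simplest kind of keyword extraction looks like, here is a minimal sketch in Python. It is not the algorithm of any tool listed above: the `extract_keywords` name, the tiny stopword list and the ranking by raw frequency are all assumptions made for illustration. Tools such as Kea and Maui rely on trained models and richer features.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools use far longer ones).
STOPWORDS = {
    "a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are",
    "with", "by", "that", "this", "it", "as", "from",
}


def extract_keywords(text, top_n=5, max_len=2):
    """Return the top_n most frequent words and phrases of up to max_len words.

    Candidate phrases are runs of consecutive non-stopword tokens; a stopword
    or a punctuation mark ends the current run.
    """
    # Keep punctuation as separate tokens so phrases never cross a sentence break.
    tokens = re.findall(r"[a-z]+|[.,;:!?]", text.lower())
    counts = Counter()
    run = []
    # The trailing None is a sentinel that flushes the last run.
    for tok in tokens + [None]:
        if tok is None or not tok.isalpha() or tok in STOPWORDS:
            # Count every n-gram (n = 1..max_len) inside the finished run.
            for n in range(1, max_len + 1):
                for i in range(len(run) - n + 1):
                    counts[" ".join(run[i:i + n])] += 1
            run = []
        else:
            run.append(tok)
    return [phrase for phrase, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = (
        "Topic indexing assigns topics. Automatic topic indexing uses "
        "keyphrase extraction. Topic indexing helps."
    )
    print(extract_keywords(sample, top_n=3))
```

Raw frequency favours common words, which is one reason the tools above add part-of-speech filters, controlled vocabularies or Wikipedia-based features on top of counting.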

Vocabularies in SKOS format and test data:
More resources:
  • NLM Indexing Initiative – Website about National Library of Medicine’s project on automatic indexing using MeSH terms. Research details, evaluation and examples.
  • Dublin Core tools – A list of tools for automatic extraction of Dublin Core metadata
  • ASI resources – List of commercial software for back-of-the-book indexing, provided by the American Society for Indexing
  • ANZSI resources – List of software tools provided by the Australian and New Zealand Society of Indexers


  1. Hey Olena,

    Thanks for these links! I also liked the earlier post explaining the difference between all these terms. :)

    I've just put up a very simple term extraction web service using a free Python package. I've also added a link to this blog for anyone interested in finding out more. I was hoping to use either Kea or Maui but having to work from documents on disk made things difficult. :)



  2. Hi Keyvan,

    cool demo, will add it to the list, as well as the other links to term extraction software you posted on your blog.


  3. Thanks a lot for sharing the resources