Topic indexing blog: July 2009

Wednesday, July 29, 2009

Wiki pages about topic indexing

Here is a list of updates on Maui's Google Code page, which will hopefully make it easier for others to use Maui and experiment with its data sets:

The Download Maui page now lists the new version (1.1) and several corpora that can be used for creating and testing topic indexing models.
A step-by-step installation guide shows how to download, install and use Maui.
There is also a full page of examples of automatically generated topics.
The Multiply indexed data wiki page explains three data sets in which topics were assigned to the same document by multiple people. These data sets are very useful for evaluation. The page also explains how to measure inter-indexer consistency, using a simple example.
The Resources for keyphrase extraction and term assignment page lists further useful data sets.
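
As a rough illustration of the inter-indexer consistency mentioned above, here is a small sketch using Rolling's measure, 2C / (A + B), where C is the number of topics two indexers agree on and A and B are the sizes of their topic sets. The choice of measure, the function name and the sample topics are assumptions made for this example; the wiki page may use a different formula or different data.

```python
def rolling_consistency(topics_a, topics_b):
    """Rolling's consistency: 2C / (A + B) for two indexers' topic sets."""
    a, b = set(topics_a), set(topics_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical topics assigned to the same document by two people.
indexer_1 = {"milk", "dairy products", "trade"}
indexer_2 = {"milk", "trade", "australia", "exports"}

# Two shared topics out of 3 + 4 assigned ones: 2 * 2 / 7.
print(round(rolling_consistency(indexer_1, indexer_2), 3))
```

Topics are compared here as exact strings; in practice one would probably normalize case or map topics to vocabulary IDs first.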

Let me know if something is missing.

Thursday, July 16, 2009

Updated release and French term assignment example

It turns out I had two pending issues on Google Code, where I host the Maui algorithm. By default, the project owner does not get a notification!

So today I went ahead and fixed one of the requests: to have everything in a jar file.
I've also updated the example files and added Javadoc documentation. Soon I will publish detailed installation instructions (in addition to the usage instructions), but for now here is just one command-line example. It shows how to create a topic indexing model and apply it to new documents, using term assignment with French documents as the example. Download the latest release of Maui (1.1) and then try this:

java -Xmx1024m -classpath maui-1.0.jar maui.main.FrenchExample

If the Java classpath is not yet linked to Maui's libraries, add this after maui-1.0.jar:

":lib/weka.jar:lib/wikipediaminer1.1.jar:lib/trove.jar:lib/jena.jar:lib/icu4j_3_4.jar:lib/iri.jar:lib/xercesImpl.jar:lib/snowball.jar:lib/mysql-connector-java-3.1.13-bin.jar:lib/maxent-2.4.0.jar:lib/commons-logging.jar"
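
The colon-separated list above is the Unix form; on Windows, Java expects a semicolon as the classpath separator. Below is a small sketch of how the same command could be assembled in a platform-independent way. It assumes the jar names shown in this post and a lib directory next to maui-1.0.jar; the helper name build_command is made up for this example.

```python
import os

# Library jars exactly as listed in the classpath above.
LIB_JARS = [
    "weka.jar", "wikipediaminer1.1.jar", "trove.jar", "jena.jar",
    "icu4j_3_4.jar", "iri.jar", "xercesImpl.jar", "snowball.jar",
    "mysql-connector-java-3.1.13-bin.jar", "maxent-2.4.0.jar",
    "commons-logging.jar",
]


def build_command(main_jar="maui-1.0.jar", lib_dir="lib", sep=os.pathsep):
    """Return the java command as an argument list (hypothetical helper)."""
    # Forward slashes are used on purpose: Java accepts them on all platforms.
    jars = [main_jar] + [f"{lib_dir}/{name}" for name in LIB_JARS]
    return ["java", "-Xmx1024m", "-classpath", sep.join(jars),
            "maui.main.FrenchExample"]


# Show the Unix form of the command.
print(" ".join(build_command(sep=":")))
```

The resulting list could be passed to subprocess.run(build_command(), check=True) from the directory that contains the jar file.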

The output should be something like:

-- Building the model...
--- Loading the vocabulary...
--- Building the Vocabulary index from the SKOS file...
...
-- Reading the input documents...
...
--- Computing candidates...
...
--- Building classifier
...
-- Extracting keyphrases...
-- Keyphrases and feature values:
http://www.fao.org/aos/agrovoc#c_4830,'Produit laitier',0,0,0.003276,0.00335,...,False
http://www.fao.org/aos/agrovoc#c_7848,Commerce,0,0,0.003348,...,2,True
http://www.fao.org/aos/agrovoc#c_3919,'Commerce international',...,True
http://www.fao.org/aos/agrovoc#c_4826,Lait,...,False
http://www.fao.org/aos/agrovoc#c_8288,Volume,...,False
http://www.fao.org/aos/agrovoc#c_25201,Usine,...,False
http://www.fao.org/aos/agrovoc#c_714,Australie,...,False
http://www.fao.org/aos/agrovoc#c_8323,'Besoin en eau',...,False
-- 2.0 correct

-- Evaluation results based on 1 document:
Avg. number of correct keyphrases per document: 2 +/- 0
Precision: 25 +/- 0
Recall: 13.33 +/- 0
F-Measure: 17.39

For each test document (in this case, just one is used), Maui outputs, for every extracted topic, its Agrovoc ID, the name of the concept (e.g. Besoin en eau), some feature values and True or False, depending on whether this term has been assigned to the document by a human indexer. The evaluation is based on these values. Because the directory already contains a .key file, Maui does not overwrite it; otherwise it would create one with the automatically generated topics.
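
The evaluation figures can be reproduced from the output listed above. The sketch below parses the comma-separated lines (keeping the elided "..." fields exactly as printed), counts the terms flagged True, and computes precision, recall and F-measure. The number of manually assigned terms (15) is not printed in this output; it is inferred from the reported recall of 13.33. The function names are made up for this example and are not part of Maui.

```python
import csv

OUTPUT_LINES = """\
http://www.fao.org/aos/agrovoc#c_4830,'Produit laitier',0,0,0.003276,0.00335,...,False
http://www.fao.org/aos/agrovoc#c_7848,Commerce,0,0,0.003348,...,2,True
http://www.fao.org/aos/agrovoc#c_3919,'Commerce international',...,True
http://www.fao.org/aos/agrovoc#c_4826,Lait,...,False
http://www.fao.org/aos/agrovoc#c_8288,Volume,...,False
http://www.fao.org/aos/agrovoc#c_25201,Usine,...,False
http://www.fao.org/aos/agrovoc#c_714,Australie,...,False
http://www.fao.org/aos/agrovoc#c_8323,'Besoin en eau',...,False
""".splitlines()


def parse_topics(lines):
    """Return (concept id, concept name, assigned by a human?) per line."""
    # Concept names with spaces are wrapped in single quotes in the output.
    rows = csv.reader(lines, quotechar="'")
    return [(row[0], row[1], row[-1] == "True") for row in rows]


def evaluate(num_correct, num_extracted, num_manual):
    """Precision, recall and F-measure, all as percentages."""
    precision = 100.0 * num_correct / num_extracted
    recall = 100.0 * num_correct / num_manual
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


topics = parse_topics(OUTPUT_LINES)
correct = sum(1 for _, _, is_correct in topics if is_correct)
# Assumption: 15 manually assigned terms, inferred from "Recall: 13.33".
precision, recall, f_measure = evaluate(correct, len(topics), num_manual=15)
print(f"{correct} correct, P={precision:.2f}, R={recall:.2f}, F={f_measure:.2f}")
# prints "2 correct, P=25.00, R=13.33, F=17.39"
```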

This is of course just a demonstration: the model is trained on just two documents and tested on a third. But at least it shows (I hope!) how simple Maui's usage can be.

Monday, July 13, 2009

Useful web resources related to automatic topic indexing

Tools (in alphabetical order):
for keyword and keyphrase extraction, tagging, autotagging, terminology extraction, term assignment, text classification, text categorization, topical metadata extraction and topic indexing with Wikipedia

Bibclassify - A module in CDS Invenio (CERN’s document server software) for automatic assignment of terms from SKOS vocabularies, developed on the High Energy Physics vocabulary. Developed in a collaboration between CERN and DESY. There is also a hacking guide.
Extractor - Commercial software for keyword extraction in different languages. There is also a demo. Developed at the National Research Council of Canada.
Keyphrase extraction algorithm Kea. Can be used for both automatic keyphrase extraction and term assignment with controlled vocabularies. Developed at the University of Waikato.
Multi-purpose topic indexing algorithm Maui. Suitable for automatic term assignment, subject indexing, keyword extraction, keyphrase extraction, indexing with Wikipedia, autotagging and terminology extraction. Developed at the University of Waikato. Maui is also available on SourceForge.
TerMine - a term extraction tool developed at the National Centre for Text Mining.
Topia term extractor - Part-of-speech and frequency based term extraction tool implemented in Python. Here is a term extraction demo based on this tool, implemented by FiveFilters.org.
Orchestr8 Keyword Extraction - An API-based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages.
Wikifier – An online demo of detecting Wikipedia articles in text, developed by the Language and Information Technologies research group at the University of North Texas.
Wikipedia Miner – An API for accessing Wikipedia data, which also provides a tool for mapping any document to a set of relevant Wikipedia articles, similar to indexing with Wikipedia. Developed at the University of Waikato. Demo 1 and demo 2.
SEO keyword extraction - An online keyword and keyphrase extraction tool for search engine optimization
Scorpion – OCLC’s tool for automatic classification of documents.
Tagthe.net – A demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations.
Yahoo term extraction - Web-service based content analysis via term extraction, includes a demo.

Vocabularies in SKOS format and test data:

Library of Congress Subject Headings LCSH
Medical Subject Headings thesaurus MeSH
FAO’s agricultural thesaurus Agrovoc: general info and download site.
List of other SKOS thesauri at FAO
DESY’s High Energy Physics HEP thesaurus
W3C’s list of SKOS thesauri
Maui’s datasets
Keyphrase extraction data set

More resources:

NLM Indexing Initiative – Website about National Library of Medicine’s project on automatic indexing using MeSH terms. Research details, evaluation and examples.
Dublin Core tools – A list of tools for automatic extraction of Dublin Core metadata
ASI resources – List of commercial software for back-of-the-book indexing by the American Society for Indexing
ANZSI resources – List of software tools provided by the Australian and New Zealand Society of Indexers

Wednesday, July 8, 2009

What do subject indexing, keyphrase extraction and autotagging have in common? Terminology clarification

There has been a lot of confusion about tasks related to topic indexing. Here is an overview of these tasks, the terms used to refer to them and what they stand for.

Text categorization (or: text classification) - A document is assigned very few general categories, such as Politics or News, usually from a relatively small vocabulary.
Term assignment (or: subject indexing) - A document's main topics are expressed using terms from a large vocabulary, e.g. a domain-specific thesaurus.
Keyphrase extraction (or: keyword extraction, key term extraction) - A document's main topics are expressed using the most prominent words and phrases in the document.
Terminology extraction (similar to back-of-the-book indexing) - All domain relevant words and phrases are extracted from a document.
Full-text indexing (or: full indexing, free text indexing) - All words and phrases, sometimes excluding the stopwords, are extracted from a document.
Keyphrase indexing (or: keyphrase assignment) - A general term, which refers to both term assignment and keyphrase extraction.
Tagging (or: collaborative tagging, social tagging and, when performed automatically, autotagging or automatic tagging) - The user defines as many topics as desired. Any word or phrase can serve as a tag. Mostly used on collaborative websites.
Clustering is related to topic indexing in that it identifies groups of documents on the same topic; however, these groups are unlabeled.
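
To make the difference between three of these tasks concrete, here is a toy sketch. It treats single words as candidates, uses raw frequency as a crude stand-in for "most prominent", and relies on a tiny hand-made vocabulary; all names, the stopword list and the sample text are assumptions made for this illustration. Real systems such as Maui rank candidates with a trained model rather than by frequency alone.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "in", "is", "a", "for", "to"}

# Toy controlled vocabulary (hypothetical): surface form -> preferred term.
VOCABULARY = {
    "milk": "Milk",
    "dairy": "Dairy products",
    "exports": "International trade",
    "trade": "International trade",
}


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def full_text_index(text):
    """Full-text indexing: every word except stopwords is kept."""
    return sorted(set(tokens(text)) - STOPWORDS)


def extract_keyphrases(text, top_n=2):
    """Keyphrase extraction: the most frequent non-stopwords in the text."""
    counts = Counter(t for t in tokens(text) if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def assign_terms(text):
    """Term assignment: only terms from the vocabulary can be chosen."""
    return sorted({VOCABULARY[t] for t in tokens(text) if t in VOCABULARY})


doc = ("Milk exports grew. Dairy farms in Australia increased "
       "milk exports and milk trade.")

print(full_text_index(doc))
print(extract_keyphrases(doc))
print(assign_terms(doc))
```

Note that term assignment can return a term such as "International trade" that never appears in the document, whereas keyphrase extraction can only return words found in the text.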