Saturday, May 30, 2009

What is Maui about?

The Maui topic indexing algorithm was created as part of my PhD in Computer Science at the University of Waikato.
The title of the PhD thesis is

Human-competitive automatic topic indexing


Here is its abstract, which sums up what the algorithm is about:

Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document’s topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance.

Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples.
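To make the two-stage idea concrete, here is a minimal sketch in Python. It is not Maui's code. The function names, the stopword list, the two properties (frequency and position of first occurrence) and the fixed weights are all assumptions chosen for illustration. In particular, the hand-set weights stand in for the machine-learned model that the abstract describes.

```python
import re

# Illustrative stopword list (an assumption, not Maui's actual list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for",
             "on", "by", "with", "as", "that", "are"}


def select_candidates(text, max_len=3):
    """Stage 1: collect word n-grams that neither start nor end with a stopword.

    Returns a dict mapping each phrase to the word offsets where it occurs,
    together with the total number of words in the text.
    """
    words = re.findall(r"[a-z]+", text.lower())
    candidates = {}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.setdefault(" ".join(gram), []).append(i)
    return candidates, len(words)


def rank_candidates(candidates, total_words, weights=(1.0, 0.5)):
    """Stage 2: score each candidate from its properties and sort by score.

    The weights are fixed by hand here; a learned model would replace them.
    """
    w_freq, w_first = weights
    scored = []
    for phrase, positions in candidates.items():
        frequency = len(positions) / total_words
        # Phrases that first appear early in the text score higher.
        first_occurrence = 1.0 - positions[0] / total_words
        scored.append((w_freq * frequency + w_first * first_occurrence, phrase))
    return [phrase for _, phrase in sorted(scored, reverse=True)]


text = "Topic indexing assigns topics. Topic indexing is useful."
candidates, total_words = select_candidates(text)
ranked = rank_candidates(candidates, total_words)
print(ranked[:3])
```

The split into two functions mirrors the two stages: the first decides what could be a topic at all, and the second decides which of those candidates matter most.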

This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is “human-competitive” because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.
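The "human-competitive" claim rests on comparing consistency scores. The abstract does not name the measure, so the sketch below assumes Rolling's consistency, 2C / (A + B), where C is the number of topics two indexers share and A and B are the sizes of their topic sets. The function names and the example topic sets are hypothetical.

```python
def consistency(topics_a, topics_b):
    """Rolling's consistency between two topic sets: 2C / (A + B)."""
    a = {t.lower() for t in topics_a}
    b = {t.lower() for t in topics_b}
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return 2 * len(a & b) / (len(a) + len(b))


def mean_consistency(topics, others):
    """Average consistency of one topic set with each of the other sets."""
    return sum(consistency(topics, o) for o in others) / len(others)


# Hypothetical topic sets for one document.
human_1 = ["topic indexing", "keyphrase extraction", "tagging"]
human_2 = ["topic indexing", "tagging", "wikipedia"]
algorithm = ["topic indexing", "keyphrase extraction", "wikipedia"]

human_vs_human = consistency(human_1, human_2)
algorithm_vs_humans = mean_consistency(algorithm, [human_1, human_2])
print(human_vs_human, algorithm_vs_humans)
```

Under this reading, the algorithm counts as human-competitive when its average consistency with the human indexers is at least as high as the humans' consistency with each other.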

Tuesday, May 19, 2009

Maui - first file release!

Today I submitted the first version of Maui to SourceForge, and I am currently adding it to Google Code. In a later post I plan to write about what Maui does, as well as about my experience with hosting and versioning an open source project on both platforms. But for now, this will serve as the first post on this blog.