Tuesday, July 6, 2010

Demo of term assignment & keyphrase extraction with Maui

It's been a while since my last post on this blog, but in the meantime Maui wasn't put to rest.

The most important news is that there is a new demo of Maui on Google AppEngine.

The main purpose of this demo is to show how Maui assigns terms from controlled vocabularies to documents. (This task is similar to text categorization with a large number of categories.) The documents can be in text, Microsoft Word, or PDF format, and there are two vocabularies to choose from: physics or agriculture.

The demo also shows how Maui extracts keyphrases. In this case, Maui was trained on 180 Computer Science documents used in the SemEval-2010 keyphrase extraction track.

Some more information on this demo can be found in my recent publication Subject Metadata Support Powered by Maui. It was co-authored with Ian H. Witten and Vye Perrone and presented at the Joint Conference on Digital Libraries in Australia last month.

Some technical notes: AppEngine has a few restrictions that don't allow me to demo Maui's full functionality. For example, very large vocabularies cannot be uploaded to appspot, although they can be used when the demo runs locally. Also, the way Wikipedia is used in Maui is not suitable for the AppEngine framework.

Sunday, January 24, 2010

New version of Maui available

This weekend I put together Maui 1.2, a new version that incorporates changes requested by several users, as well as some of their great ideas.

Here is a list of changes in Maui 1.2:
  1. Input files are now read using the Apache Commons IO package. This makes the data reading part around 10 times faster and also saves many lines of code.

  2. Vocabularies are now stored in GZip format (as *.gz) and are read in using GZIPInputStream. This saves a lot of space, because the SKOS format tends to repeat the same characters over and over. In fact, the vocabularies are now so small that I could easily supply them within the distribution. The SKOS files in data/vocabularies were created by cutting out all information irrelevant to Maui from the original files and compressing them.

  3. Stopwords are now initialized from a supplied file, rather than from a hard-coded list. Users can thus supply their own stopwords and blacklisted terms.

  4. I wrote a new class MauiWrapper that shows how to apply Maui to a single text file, or a text string. Another new class MauiWrapperFactory shows how to use MauiWrapper with several vocabularies at the same time. These classes make it easy to create web services that use Maui for identifying main topics from text supplied by the web client.

  5. Finally, I have also generated a few models in the data/models directory for those who don't have their own training data.

Thanks to Florian Ilgenfritz at Txtr and Jose R. Pérez-Agüera at UNC for their helpful suggestions!
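
For readers who want to see what items 2 and 3 above amount to in practice, here is a minimal sketch using only the Java standard library. It is not Maui's actual code: the class name ResourceLoadingSketch, its method names, and the one-word-per-line stopword format are all assumptions made for illustration. It shows a gzip-compressed text file being written and read back through GZIPInputStream, and a stopword set being built from the lines of a user-supplied file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class for illustration only; it is not part of Maui.
final class ResourceLoadingSketch {

    /** Writes text to a gzip-compressed UTF-8 file (for example a *.gz vocabulary). */
    static void writeGzipText(Path gzFile, String text) {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gzFile))) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a gzip-compressed UTF-8 file back into a string (readAllBytes needs Java 9+). */
    static String readGzipText(Path gzFile) {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(gzFile))) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text into a temporary .gz file, reads it back, then deletes the file. */
    static String gzipRoundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("vocabulary", ".rdf.gz");
            try {
                writeGzipText(tmp, text);
                return readGzipText(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a stopword set from lines: trimmed, lower-cased, blank lines skipped. */
    static Set<String> parseStopwords(List<String> lines) {
        Set<String> stopwords = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim().toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }

    /** Loads stopwords from a user-supplied UTF-8 file, assuming one word per line. */
    static Set<String> loadStopwords(Path file) {
        try {
            return parseStopwords(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String skos = "<skos:Concept><skos:prefLabel>soil</skos:prefLabel></skos:Concept>";
        // The text read back from the .gz file should equal the original.
        System.out.println(gzipRoundTrip(skos).equals(skos));
        System.out.println(parseStopwords(List.of("The", "  of ", "", "and")));
    }
}
```

Keeping the parsing step (parseStopwords) separate from the file access (loadStopwords) is a design choice of this sketch: it lets the same logic run on lines that come from a file, a classpath resource, or a web request.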