Sunday, January 24, 2010

New version of Maui available

This weekend I put together Maui 1.2, a new version that incorporates changes requested by several users, as well as some of their great ideas.

Here is a list of changes in Maui 1.2:
  1. Input files are now read using the Apache Commons IO package. This makes the data-reading part around 10 times faster and also saves many lines of code.

  2. Vocabularies are now stored in gzip format (as *.gz) and are read using GZIPInputStream. This saves a lot of space, because the SKOS format tends to repeat the same strings over and over. In fact, the vocabularies are now so small that I could easily include them in the distribution. The SKOS files in data/vocabularies were created by stripping all information irrelevant to Maui from the original files and then compressing them.

  3. Stopwords are now initialized from a supplied file rather than from a hard-coded list. Users can thus supply their own stopwords and blacklisted terms.

  4. I wrote a new class, MauiWrapper, that shows how to apply Maui to a single text file or a text string. Another new class, MauiWrapperFactory, shows how to use MauiWrapper with several vocabularies at the same time. These classes make it easy to create web services that use Maui to identify the main topics in text supplied by a web client.

  5. Finally, I have also generated a few models in the data/models directory for those who don't have their own training data.
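
For readers curious about what changes 2 and 3 look like in practice, here is a minimal sketch using only the Java standard library. It is not Maui's actual code: the class name CompressedResources and both method names are hypothetical, and the sketch assumes UTF-8 text and a stopword file with one word per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class for illustration; not part of the Maui distribution.
class CompressedResources {

    // Reads a gzip-compressed text file (for example a *.gz vocabulary)
    // line by line, decoding it as UTF-8 (an assumption of this sketch).
    static List<String> readGzipLines(Path gzFile) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    // Loads stopwords from a plain-text file, assuming one word per line.
    // Surrounding whitespace is trimmed and blank lines are skipped.
    static Set<String> loadStopwords(Path stopwordFile) throws IOException {
        Set<String> stopwords = new HashSet<>();
        for (String line : Files.readAllLines(stopwordFile, StandardCharsets.UTF_8)) {
            String word = line.trim();
            if (!word.isEmpty()) {
                stopwords.add(word);
            }
        }
        return stopwords;
    }
}
```

Wrapping the file stream in GZIPInputStream means the vocabulary is decompressed on the fly while it is read, so no uncompressed copy has to be kept on disk. Keeping the stopwords in a set makes the "is this a stopword?" check a simple contains() call.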

Thanks to Florian Ilgenfritz at Txtr and Jose R. Pérez-Agüera at UNC for their helpful suggestions!