Sunday, January 24, 2010

New version of Maui available

This weekend I put together Maui 1.2, a new version that incorporates changes requested by several users, as well as some of their great ideas.

Here is a list of changes in Maui 1.2:
  1. Input files are now read using the Apache Commons IO package. This makes the data reading part around 10 times faster and also saves many lines of code (see the sketch after this list).

  2. Vocabularies are now stored in GZip format (as *.gz) and are read in using GZIPInputStream. This saves a lot of space, because the SKOS format tends to repeat the same characters over and over. In fact, the vocabularies are now so small that I could easily supply them within the distribution. The SKOS files in data/vocabularies were created by cutting out all information that is irrelevant to Maui from the original files and compressing them.

  3. Stopwords are now initialized from a supplied file, rather than a hardcoded list. Users can thus supply their own stopwords and blacklisted terms.

  4. I wrote a new class, MauiWrapper, that shows how to apply Maui to a single text file or a text string. Another new class, MauiWrapperFactory, shows how to use MauiWrapper with several vocabularies at the same time. These classes make it easy to create web services that use Maui to identify the main topics in text supplied by a web client (a usage sketch appears at the end of this post).

  5. Finally, I have also generated a few models in the data/models directory for those who don't have their own training data.
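
For those curious what changes 1-3 look like in code, here is a minimal sketch of reading an input document with Commons IO, loading a stopword file, and streaming a gzipped SKOS vocabulary. The file names and the wrapping class are made up for illustration; only FileUtils, GZIPInputStream and the other JDK classes are the real APIs.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.zip.GZIPInputStream;

    import org.apache.commons.io.FileUtils;

    public class IoSketch {

        public static void main(String[] args) throws IOException {
            // Change 1: read a whole input document in one call via Commons IO.
            String document =
                    FileUtils.readFileToString(new File("data/documents/sample.txt"), "UTF-8");

            // Change 3: load stopwords (and blacklisted terms) from a plain text file,
            // one entry per line, instead of a hardcoded list.
            Set<String> stopwords = new HashSet<String>();
            List<String> entries =
                    FileUtils.readLines(new File("data/stopwords/stopwords_en.txt"), "UTF-8");
            for (String entry : entries) {
                if (entry.trim().length() > 0) {
                    stopwords.add(entry.trim().toLowerCase());
                }
            }

            // Change 2: stream a gzipped SKOS vocabulary through GZIPInputStream.
            BufferedReader skos = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(new FileInputStream("data/vocabularies/my_vocabulary.rdf.gz")),
                    "UTF-8"));
            String line;
            while ((line = skos.readLine()) != null) {
                // ... hand the stream to the SKOS parser here ...
            }
            skos.close();

            System.out.println("Read " + document.length() + " characters and "
                    + stopwords.size() + " stopwords.");
        }
    }
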
Thanks to Florian Ilgenfritz at Txtr and Jose R. Pérez-Agüera at UNC for their helpful suggestions!
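
And here, roughly, is how the new MauiWrapper from change 4 is meant to be used. The constructor arguments and the method name below are assumptions made for illustration rather than the verified API, so please check MauiWrapper and MauiWrapperFactory themselves for the exact signatures.

    // Hypothetical usage sketch; the import, constructor arguments and method name
    // are assumptions, not the verified MauiWrapper API.
    public class WrapperSketch {

        public static void main(String[] args) throws Exception {
            // Assumed constructor: a vocabulary from data/vocabularies and a trained
            // model from data/models.
            MauiWrapper wrapper = new MauiWrapper("my_vocabulary", "data/models/my_model");

            String text = "Agriculture and forestry dominate the economy of this region...";

            // Assumed method: return the top N topics for a plain text string.
            for (Object topic : wrapper.extractTopicsFromText(text, 10)) {
                System.out.println(topic);
            }
        }
    }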

9 comments:

  1. Alyona,
    I think you mentioned somewhere that some part of Maui only works with the English language. Which part? Is that part open source? Any idea whether it would be easy to modify to support other languages too?

  2. Hi Jouko,

    the approach used in Maui is in principle language-independent. I did a proof of concept during my internship at Google, where I implemented a Maui-like system that works on any language, including Chinese, Hebrew and Russian. However, that was using Google's language tools.

    I have tested the current version of Maui on French, German and Spanish. If you have a stemmer for your language and a list of stopwords, it might work already.

    However, it doesn't implement a text segmenter, which is required for some languages.

    Furthermore, the Wikipedia features are currently computed using only the English Wikipedia, so that's another part that would need to be modified. How easy that is will depend on the language.

    Note that all parts of Maui are open-source.

    Hope this answers your questions!
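
    To make the stemmer-plus-stopwords point a bit more concrete, here is a rough sketch of what configuring Maui for French could look like. The class names and public fields below mirror Maui's KEA-style configuration, but treat them as assumptions rather than a verified API; the paths and vocabulary name are made up.

        // Illustrative only: class names, fields and paths are assumptions,
        // not the verified Maui API. Imports omitted for the same reason.
        public class FrenchModelSketch {

            public static void main(String[] args) throws Exception {
                MauiModelBuilder builder = new MauiModelBuilder();

                // French training documents and a French SKOS vocabulary.
                builder.inputDirectoryName = "data/term_assignment/train_fr";
                builder.vocabularyName = "my_french_vocabulary";
                builder.vocabularyFormat = "skos";
                builder.documentLanguage = "fr";

                // The two language-specific ingredients mentioned above.
                builder.stemmer = new FrenchStemmer();
                builder.stopwords = new StopwordsFrench();

                // ... then build and save the model as in MauiModelBuilder's own main() ...
            }
        }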

  3. Hi. Great work... ;)
    We used Maui in a proof-of-concept project at our work... you can see some details here:

    http://blog.gkudos.com/2010/03/05/observatory-of-presidential-elections-colombia-2010/

    As a suggestion... it would be nice to include a better approach for logging in Maui.
    (In our modified version of Maui we included some logging facilities using log4j... it let us get a better view of the internals of Maui.)

    Great work,
    greetings from Bogotá

  4. Hi Juan,

    cool website! Love the Flex visualization. Glad you could use Maui, and your comments regarding the logging are very true. I will have to add this in the next version.

    Could you please point out where exactly on your demo Maui was used? In "Temas Destacados"? To assign named entities as well?
    Did you use it on Spanish text as well as English? Did you run into any difficulties doing so?

    Thanks in advance,
    Alyona

  5. Hi. Thanks for visiting our site...

    In our demo we used Maui to extract the following information from the news (term assignment using an ontology):

    - Politicians (presidential candidates, the vice president and pre-candidates)
    - Political parties
    - Relevant keywords (temas destacados), for example: poverty, economy, employment, education, security, corruption...

    We only used Spanish in our demo (in order to reach a broader audience of readers in our country).
    In general the biggest issues were ambiguity and synonymy (the common problems for everyone...).

    But there were a couple of "funny issues" we found during our experiment:

    - How do you extract the most relevant terms from limited information sources such as RSS news or Twitter feeds? (Few RSS sources include the full text of the online newspapers...)

    - Due to the lack of an official political thesaurus, we created a little ontology (using SKOS) reflecting some aspects of the current presidential campaign in Colombia (using names, synonyms, nicknames, abbreviations...).
    As you can imagine, our little hand-made ontology is too simple for complex problems, but it was good enough for our proof of concept (25 politicians, 10 political parties and 15 relevant keywords).

    - Our first georeferencing experiment using Maui didn't work well, so we used a gazetteer and an exact dictionary-based chunking approach (the Aho-Corasick algorithm as implemented by LingPipe); a sketch follows below.
    We didn't have enough time to try a more sophisticated location extraction algorithm using more advanced NLP techniques.
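
    A minimal sketch of the exact dictionary-based chunking technique mentioned above, using LingPipe's ExactDictionaryChunker; the dictionary entries here are made-up examples to illustrate the approach, not our actual gazetteer or code:

        import com.aliasi.chunk.Chunk;
        import com.aliasi.chunk.Chunking;
        import com.aliasi.dict.DictionaryEntry;
        import com.aliasi.dict.ExactDictionaryChunker;
        import com.aliasi.dict.MapDictionary;
        import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

        public class GazetteerSketch {

            public static void main(String[] args) {
                // A tiny dictionary of place names (entries are made up).
                MapDictionary<String> dictionary = new MapDictionary<String>();
                dictionary.addEntry(new DictionaryEntry<String>("Bogotá", "LOCATION", 1.0));
                dictionary.addEntry(new DictionaryEntry<String>("Medellín", "LOCATION", 1.0));

                // Exact dictionary chunker (Aho-Corasick under the hood):
                // return all matches, case-insensitive.
                ExactDictionaryChunker chunker = new ExactDictionaryChunker(
                        dictionary, IndoEuropeanTokenizerFactory.INSTANCE, true, false);

                String text = "El candidato visitó Bogotá y Medellín durante la campaña.";
                Chunking chunking = chunker.chunk(text);
                for (Chunk chunk : chunking.chunkSet()) {
                    System.out.println(text.substring(chunk.start(), chunk.end())
                            + " -> " + chunk.type());
                }
            }
        }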

    The main goal of our experiment was to create a subject-specific, easy-to-use application combining text analysis, spatial databases and a web-based visualization.
    Our challenge (in spite of our informal approach) was to be able to develop this kind of project with few resources and a tight schedule.

    We learned a lot reviewing your work...
    We hope to have more time to gain a deeper understanding of the subject so we can apply it in future projects.

    Thanks for your time and patience.

    Juan Carlos

  6. As you said, your system is "language independent".

    I want to know the exact criteria for calling a system "language independent".

    CASE 1: when we are using a language-specific parser, stemmer and stopwords.

    CASE 2: when we are using only stopwords and stemmers.

    Which of these two cases would you consider closer to language independence? And what level of language independence does the Maui tool use?

  7. It's a good question.

    In my opinion, when someone says "a system is language-independent", they imply that not much work is required to get the system running on a new language.

    In your categories, I would say Case 2 is definitely "language independent", because it is easy to compile a list of stopwords, and stemmers are generic resources that are available for many languages. For most tasks, Maui does not require more than that.

    An example of a language-dependent approach would be a system that relies on resources that are only available in a particular language, such as a complex syntactic parser or a lexical database.

  8. Hi Madam,

    But the Maui system uses large vocabularies covering different domains, as you mentioned in the system description. So can it be phrased like this: the system itself is language independent, i.e. it uses only stopwords and stemmers, but for phrase identification it uses these vocabularies, and for extracting the final set of keyphrases it uses Naive Bayes learning? This is just my personal understanding of your system. Am I right?

    Thanks & Regards
    Niraj Kumar

  9. This comment has been removed by a blog administrator.
