Tuesday, December 1, 2009

Analyzing text? Think beyond words!

In a previous post, I discussed how words that appear in a document can be visualized using
  • word nets, where words are nodes and their co-occurrences are links;
  • tag clouds, where font size indicates word frequency, but no relations are shown;
  • phrase nets, which show how words are connected by patterns like "X of Y";
  • stream graphs, which show how word usage changes over time.
All these approaches concentrate on words rather than meaningful entities (concepts, named entities, semantic relations).

Using single words is problematic, but there are solutions:

Problem 1: Ambiguity. Single words are limited in what they can express; e.g., python can mean either the animal or the programming language. When relations are not visible, the meaning gets lost.
Possible solution: Think in relations rather than single words. Semantic relations must be shown to disambiguate the meaning. They can be derived from corpora using co-occurrence statistics, or from vocabularies, thesauri, and ontologies.
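
Here is a minimal sketch of the statistical route in plain Python (the window size of 5 and the toy sentence are my own illustrative choices, not from any particular tool): count how often two words appear near each other, and treat the strongest pairs as candidate relations.

    from collections import Counter

    def cooccurrences(tokens, window=5):
        """Count how often two words appear within `window` tokens of each other."""
        pairs = Counter()
        for i, word in enumerate(tokens):
            for neighbor in tokens[i + 1 : i + window]:
                if neighbor != word:
                    # Sort the pair so (a, b) and (b, a) count as one relation.
                    pairs[tuple(sorted((word, neighbor)))] += 1
        return pairs

    tokens = "the python swallowed its prey while the python coiled".split()
    for (a, b), count in cooccurrences(tokens).most_common(3):
        print(a, b, count)

In a real setting the counts would come from a whole corpus, and the raw frequencies would usually be replaced by an association measure before drawing links.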

Problem 2: Synonymy. Words that mean the same thing are often visualized as several separate entities (US, U.S., usa, u.s.a., unitedstates), which again obscures the meaning.
Possible solution: Think in concepts rather than single words. The minimal solution for working with English is to remove plurals and possessives: look into stemming and implement normalization routines. A more advanced solution is to use synonyms. Controlled vocabularies are constructed manually and help link synonymous words; synonymy can also be computed statistically.
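
A minimal normalization sketch in plain Python (the suffix rules are illustrative, not a full stemmer, and the SYNONYMS table is a hypothetical hand-made vocabulary):

    def normalize(word, synonyms):
        """Lowercase, drop dots and possessives, map known variants, strip plurals."""
        w = word.lower().replace(".", "").replace("'s", "").replace("'", "")
        if w in synonyms:                        # known spelling variant -> one concept
            return synonyms[w]
        if w.endswith("ies") and len(w) > 4:     # bodies -> body
            w = w[:-3] + "y"
        elif w.endswith("s") and len(w) > 3:     # pythons -> python
            w = w[:-1]
        return w

    # Hypothetical controlled vocabulary linking variants to a single concept.
    SYNONYMS = {"us": "united states", "usa": "united states",
                "unitedstates": "united states"}

    for token in ["US", "U.S.", "usa", "u.s.a.", "unitedstates", "pythons"]:
        print(token, "->", normalize(token, SYNONYMS))

All five spellings of United States collapse into one concept, which is exactly what the visualization needs.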

Problem 3: Compounds. Many approaches tokenize text at white spaces, which corrupts the meaning of compound phrases and multi-word expressions. New Zealand becomes zealand and new; data mining turns into mining and data; hotdog into dog and hot.
Possible solution: Think in n-grams rather than single words. When splitting a text into elements, be careful where you split. If two- and three-word phrases (2-grams and 3-grams) are included in the analysis, their scores need an additional boost. Alternatively, one can use vocabularies of named entities and domain-relevant phrases, or even named-entity extraction algorithms.
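
A minimal n-gram scoring sketch in plain Python (the BOOST factors are arbitrary placeholders of mine; in practice they would be tuned or replaced by an association measure):

    from collections import Counter

    # Hypothetical boost factors: longer phrases are rarer, so weight them higher.
    BOOST = {1: 1.0, 2: 2.0, 3: 3.0}

    def ngram_scores(tokens, max_n=3):
        """Score 1- to max_n-grams; multi-word phrases get an extra boost."""
        scores = Counter()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                phrase = " ".join(tokens[i : i + n])
                scores[phrase] += BOOST[n]
        return scores

    tokens = "data mining in new zealand".split()
    for phrase, score in ngram_scores(tokens).most_common(5):
        print(score, phrase)

This way "new zealand" and "data mining" survive as phrases instead of being shredded into their parts.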

There are many more solutions than described here, and it is hard to judge which one is ideal. Statistics relies on the availability of large text collections, whereas vocabularies and thesauri are bound to be incomplete and are not always available. A universal tool that solves the problems of single words with a minimum of requirements is yet to be implemented. If you know of one, let me know and I will post it here.