Tuesday, December 1, 2009

Analyzing text? Think beyond words!

In a previous post, I discussed how words that appear in a document can be visualized using
  • word nets, where words are nodes and their co-occurrences are links;
  • tag clouds, where font size indicates word frequency, but no relations are shown;
  • phrase nets, which show how words are connected by patterns like "X of Y";
  • stream graphs, which show how word usage changes over time.
All these approaches concentrate on words rather than meaningful entities (concepts, named entities, semantic relations).

Using single words is problematic, but there are some solutions:

Problem 1: Ambiguity. Single words are limited in what they express, e.g. python can mean the animal or the programming language. Where relations are not visible, the meaning gets lost.
Possible solution: Think in relations rather than single words. Semantic relations must be shown to disambiguate the meaning. They can be derived from corpora using co-occurrence statistics, or from vocabularies, thesauri, and ontologies.
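
To make the co-occurrence idea concrete, here is a minimal sketch that counts which words share a sentence with a target word. The function name, the toy sentences and the tiny stopword list are all made up for illustration; a real system would use a large corpus and a proper association measure rather than raw counts.

```python
import re
from collections import Counter

def cooccurrences(sentences, target, stopwords=frozenset()):
    """Count words that appear in the same sentence as `target`."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenizer: lowercase alphabetic runs only.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if target in tokens:
            counts.update(
                t for t in tokens if t != target and t not in stopwords
            )
    return counts

# Toy data, invented for this example.
sentences = [
    "The python swallowed a rat in the zoo.",
    "I wrote the script in Python and ran the code.",
    "A python is a snake found in the zoo.",
]
STOPWORDS = frozenset({"the", "a", "in", "i", "is", "and"})

neighbours = cooccurrences(sentences, "python", STOPWORDS)
print(neighbours.most_common(3))
```

Neighbours such as "zoo" and "snake" point to the animal, while "script" and "code" point to the programming language. Showing these links next to the word is what lets a reader tell the two senses apart.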

Problem 2: Synonymy. Words that mean the same thing are often visualized as several entities (US, U.S., usa, u.s.a., unitedstates), which again neglects the meaning.
Possible solution: Think in concepts rather than single words. The minimum solution for working with English is to remove plurals and possessives. Look into stemming and implement normalization routines. The more advanced solution is to use synonyms. Controlled vocabularies are constructed manually and help link synonymous words. Synonymy can also be computed statistically.
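
As a rough illustration of such a normalization routine, the sketch below strips possessives, removes dots, looks the result up in a small synonym table, and falls back to naive plural removal. The `SYNONYMS` table and the `normalize` function are hypothetical and hand-made for this example; in practice the table would come from a controlled vocabulary, and a real stemmer would replace the plural rule.

```python
import re

# Hypothetical synonym table; a controlled vocabulary would supply this.
SYNONYMS = {
    "us": "united states",
    "usa": "united states",
    "unitedstates": "united states",
}

def normalize(token):
    """Map a surface form to a rough concept label."""
    # Strip a trailing possessive: "cloud's" -> "cloud".
    stripped = re.sub(r"['’]s$", "", token)
    # Drop dots and case: "U.S.A." -> "usa".
    key = stripped.replace(".", "").lower()
    # Keep the lowercase pronoun "us" apart from the country "US".
    if key in SYNONYMS and stripped != "us":
        return SYNONYMS[key]
    # Naive plural removal; it mangles words like "analysis",
    # which is why a real stemmer is the better choice.
    if len(key) > 3 and key.endswith("s") and not key.endswith("ss"):
        key = key[:-1]
    return key

variants = ["US", "U.S.", "usa", "u.s.a.", "unitedstates"]
print({normalize(v) for v in variants})
```

With this table, all five spellings from the example above collapse into a single entity, so a tag cloud would show one large "united states" instead of five small variants.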

Problem 3: Compounds. Many approaches tokenize text at whitespace, which corrupts the meaning of compound phrases and multi-word expressions. New Zealand becomes zealand and new; data mining turns into mining and data; hot dog into dog and hot.
Possible solution: Think in n-grams rather than single words. When splitting a text into elements, be careful where you split. If two- and three-word phrases (2-grams and 3-grams) are included in the analysis, their scores need to be additionally boosted. Alternatively, one can use vocabularies of named entities and domain-relevant phrases, or even named entity extraction algorithms.
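
A minimal sketch of the n-gram idea follows. It splits the text into sentences first, so that phrases never span a sentence boundary, and then scores 1-, 2- and 3-grams with a length-based boost. The function name and the boost factor of 1.5 are arbitrary assumptions for illustration, not recommended values.

```python
import re
from collections import Counter

def ngram_scores(text, max_n=3, boost=1.5):
    """Score 1..max_n-grams; longer phrases get a higher weight per hit."""
    scores = Counter()
    # Split into sentences first so n-grams do not cross boundaries.
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                # Each occurrence counts boost**(n-1): 1.0, 1.5, 2.25.
                scores[gram] += boost ** (n - 1)
    return scores

# Toy text, invented for this example.
text = (
    "New Zealand offers data mining jobs. "
    "Data mining in New Zealand is growing."
)
scores = ngram_scores(text)
print(scores["new zealand"], scores["new"])
```

Without the boost, "new zealand" could never score higher than "new" or "zealand", because every occurrence of the phrase is also an occurrence of each of its words. The boost is what lets the phrase outrank its parts.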

There are many more solutions than described here, and it is hard to judge which one is ideal. Statistical methods rely on the availability of text collections, whereas vocabularies and thesauri are bound to be incomplete and are not always available. A universal tool that would solve the problems of single words with a minimum of requirements is yet to be implemented. If you know of one, let me know and I will post it here.

5 comments:

  1. Hi Alyona,

    I wrote an IEEE Computing article on this topic a few years back: http://doi.ieeecomputersociety.org/10.110910.1109/MC.2005.375 . I also published an online version of the tool at http://ultimate-research-assistant.com/ . My approach is to use multi-word keyphrases, where keyphrases are phrases combining only those words occurring frequently in the text (minus stopwords, of course). I've gotten excellent results from this approach. Check it out and let me know what you think.

  2. Hi Andy,
    I like your research assistant tool! Interesting way of showing query-relevant concepts and good to see that "new zealand" and "bungy jumping" are not separated.
    Your approach unfortunately has the disadvantage of choosing typical collocations that do not represent concepts, e.g. "next time" or "taking place". Also, I hope you can make it run more efficiently.
    But in general, the results are good!

  3. Hi Alyona,
    Thanks for giving it a "test drive." I appreciate the feedback and concur with your observations, and will definitely work on making the suggested improvements. If you have any other suggestions or feedback, please feel free to contact me. Again, thanks!

  4. I totally agree about 'moving on' from words to capturing concepts and the relations which connect them! Good point. You are sounding like an ontologist :-)
