Tuesday, July 6, 2010

Demo of term assignment & keyphrase extraction with Maui

It's been a while since my last post on this blog, but in the meantime Maui hasn't been idle.

The most important news is that there is a new demo of Maui on Google AppEngine.

The main purpose of this demo is to show how Maui assigns terms from controlled vocabularies to documents. (This task is similar to text categorization with a large number of categories.) The documents can be in text, Microsoft Word, or PDF format, and there are two vocabularies to choose from: physics or agriculture.

The demo also shows how Maui extracts keywords. In this case, Maui was trained on 180 Computer Science documents used in the SemEval-2010 keyphrase extraction track.

Some more information on this demo can be found in my recent publication Subject Metadata Support Powered by Maui. It was co-authored with Ian H. Witten and Vye Perrone and presented at the Joint Conference on Digital Libraries in Australia last month.

Some technical notes: AppEngine has a few restrictions that don't allow me to demo Maui's full functionality. For example, very large vocabularies cannot be uploaded to appspot, although this is possible when the demo runs locally. Also, the way Wikipedia is used in Maui is not suitable for the AppEngine framework.

Sunday, January 24, 2010

New version of Maui available

This weekend I put together Maui 1.2, a new version that incorporates changes requested by several users, as well as some of their great ideas.

Here is a list of changes in Maui 1.2:
  1. Input files are now read using the Apache Commons IO package. This makes the data reading part around 10 times faster and also saves many lines of code (a short reading sketch appears at the end of this post).

  2. Vocabularies are now stored in GZip format (as *.gz) and are read in using GZIPInputStream. This saves a lot of space, because the SKOS format tends to repeat the same characters over and over. In fact, the vocabularies are now so tiny that I could easily supply them within the distribution. The SKOS files in data/vocabularies were created by cutting out all information irrelevant to Maui from the original files and compressing them.

  3. Stopwords are now initialized from a supplied file, rather than hard-coded. Users can thus use their own stopwords and blacklisted terms.

  4. I wrote a new class MauiWrapper that shows how to apply Maui to a single text file or a text string. Another new class, MauiWrapperFactory, shows how to use MauiWrapper with several vocabularies at the same time. These classes make it easy to create web services that use Maui to identify the main topics in text supplied by a web client.

  5. Finally, I have also generated a few models in the data/models directory for those who don't have their own training data.
Thanks to Florian Ilgenfritz at Txtr and Jose R. Pérez-Agüera at UNC for their helpful suggestions!
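
To make items 1 and 2 above concrete, here is a minimal Java sketch of how such input reading might look. The file paths are hypothetical examples, and the sketch stops short of topic extraction itself; for that, consult the MauiWrapper source in the distribution.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.IOUtils;

public class MauiInputSketch {
    public static void main(String[] args) throws IOException {
        // Item 1: read a whole document in one call with Apache Commons IO
        // (the path is just an example)
        String document = FileUtils.readFileToString(
                new File("data/documents/example.txt"), "UTF-8");

        // Item 2: stream a gzipped SKOS vocabulary without unpacking it first
        // (the file name is hypothetical; use one of the *.gz files in data/vocabularies)
        try (InputStream in = new GZIPInputStream(
                new FileInputStream("data/vocabularies/agrovoc.rdf.gz"))) {
            String skos = IOUtils.toString(in, "UTF-8");
            System.out.println("Vocabulary characters: " + skos.length());
        }

        System.out.println("Document characters: " + document.length());
        // From here, the MauiWrapper class in the distribution takes care of
        // identifying the main topics; see its source for the exact method names.
    }
}
```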

Tuesday, December 1, 2009

Analyzing text? Think beyond words!

In a previous post, I discussed how words that appear in a document can be visualized using
  • word nets, where words are nodes and their co-occurrences are links;
  • tag clouds, where font size indicates word frequency, but no relations are shown;
  • phrase nets, which show how words are connected by patterns like "X of Y";
  • stream graphs, which show how word usage changes over time.
All these approaches concentrate on words rather than meaningful entities (concepts, named entities, semantic relations).

Using single words is problematic, but there are some solutions:

Problem 1: Ambiguity. Single words are limited in what they express; e.g. python can mean the animal or the programming language. Where relations are not visible, the meaning gets lost.
Possible solution: Think in relations rather than single words. Semantic relations must be shown to disambiguate the meaning. They can be derived from corpora using co-occurrence statistics, or from vocabularies, thesauri, and ontologies.
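
As a rough illustration of the corpus-based route, here is a minimal Java sketch that counts how often word pairs co-occur within a small window. The tokenization and window size are simplifying assumptions for the example, not part of any particular tool.

```java
import java.util.HashMap;
import java.util.Map;

public class CooccurrenceSketch {
    // Count how often two words appear within `window` tokens of each other.
    public static Map<String, Integer> cooccurrences(String text, int window) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = i + 1; j < Math.min(i + 1 + window, tokens.length); j++) {
                // Store each pair in alphabetical order so the same pair is counted once
                String pair = tokens[i].compareTo(tokens[j]) < 0
                        ? tokens[i] + " " + tokens[j]
                        : tokens[j] + " " + tokens[i];
                counts.merge(pair, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The python slithered past. Python is also a programming language.";
        cooccurrences(text, 3).forEach((pair, n) -> System.out.println(pair + " -> " + n));
    }
}
```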

Problem 2: Synonymy. Words that mean the same thing are often visualized as several entities (US, U.S., usa, u.s.a., unitedstates), which again neglects the meaning.
Possible solution: Think in concepts rather than single words. The minimum solution for working with English is to remove plurals and possessives: look into stemming and implement normalization routines. The more advanced solution is to use synonyms. Controlled vocabularies are constructed manually and help link synonymous words; synonymy can also be computed statistically.
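
Here is a minimal sketch of that "minimum solution" (lowercasing, stripping possessives and simple plurals). The rules are deliberately crude; a real system would use a proper stemmer or lemmatizer.

```java
public class NormalizeSketch {
    // Very crude normalization: lowercase, drop non-letters (this removes the
    // possessive apostrophe), then strip a trailing plural "s".
    // A real system would use a stemmer (e.g. Porter) or a lemmatizer instead.
    public static String normalize(String word) {
        String w = word.toLowerCase().replaceAll("[^a-z]", "");
        if (w.endsWith("s") && w.length() > 3 && !w.endsWith("ss")) {
            w = w.substring(0, w.length() - 1);   // cats -> cat, but keeps "class"
        }
        return w;
    }

    public static void main(String[] args) {
        // Note that "us" and "usa" remain distinct: true synonymy needs more than normalization.
        String[] variants = {"U.S.", "US", "usa"};
        for (String v : variants) {
            System.out.println(v + " -> " + normalize(v));
        }
        System.out.println("Dog's -> " + normalize("Dog's"));  // -> dog
    }
}
```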

Problem 3: Compounds. Many approaches tokenize text at white spaces, which corrupts the meaning of compound phrases and multi-word expressions. New Zealand becomes zealand and new; data mining turns into mining and data; hotdog into dog and hot.
Possible solution: Think in n-grams rather than single words. When splitting a text into elements, be careful where you split. If two- and three-word phrases (2-grams and 3-grams) are included in the analysis, their scores need to be boosted additionally. Alternatively, one can use vocabularies of named entities and domain-relevant phrases, or even named entity extraction algorithms.
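
And a minimal sketch of n-gram extraction for this last point; the whitespace tokenization and the choice of 1-, 2- and 3-grams are again simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NgramSketch {
    // Collect all n-grams of the given sizes, so "New Zealand" survives as one candidate.
    public static List<String> ngrams(String text, int... sizes) {
        String[] tokens = text.split("\\s+");
        List<String> result = new ArrayList<>();
        for (int n : sizes) {
            for (int i = 0; i + n <= tokens.length; i++) {
                result.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("data mining in New Zealand", 1, 2, 3));
    }
}
```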

There are many more solutions than described here, and it is hard to judge which one is ideal. Statistics relies on the availability of text collections, whereas vocabularies and thesauri are bound to be incomplete and are not always available. A universal tool that solves the problems of single words with a minimum of requirements is yet to be implemented. If you know of one, let me know and I will post it here.

Monday, November 23, 2009

Tag clouds, phrase nets, stream graphs & co

Once the key concepts in a document are known, they can be combined into a more meaningful representation than just a list. This blog post describes different methods for visualizing a document's topics and their semantic relations.

TextArc creates nets of words
TextArc was invented in 2002. The constraints of visualization tools at that time perhaps explain its rather simplistic approach. It first extracts the individual words that appear in a text and stems them. Then it computes the co-occurrences of word pairs and plots them into a very large circular graph, where important words appear in the center and less important ones on the outside of the circle. Co-occurring words are placed next to each other. The graph is interactive: if a user clicks on a word, the connections to other words are activated; see the example below from Alice in Wonderland. However, without activating the words, it is pretty much impossible to read them, and the visual groupings are unclear.

TextArc for Alice in Wonderland
Tag clouds are widespread but far from perfect
Tag clouds were first mentioned in the literature later: according to Google Scholar, in 2003. In 2005 Flickr was launched, which successfully applied tag clouds to the problem of image search and categorization. The simplicity of this model proved infectious, which made tag clouds spread. However, tag clouds are often criticized for their absence of structure and lack of meaning. There have been many attempts to improve them. Wordle creates typographically perfect images of tag clouds. Search for "tag clouds" on Google Scholar without the year restriction, and you will learn how to automatically cluster tags or add hierarchy to a folksonomy.

PhraseNets show which topics are connected by a given relation
PhraseNet is this year's answer to TextArc and tag clouds. It concentrates on specific co-occurrence patterns, such as "X begat Y", or a more generic one like "X of Y". As in conventional tag clouds, frequent terms are shown in a larger font. Words matched to the position of X are colored in dark blue and those matching Y in light blue. A customized graph layout algorithm prevents words from overlapping to ensure readability. Compared to conventional tag clouds, PhraseNet adds meaning to the graphs. The graphs below demonstrate the topical shift between the Old and the New Testament in the Bible.

PhraseNet for the Old Testament

PhraseNet for the New Testament

An interesting finding of the PhraseNet researchers was that using a dependency parser does not produce more meaningful results than simple pattern matching. They used the Stanford Parser, which required over 24 hours to process 1 MB of text.

Note that PhraseNets are visualizations of relations rather than of concepts. For example, they can clearly describe the possessive relation with one simple pattern, "X's Y", or the location relation with "X at Y". However, they cannot help with tasks such as visualizing everything that relates to a given concept in a text, or showing how two concepts are related in a text. They also miss the notion of discourse in the text. The full paper contains more examples, as well as a discussion of the applications undertaken.
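
To give a feel for the simple pattern-matching idea (this is not PhraseNet's actual code), here is a minimal Java sketch: a regular expression pulls out word pairs connected by "of", and their counts could then drive the font sizes and edges of the graph.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhrasePatternSketch {
    // Count pairs matching the pattern "X of Y", e.g. "land of milk".
    public static Map<String, Integer> xOfY(String text) {
        Pattern p = Pattern.compile("\\b(\\w+) of (\\w+)\\b", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        Map<String, Integer> counts = new HashMap<>();
        while (m.find()) {
            String pair = m.group(1).toLowerCase() + " -> " + m.group(2).toLowerCase();
            counts.merge(pair, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "a land of milk and honey; the children of Israel in the land of Egypt";
        xOfY(text).forEach((pair, n) -> System.out.println(pair + " (" + n + ")"));
    }
}
```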

Topical differences in multiple texts can be visualized by combining the clouds
Tag clouds can also be meaningfully combined within a single visualization, as proposed in the study "The two sides of the same story: Laskas & Gladwell on CTE & the NFL". This time the discourse is taken into account: the algorithm positions the words vertically by their average position in the text. Horizontally, the words are shifted to the left or the right side of the graph depending on where they occur more often. Those words that are similarly frequent in both texts appear in the middle. The frequency is, as usual, expressed with varying font size.
This tool is not particularly universal, but for comparing two similar texts it works well. Using additional dimensions, one could perhaps compare three or even four texts, but adding further dimensions would negatively influence the readability and usefulness of the tool. Instead, a different kind of graph can be used for visual comparison of topics discussed in several texts.


Stream graphs demonstrate how topics evolve over time
Stream graphs were originally designed to visualize how data changes over time, e.g. movie revenues or musical taste. An approach called ThemeRiver plots how important topics change over time in the underlying documents; its authors then compare how these changes correspond to political events of that time. There is also an interesting demo of stream graphs over the latest Twitter posts.

StreamGraph of Obama's speech


Time flow is similar to text flow, or discourse
Alternatively, time can also be expressed as text flow in a single document. One can then use stream graphs to show which topics are mentioned at the beginning, at the end, or throughout the document. The above figure is an attempt to represent the main topics in Obama's speech and their discourse flow. Stream graphs provide a suitable framework for visualizing discourse, but their potential is as yet unexplored.
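
To illustrate the data behind such a discourse-flow graph, here is a minimal Java sketch that splits a text into equal segments and counts a few topic words per segment. The topics, the segment count, and the placeholder text are arbitrary choices for the example.

```java
import java.util.Arrays;
import java.util.List;

public class DiscourseFlowSketch {
    // For each of `segments` equal slices of the text, count how often each topic word occurs.
    // These per-segment counts are exactly the kind of series a stream graph plots.
    public static int[][] topicFlow(String text, List<String> topics, int segments) {
        String[] tokens = text.toLowerCase().split("\\W+");
        int[][] counts = new int[topics.size()][segments];
        int segmentLength = Math.max(1, tokens.length / segments);
        for (int i = 0; i < tokens.length; i++) {
            int segment = Math.min(i / segmentLength, segments - 1);
            int topicIndex = topics.indexOf(tokens[i]);
            if (topicIndex >= 0) {
                counts[topicIndex][segment]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String speech = "economy jobs health care economy war war peace"; // placeholder text
        int[][] flow = topicFlow(speech, Arrays.asList("economy", "war"), 3);
        System.out.println(Arrays.deepToString(flow));
    }
}
```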


Splitting text into single words is harmful
The above tools use individual words as topic descriptors. A significant disadvantage is that some concepts are described by two or more words: hot dog, New York, Google Wave, Auckland University of Technology, etc. Split these items into words and each one conveys a completely different meaning.

The following posts will discuss other disadvantages of single words and describe approaches that visualize concepts and phrases.

Monday, October 19, 2009

How researchers and software engineers can make use of data in Wikipedia

Wikipedia is not just an online encyclopedia where people can, effectively and free of charge, look up definitions written by other people. It is also a very powerful resource for developing useful tools for analyzing written language. Just search for Wikipedia on Google Scholar and you will find over 140,000 research papers! This post is about practical applications of the data stored in Wikipedia.

At workshops on Wikipedia-related research held at prestigious AI conferences, e.g. WikiAI at IJCAI and People's Web meets NLP at ACL, I have learned about pretty amazing things one can implement using Wikipedia data, from computing semantic relatedness between words at a level comparable to humans to converting folksonomies of tags into ontologies. In general, the organizers distinguish between how Wikipedia can help AI in solving language-related tasks and how AI can help Wikipedia improve its quality and fight vandalism.
Let's concentrate on the former.

A huge barrier here is that Wikipedia is huge (and gets bigger!). There are tools available for processing the Wikipedia dumps, but the best-working one I have found so far is the open-source Java library Wikipedia Miner. Not just because it was developed at the University of Waikato, where I studied. The reasons are its easy installation, an intuitive object-oriented model for accessing Wikipedia data, additional tools for computing similarity between any two English phrases and for wikifying any text (i.e. linking its phrases to concepts explained in Wikipedia), as well as already implemented web services. Check out the online demos:
  • Search. If you search for a word like palm, it will list all the possible meanings, starting with the most likely one (Arecaceae, 65% likelihood), followed by all the other meanings, like Palm (PDA) and the palm of the hand. Clicking on a meaning shows the words used to refer to it, e.g. redirects like Palmtree and anchors like palm wood, as well as translations (language links) and broader terms (categories).

  • Compare. Here you can calculate, for example, that vodka is more related to wine than to water (a toy illustration of the idea behind such measures appears below).

  • Wikify. This feature helps you find out what concepts are discussed in any text or on any website. It is particularly practical for texts with many named entities, e.g. news articles, but not only for those. Here is this blog wikified with Wikipedia Miner (the links are added in blue and at the highest possible density).
Note: all this information can be easily accessed from your application when using the Wikipedia Miner library. I've done it in my topic indexing tool Maui and can tell you that it's relatively easy.
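
To give a feel for what Compare computes, here is a minimal Java sketch of a link-overlap relatedness measure in the spirit of Milne and Witten's Wikipedia Link-based Measure. It deliberately does not use the Wikipedia Miner API itself, and the article link sets are made-up toy data.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RelatednessSketch {
    // Relatedness of two articles from the overlap of the articles that link to them,
    // following the normalized-distance idea behind the Wikipedia Link-based Measure.
    public static double relatedness(Set<String> linksA, Set<String> linksB, int totalArticles) {
        Set<String> common = new HashSet<>(linksA);
        common.retainAll(linksB);
        if (common.isEmpty()) {
            return 0.0;
        }
        double max = Math.max(linksA.size(), linksB.size());
        double min = Math.min(linksA.size(), linksB.size());
        double distance = (Math.log(max) - Math.log(common.size()))
                / (Math.log(totalArticles) - Math.log(min));
        return 1.0 - distance;
    }

    public static void main(String[] args) {
        // Toy in-link sets; in reality these come from the Wikipedia link graph.
        Set<String> vodka = new HashSet<>(Arrays.asList("Russia", "Distillation", "Cocktail", "Ethanol"));
        Set<String> wine  = new HashSet<>(Arrays.asList("France", "Fermentation", "Cocktail", "Ethanol"));
        Set<String> water = new HashSet<>(Arrays.asList("Ocean", "Rain", "H2O"));
        int total = 3_000_000; // rough article count, just for the example
        System.out.println("vodka~wine:  " + relatedness(vodka, wine, total));
        System.out.println("vodka~water: " + relatedness(vodka, water, total));
    }
}
```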

Many other people are already using Wikipedia Miner (100 downloads in the last 6 months). It has also been referenced as a research tool in various published projects, including adding structure to search queries, finding similar videos, extending ontologies, creating language learning tools, and many more.