Monday, November 23, 2009

Tag clouds, phrase nets, stream graphs & co

Once the key concepts in a document are known, they can be combined into a more meaningful representation than a plain list. This blog post describes different methods for visualizing a document's topics and their semantic relations.

TextArc creates nets of words
TextArc was invented in 2002. The limitations of visualization tools at that time perhaps explain its rather simplistic approach. It first extracts the individual words that appear in a text and stems them. Then it computes the co-occurrences of word pairs and plots them in a very large circular graph, where important words appear in the center and less important ones are outside of the circle. Co-occurring words are placed next to each other. The graph is interactive: if a user clicks on a word, its connections to other words are activated, as in the TextArc example from Alice in Wonderland below. Without activating a word, however, the words are practically impossible to read and the visual groupings remain unclear.
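The first two steps (stemming and counting co-occurring word pairs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TextArc's actual code: the toy suffix-stripping stemmer `crude_stem` is hypothetical, and treating "co-occurrence" as "appearing in the same sentence" is an assumption made here for simplicity.

```python
import re
from collections import Counter
from itertools import combinations


def crude_stem(word):
    # Hypothetical toy stemmer: strips a few common English suffixes.
    # A real system would use a proper stemming algorithm instead.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def cooccurrences(text):
    """Count stemmed words and the word pairs that share a sentence."""
    word_counts = Counter()
    pair_counts = Counter()
    # Assumption: two words co-occur if they appear in the same sentence.
    for sentence in re.split(r"[.!?]+", text.lower()):
        stems = [crude_stem(w) for w in re.findall(r"[a-z']+", sentence)]
        word_counts.update(stems)
        # Sort so that each unordered pair is always stored as (a, b), a < b.
        for a, b in combinations(sorted(set(stems)), 2):
            pair_counts[(a, b)] += 1
    return word_counts, pair_counts


word_counts, pair_counts = cooccurrences(
    "Alice follows the rabbit. The rabbit talks. Alice talks to the rabbit."
)
print(word_counts.most_common(3))
print(pair_counts[("alice", "rabbit")])
```

The pair counts are the raw material for the layout: a plotting step would then pull frequently co-occurring words toward each other.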

TextArc for Alice in Wonderland

Tag clouds are widespread but far from perfect
Tag clouds were first mentioned in the literature somewhat later: according to Google Scholar, in 2003. Flickr, launched in 2004, successfully applied tag clouds to the problem of image search and categorization. The simplicity of the model was infectious, and tag clouds spread quickly. However, they are often criticized for their absence of structure and lack of meaning. There have been many attempts to improve tag clouds. Wordle creates typographically perfect images of tag clouds. Search for "tag clouds" on Google Scholar without the year restriction, and you will learn how to automatically cluster tags or add hierarchy to a folksonomy.

PhraseNets show what topics are connected by a given relation
PhraseNet is this year's answer to TextArc and tag clouds. It concentrates on specific co-occurrence patterns, such as "X begat Y", or a more generic one like "X of Y". As in conventional tag clouds, frequent terms are shown in a larger font. Words matched to the position of X are colored dark blue and those matched to the position of Y light blue. A customized graph layout algorithm prevents words from overlapping to ensure readability. Compared to conventional tag clouds, PhraseNet adds meaning to the graphs. The graphs below demonstrate the topical shift between the Old and the New Testaments of the Bible.

PhraseNet for the Old Testament

PhraseNet for the New Testament

An interesting finding of the PhraseNet researchers was that using a dependency parser does not produce more meaningful results than simple pattern matching. They used the Stanford Parser, which required over 24 hours to process 1 MB of text.

Note that PhraseNets visualize relations rather than concepts. For example, they can clearly describe the possessive relation with one simple pattern, "X's Y", or the location relation with "X at Y". However, they cannot help with tasks such as visualizing everything that relates to a given concept in a text, or showing how two concepts are related in a text. They also miss the notion of discourse in the text. The full paper contains more examples, as well as a discussion of the applications undertaken.
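To give a feel for how far simple pattern matching goes, here is a minimal sketch of extracting "X connector Y" pairs with a regular expression. It is not the PhraseNet implementation: the function name `phrase_net_edges` is made up for this post, and it assumes the connector is a separate word surrounded by whitespace (so "X of Y" or "X begat Y" work, but "X's Y" would need a different pattern).

```python
import re
from collections import Counter


def phrase_net_edges(text, connector):
    """Count (X, Y) pairs for every 'X <connector> Y' occurrence in text."""
    # The lookahead makes matches overlap, so in "A of B of C"
    # both (A, B) and (B, C) are found.
    pattern = re.compile(
        r"(?=\b(\w+)\s+" + re.escape(connector) + r"\s+(\w+)\b)",
        re.IGNORECASE,
    )
    edges = Counter()
    for match in pattern.finditer(text):
        x, y = match.group(1).lower(), match.group(2).lower()
        edges[(x, y)] += 1
    return edges


verse = "Abraham begat Isaac; and Isaac begat Jacob; and Jacob begat Judas."
edges = phrase_net_edges(verse, "begat")
for (x, y), count in edges.items():
    print(x, "->", y, count)
```

Each pair becomes a directed edge in the graph. Summing the counts per word would give the font size, and whether a word appears mostly as X or as Y would decide its shade of blue.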

Topical differences in multiple texts can be visualized by combining the clouds
Tag clouds can also be meaningfully combined within a single visualization, as proposed in the study "The two sides of the same story: Laskas & Gladwell on CTE & the NFL". This time the discourse is taken into account: the algorithm positions the words vertically by their average position in the text. Horizontally, the words are shifted to the left or the right side of the graph depending on the text in which they occur more often. Words that are similarly frequent in both texts appear in the middle. The frequency is, as usual, expressed by varying the font size.
This tool is not particularly universal, but for comparing two similar texts it works well. Using additional dimensions, one could perhaps compare three or even four texts, but adding further dimensions would negatively influence the readability and usefulness of the tool. Instead, a different kind of graph can be used for visual comparison of topics discussed in several texts.
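The positioning rule described above can be sketched as follows. This is a guess at the general idea, not the code behind the study: the function `two_sided_layout` and its output fields are invented for illustration, and it assumes raw word counts are comparable, which only holds when the two texts have similar lengths.

```python
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def two_sided_layout(text_a, text_b):
    """Compute x (which text), y (where in the text) and size per word."""
    stats = defaultdict(lambda: {"a": 0, "b": 0, "pos": []})
    for side, text in (("a", text_a), ("b", text_b)):
        words = tokenize(text)
        for i, word in enumerate(words):
            stats[word][side] += 1
            # Relative position in the text: 0.0 = start, 1.0 = end.
            stats[word]["pos"].append(i / max(len(words) - 1, 1))
    layout = {}
    for word, s in stats.items():
        total = s["a"] + s["b"]
        layout[word] = {
            # -1.0 = only in text A, 0.0 = balanced, +1.0 = only in text B.
            "x": (s["b"] - s["a"]) / total,
            # Average relative position across all occurrences.
            "y": sum(s["pos"]) / len(s["pos"]),
            # Total frequency, to be mapped to a font size.
            "size": total,
        }
    return layout


layout = two_sided_layout("brain injury football", "football league brain brain")
for word, place in sorted(layout.items()):
    print(word, place)
```

For texts of very different lengths, dividing each count by the length of its text before computing `x` would be a more sensible choice.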

Stream graphs demonstrate how topics evolve over time
Stream graphs were originally designed to visualize how data changes over time, e.g. movie revenues or musical taste. An approach called ThemeRiver plots how important topics change over time in the underlying documents; its authors then compare how these changes correspond to political events of that time. There is also an interesting demo of stream graphs built from the latest Twitter posts.

StreamGraph of Obama's speech

Time flow is similar to text flow, or discourse
Alternatively, time can also be expressed as the text flow within a single document. One can then use stream graphs to show which topics are mentioned at the beginning, at the end, or throughout the document. The above figure is an attempt to represent the main topics in Obama's speech and their discourse flow. Stream graphs provide a suitable framework for visualizing discourse, but their potential is as yet unexplored.
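The data behind such a graph is simple to compute: cut the document into equal segments and count how often each topic word falls into each one. The sketch below is an assumption about how this could be done, not the code used for the figure above; `topic_streams` and the default of ten segments are illustrative choices.

```python
import re


def topic_streams(text, topics, n_segments=10):
    """Return, per topic word, its number of mentions in each text segment."""
    words = re.findall(r"[a-z']+", text.lower())
    streams = {topic: [0] * n_segments for topic in topics}
    for i, word in enumerate(words):
        if word in streams:
            # Map the word index onto a segment index in [0, n_segments).
            segment = i * n_segments // len(words)
            streams[word][segment] += 1
    return streams


speech = "hope change hope war war peace peace hope"
streams = topic_streams(speech, ["hope", "war", "peace"], n_segments=4)
for topic, counts in streams.items():
    print(topic, counts)
```

Each list is one layer of the stream graph. The layers could then be drawn as stacked areas, for example with matplotlib's `stackplot` and its `baseline="wiggle"` option.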

Splitting text into single words is harmful
The above tools use individual words as topic descriptors. A significant disadvantage is that some concepts are described by two or more words: hot dog, New York, Google Wave, Auckland University of Technology, etc. Split these items into words and each one conveys a completely different meaning.
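One simple way around this, sketched here under the assumption that a list of known multi-word concepts is already available, is to match the longest known phrase at each position before falling back to single words. The function `tokenize_with_phrases` is hypothetical, and it ignores punctuation for brevity.

```python
def tokenize_with_phrases(text, phrases):
    """Split text on whitespace, but keep known multi-word phrases together."""
    words = text.split()
    phrase_set = {tuple(p.lower().split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    tokens = []
    i = 0
    while i < len(words):
        # Try the longest candidate first; a single word always matches.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = tuple(w.lower() for w in words[i:i + n])
            if n == 1 or candidate in phrase_set:
                tokens.append(" ".join(words[i:i + n]))
                i += n
                break
    return tokens


known = ["hot dog", "New York", "Auckland University of Technology"]
print("I ate a hot dog in New York".split())
print(tokenize_with_phrases("I ate a hot dog in New York", known))
```

The first line of output shows "hot" and "dog" as unrelated words, while the second keeps "hot dog" and "New York" as single units.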

The following posts will discuss other disadvantages of single words and describe approaches that visualize concepts and phrases.