Tuesday, December 1, 2009

Analyzing text? Think beyond words!

In a previous post, I discussed how the words that appear in a document can be visualized using
  • word nets, where words are nodes and their co-occurrences are links;
  • tag clouds, where font size indicates word frequency, but no relations are shown;
  • phrase nets, which show how words are connected by patterns like "X of Y";
  • stream graphs, which show how word usage changes over time.
All these approaches concentrate on words rather than on meaningful entities (concepts, named entities, semantic relations).

Using single words is problematic, but there are some solutions:

Problem 1: Ambiguity. Single words are limited in what they can express; e.g., python can mean both the animal and the programming language. When relations are not visible, the meaning gets lost.
Possible solution: Think in relations rather than single words. Semantic relations must be shown to disambiguate the meaning. They can be derived from corpora using co-occurrence statistics, or from vocabularies, thesauri, and ontologies.
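For illustration, here is a minimal sketch of deriving relations from co-occurrence statistics: two words are considered related if they appear in the same sentence. The class name and the crude tokenization are my own simplifications, not part of any tool discussed here; a real pipeline would add sentence splitting and stop-word removal.

```java
import java.util.*;

public class Cooccurrence {
    // Count how often two words appear in the same sentence.
    // Each key is an alphabetically ordered pair "a|b".
    public static Map<String, Integer> pairCounts(List<String> sentences) {
        Map<String, Integer> counts = new HashMap<>();
        for (String sentence : sentences) {
            // unique, lower-cased words of this sentence, sorted
            TreeSet<String> words = new TreeSet<>(
                Arrays.asList(sentence.toLowerCase().split("\\W+")));
            words.remove("");
            for (String a : words) {
                // pair each word with every word that sorts after it
                for (String b : words.tailSet(a, false)) {
                    counts.merge(a + "|" + b, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Words with a high pair count across many sentences (e.g. python and language vs. python and snake) hint at which sense of an ambiguous word a text uses.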

Problem 2: Synonymy. Words that mean the same thing are often visualized as several entities (US, U.S., usa, u.s.a., unitedstates), which again obscures the meaning.
Possible solution: Think in concepts rather than single words. The minimum solution for working with English is to remove plurals and possessives. Look into stemming, implement normalization routines. The more advanced solution is to use synonyms. Controlled vocabularies are constructed manually and help link synonymous words. Synonymy can also be computed statistically.
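To make the idea concrete, here is a minimal normalization sketch in Java. It only handles lower-casing, possessives and the simplest plurals; a proper stemmer (e.g. the Porter stemmer) covers far more cases, so treat this as an illustration only.

```java
public class Normalizer {
    // Map simple word variants to one form: "U.S." and "US" both
    // become "us"; "Dog's" and "dogs" both become "dog".
    public static String normalize(String word) {
        // lower-case and keep only letters and apostrophes
        String w = word.toLowerCase().replaceAll("[^a-z']", "");
        // strip the possessive ("dog's" -> "dog")
        if (w.endsWith("'s")) w = w.substring(0, w.length() - 2);
        w = w.replace("'", "");
        // strip a naive plural ("boxes" -> "box", "tags" -> "tag");
        // length checks avoid mangling short words like "us"
        if (w.length() > 4 && (w.endsWith("xes") || w.endsWith("ses"))) {
            w = w.substring(0, w.length() - 2);
        } else if (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss")) {
            w = w.substring(0, w.length() - 1);
        }
        return w;
    }
}
```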

Problem 3: Compounds. Many approaches tokenize text at white spaces, which corrupts the meaning of compound phrases and multi-word expressions. New Zealand becomes zealand and new; data mining turns into mining and data; hotdog into dog and hot.
Possible solution: Think in n-grams rather than single words. When splitting a text into elements, be careful where you split. If two- and three-word phrases (2-grams and 3-grams) are included in the analysis, their scores need an additional boost. Alternatively, one can use vocabularies of named entities and domain-relevant phrases, or even named entity extraction algorithms.
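A sketch of the n-gram idea: collect every 1-, 2- and 3-gram from the token sequence as a candidate topic, so that phrases like "New Zealand" or "data mining" survive the split. (The class name is mine; scoring and boosting are left out.)

```java
import java.util.*;

public class NGrams {
    // Return all n-grams up to maxN, as space-joined strings.
    public static List<String> nGrams(List<String> tokens, int maxN) {
        List<String> result = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                result.add(String.join(" ", tokens.subList(i, i + n)));
            }
        }
        return result;
    }
}
```

For the tokens [visit, New, Zealand], this yields the three single words plus "visit New", "New Zealand" and "visit New Zealand" as candidates.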

There are many more solutions than described here, and it is hard to judge which one is ideal. Statistics relies on the availability of text collections, whereas vocabularies and thesauri are bound to be incomplete and are not always available. A universal tool that would solve the problems of single words with a minimum of requirements is yet to be implemented. If you know of one, let me know and I will post it here.

Monday, November 23, 2009

Tag clouds, phrase nets, stream graphs & co

Once the key concepts in a document are known, they can be combined into a more meaningful representation than just a list. This blog post describes different methods that visualize a document's topics and their semantic relations.

TextArc creates nets of words
TextArc was invented in 2002. The constraints of visualization tools at that time perhaps explain its rather simplistic approach. It first extracts the individual words that appear in a text and stems them. Then it computes the co-occurrences of word pairs and plots them in a very large circular graph, where important words appear in the center and less important ones outside the circle. Co-occurring words are placed next to each other. The graph is interactive: if a user clicks on a word, its connections to other words are activated; see the example from Alice in Wonderland below. Without activating the words, however, it is pretty much impossible to read them, and the visual groupings are unclear.

TextArc for Alice in Wonderland
Tag clouds are widespread but far from perfect
Tag clouds were first mentioned in the literature later: according to Google Scholar, in 2003. In 2005 Flickr was launched, which successfully applied tag clouds to the problem of image search and categorization. The simplicity of this model proved infectious, and tag clouds spread. However, they are often criticized for their absence of structure and lack of meaning. There have been many attempts to improve tag clouds. Wordle creates typographically perfect images of them. Search for "tag clouds" on Google Scholar without the year restriction, and you will learn how to automatically cluster tags or add hierarchy to a folksonomy.

PhraseNets show which topics are connected by a given relation
PhraseNet is this year's answer to TextArc and tag clouds. It concentrates on specific co-occurrence patterns, such as "X begat Y", or more generic ones like "X of Y". As in conventional tag clouds, frequent terms are shown in a larger font. Words matching the position of X are colored in dark blue and those matching Y in light blue. A customized graph layout algorithm prevents words from overlapping to ensure readability. Compared to conventional tag clouds, PhraseNet adds meaning to the graphs. The graphs below demonstrate the topical shift between the Old and the New Testament in the Bible.

PhraseNet for the Old Testament

PhraseNet for the New Testament

An interesting finding of the PhraseNet researchers was that using a dependency parser does not produce more meaningful results than simple pattern matching. They used the Stanford Parser, which required over 24 hours to process 1 MB of text.

Note that PhraseNets are visualizations of relations rather than of concepts. For example, a PhraseNet can clearly describe the possessive relation with the simple pattern "X's Y", or the location relation with "X at Y". However, it cannot help with tasks such as visualizing everything that relates to a given concept in a text, or showing how two concepts are related in a text. It also misses the notion of discourse in the text. The full paper contains more examples, as well as a discussion of the applications undertaken.

Topical differences in multiple texts can be visualized by combining the clouds
Tag clouds can also be meaningfully combined within a single visualization, as proposed in the study "The two sides of the same story: Laskas & Gladwell on CTE & the NFL". This time the discourse is taken into account: the algorithm positions the words vertically by their average position in the text. Horizontally, the words are shifted to the left or the right side of the graph depending on where they occur more often. Words that are similarly frequent in both texts appear in the middle. The frequency is, as usual, expressed by varying font size.
This tool is not particularly universal, but for comparing two similar texts it works well. Using additional dimensions, one could perhaps compare three or even four texts, but further dimensions would hurt the readability and usefulness of the tool. Instead, a different kind of graph can be used for the visual comparison of topics discussed in several texts.
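The two coordinates described above can be sketched in a few lines. This is a hypothetical helper of my own, not code from the study: the horizontal coordinate is derived from a word's frequency in each text, the vertical one from its average relative position.

```java
public class ComparisonCloud {
    // Horizontal coordinate in [-1, 1]: negative means the word occurs
    // mostly in text A, positive mostly in text B, near 0 when it is
    // similarly frequent in both.
    public static double side(int freqA, int freqB) {
        return (double) (freqB - freqA) / (freqA + freqB);
    }

    // Vertical coordinate in [0, 1]: the word's average relative
    // position, given its character offsets and the text length.
    public static double avgPosition(int[] offsets, int textLength) {
        double sum = 0;
        for (int o : offsets) sum += (double) o / textLength;
        return sum / offsets.length;
    }
}
```

Font size would then be scaled by the word's total frequency, as in an ordinary tag cloud.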

Stream graphs demonstrate how topics evolve over time
Stream graphs were originally designed to visualize how data changes over time, e.g. movie revenues or musical taste. An approach called ThemeRiver plots how important topics in the underlying documents change over time; its authors then compare how these changes correspond to political events of the period. There is also an interesting demo of stream graphs applied to the latest Twitter posts.

StreamGraph of Obama's speech

Time flow is similar to text flow, or discourse
Alternatively, time can also be expressed as text flow within a single document. One can then use stream graphs to show which topics are mentioned in the beginning, at the end, or throughout the document. The figure above is an attempt to represent the main topics in Obama's speech and their discourse flow. Stream graphs provide a suitable framework for visualizing discourse, but their potential is as yet unexplored.

Splitting text into single words is harmful
The above tools use individual words as topic descriptors. A significant disadvantage is that some concepts are described by two or more words: hot dog, New York, Google Wave, Auckland University of Technology, etc. Split these items into single words and each one conveys a completely different meaning.

The following posts will discuss other disadvantages of single words and describe approaches that visualize concepts and phrases.

Monday, October 19, 2009

How researchers and software engineers can make use of data in Wikipedia

Wikipedia is not just an online encyclopedia where people can, effectively and free of cost, look up definitions written by other people. It is also a very powerful resource for developing tools that analyze written language. Just search for Wikipedia on Google Scholar and you will find over 140,000 research papers! This post is about practical applications of the data stored in Wikipedia.

At workshops on Wikipedia-related research held at prestigious AI conferences, e.g. WikiAI at IJCAI and People's Web meets NLP at ACL, I have learned about pretty amazing things one can implement using Wikipedia data: from computing semantic relatedness between words at a level comparable to humans, to converting folksonomies of tags into ontologies. In general, the organizers distinguish between how Wikipedia can help AI solve language-related tasks and how AI can help Wikipedia improve its quality and fight vandalism.
Let's concentrate on the former.

One barrier is that Wikipedia is huge (and keeps growing!). There are tools available for processing the Wikipedia dumps, but the best one I have found so far is the open-source Java library Wikipedia Miner. Not just because it was developed at the University of Waikato, where I studied: the reasons are its easy installation, an intuitive object-oriented model for accessing Wikipedia data, and additional tools for computing similarity between any two English phrases, wikifying any text (i.e. linking its phrases to concepts explained in Wikipedia), and even ready-made web services. Check out the online demos:
  • Search. If you search for a word like palm, it will list all of its possible meanings, starting with the most likely one (Arecaceae, 65% likelihood), followed by all the others, like Palm (PDA) and the palm of a hand. Clicking on a meaning shows the words used to refer to it, e.g. redirects like Palmtree and anchors like palm wood, as well as translations (language links) and broader terms (categories).

  • Compare. Here you can calculate things like vodka is more related to wine than to water.

  • Wikify. This feature helps to find out which concepts are discussed in any text or on any website. It is particularly practical for texts with many named entities, e.g. news articles, but not only for those. Here is this blog wikified with Wikipedia Miner (the links are added in blue, at the highest possible density).
Note: all this information can be easily accessed from your own application using the Wikipedia Miner library. I have done it in my topic indexing tool Maui and can tell you that it is relatively easy.

Many other people are already using Wikipedia Miner (100 downloads in the last 6 months). It has also been referenced as a research tool in various published projects, including adding structure to search queries, finding similar videos, extending ontologies, creating language-learning tools and many more.

Sunday, October 18, 2009

Data sets for keyphrase extraction and topic indexing

A few months ago I made available several collections that researchers can use to develop and evaluate their systems for tasks related to topic indexing. But I didn't blog about it! In the meantime, people have found the data anyway and downloaded it. Here is a summary of the collections, in order of their popularity.

The most popular collection is Wiki-20. It contains 20 computer science documents, each with main topics assigned independently by 15 teams of graduate students. So, each document has 15 sets of Wikipedia articles that represent the main topics in it, according to the students.
This data set was downloaded 19 times.

There is also a corpus of 180 documents indexed by human taggers that can be used for keyphrase extraction and tagging. It has been harvested from data on CiteULike and includes only high-quality tags. This data set (CiteULike-180) was downloaded 15 times.

There are two corpora for keyphrase indexing, i.e. assignment of terms from controlled vocabularies:
  1. FAO-780 with 780 documents indexed by just one indexer (9 downloads).
  2. FAO-30 with 30 documents indexed independently by 6 professional indexers each (so far 8 downloads).
This data was provided by the Food and Agriculture Organization of the United Nations. Agrovoc serves as vocabulary, but any other vocabulary in SKOS format can be used in a similar way.

These statistics indicate that traditional term assignment is no longer a popular research topic. Tagging and, particularly, indexing with terms from Wikipedia are researched more actively. It is great to see that Wikipedia gets the attention it deserves and that my idea of using it as a controlled vocabulary has been picked up by others.

More information is also available on the wiki page explaining the multiply indexed data sets and on a page listing other resources for topic indexing.

Wednesday, July 29, 2009

Wiki pages about topic indexing

Here is a list of updates on Maui's Google Code page, which will hopefully make it easier for others to use Maui and experiment with its data sets:

Let me know if something is missing.

Thursday, July 16, 2009

Updated release and French term assignment example

It turns out I had two pending issues on Google Code, where I host the Maui algorithm. By default, the project owner does not get notifications!

So today I went ahead and fixed one of the requests: to have everything in a jar file.
I've also updated the example files and added Javadoc documentation. Soon I will publish detailed installation instructions (in addition to the usage instructions), but for now, here is just one command-line example. It shows how to create a topic indexing model and apply it to a new document, using term assignment on French documents as an example. Download the latest release of Maui (1.1) and then try this:

java -Xmx1024m -classpath maui-1.0.jar maui.main.FrenchExample

If the Java classpath is not yet linked to Maui's libraries, add this after maui-1.0.jar:


The output should be something like:

-- Building the model...
--- Loading the vocabulary...
--- Building the Vocabulary index from the SKOS file...


-- Reading the input documents...


--- Computing candidates...


--- Building classifier


-- Extracting keyphrases...

-- Keyphrases and feature values:

http://www.fao.org/aos/agrovoc#c_4830,'Produit laitier',0,0,0.003276,0.00335,...,False


http://www.fao.org/aos/agrovoc#c_3919,'Commerce international',...,True

http://www.fao.org/aos/agrovoc#c_8323,'Besoin en eau',...,False
-- 2.0 correct

-- Evaluation results based on 1 document:
Avg. number of correct keyphrases per document: 2 +/- 0
Precision: 25 +/- 0
Recall: 13.33 +/- 0
F-Measure: 17.39

For each test document (in this case, just one is used), Maui outputs each topic's Agrovoc ID, the name of the concept (e.g. Besoin en eau), some feature values, and True or False, depending on whether this term has been assigned to the document by a human indexer. The evaluation is performed based on these values. Because the directory already contains a .key file, Maui does not overwrite it; otherwise it would create one with the automatically generated topics.

This is of course just a demonstration: training on just two documents and testing on a third. But at least it shows (I hope!) how simple it can be to use Maui.

Monday, July 13, 2009

Useful web resources related to automatic topic indexing

Tools (in alphabetical order):
for keyword and keyphrase extraction, tagging, autotagging, terminology extraction, term assignment, text classification, text categorization, topical metadata extraction and topic indexing with Wikipedia
  • Bibclassify - A module in CDS Invenio (CERN’s document server software) for automatic assignment of terms from SKOS vocabularies, developed on the High Energy Physics vocabulary. Developed in collaboration between CERN and DESY. There is also a hacking guide.

  • Extractor - Commercial software for keyword extraction in different languages. There is also a demo. Developed at the National Research Council of Canada.

  • Keyphrase extraction algorithm Kea. Can be used for both automatic keyphrase extraction and term assignment with controlled vocabularies. Developed at the University of Waikato.

  • Multi-purpose topic indexing algorithm Maui. Suitable for automatic term assignment, subject indexing, keyword extraction, keyphrase extraction, indexing with Wikipedia, autotagging, terminology extraction. Developed at the University of Waikato. Maui is also available on sourceforge.

  • TerMine - a term extraction tool developed at the National Centre for Text Mining.

  • Topia term extractor - A part-of-speech and frequency based term extraction tool implemented in Python. Here is a term extraction demo based on this tool, implemented by FiveFilters.org.

  • Orchestr8 Keyword Extraction - An API based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages.

  • Wikifier – An online demo of detecting Wikipedia articles in text developed at the Language and Information Technologies research group at the University of North Texas.

  • Wikipedia Miner – An API for accessing Wikipedia data, which also provides a tool for mapping any document to a set of relevant Wikipedia articles, similar to indexing with Wikipedia. Developed at the University of Waikato. Demo 1 and demo 2.

  • SEO keyword extraction - An online keyword and keyphrase extraction tool for search engine optimization

  • Scorpion – OCLC’s tool for automatic classification of documents.

  • Tagthe.net – A demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations.

  • Yahoo term extraction - Web-service based content analysis via term extraction, includes a demo.

Vocabularies in SKOS format and test data:
More resources:
  • NLM Indexing Initiative – Website about National Library of Medicine’s project on automatic indexing using MeSH terms. Research details, evaluation and examples.
  • Dublin Core tools – A list of tools for automatic extraction of Dublin Core metadata
  • ASI resources – List of commercial software for back-of-the-book indexing by American Society of Indexing
  • ANZSI resources – List of software tools provided by the Australian and New Zealand Society of Indexing

Wednesday, July 8, 2009

What do subject indexing, keyphrase extraction and autotagging have in common? Terminology clarification

There has been a lot of confusion about the tasks related to topic indexing. Here is an overview of these tasks, the terms used to refer to them, and what they stand for.
  1. Text categorization (or: text classification) - Very few general categories, like Politics or News, are assigned usually from a relatively small vocabulary.

  2. Term assignment (or: subject indexing) - A document's main topics are expressed using terms from a large vocabulary, e.g. a domain-specific thesaurus.

  3. Keyphrase extraction (or: keyword extraction, key term extraction) - A document's main topics are expressed using the most prominent words and phrases in the document.

  4. Terminology extraction (similar to back-of-the-book indexing) - All domain relevant words and phrases are extracted from a document.

  5. Full-text indexing (or: full indexing, free text indexing) - All words and phrases, sometimes excluding the stopwords, are extracted from a document.

  6. Keyphrase indexing (or: keyphrase assignment) - A general term, which refers to both term assignment and keyphrase extraction.

  7. Tagging (or: collaborative tagging, social tagging and when performed automatically: autotagging, automatic tagging) - The user defines as many topics as desired. Any word or phrase can serve as a tag. Prevalently applied on collaborative websites.

  8. Clustering is related to topic indexing in that it identifies groups of documents on the same topic; however, these groups are unlabeled.

Sunday, June 28, 2009

How to generate visualizations of topics with Dotty and GraphViz

In a previous post, I showed the Wikipedia article titles that Maui assigned to the introduction of my thesis. To create a nicer visualization of these topics than the less expressive tag clouds, I used WikipediaMiner and Dotty (via GraphViz for Mac).

I have written a script that generates a .gv file with the following content:

graph G {
"Machine learning" -- "Keywords"[style=invis];
"Machine learning" -- "Natural language" [penwidth = "3"];
"Machine learning" -- "Knowledge" [penwidth = "2"];
The nodes are the names of the topics, and the links are weighted by semantic relatedness values computed using WikipediaMiner. The values are converted into line widths in the generated graph:
  • style=invis if relatedness = 0;
  • otherwise penwidth is used.
The penwidth number is the original relatedness value (between 0 and 1) multiplied by 10 and rounded down to an integer.
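The conversion just described can be sketched as a small helper (the class name is my own; the attribute syntax matches the .gv snippet above):

```java
public class DotEdges {
    // Turn a relatedness score (0..1) into a GraphViz edge attribute:
    // invisible if unrelated, otherwise a pen width of up to 10.
    public static String edgeAttribute(double relatedness) {
        if (relatedness == 0) return "[style=invis]";
        return "[penwidth = \"" + (int) (relatedness * 10) + "\"]";
    }
}
```

For example, a relatedness of 0.35 between two topics produces the attribute [penwidth = "3"], i.e. a line of medium thickness.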

Update: Additionally, the top keyphrase returned by the algorithm is defined as the root of the graph (e.g. graph [root="Index (search engine)"] for the graph below). Also, the font size reflects the rank of the keyphrase as determined by the algorithm.
Click on the graph to see it in full resolution, or view the complete GraphViz script.

The beauty of GraphViz is that the generated graph can be exported into any format and expanded to any required size.

Furthermore, such a visualization can be generated for any list of topics, as long as they can be mapped to titles of Wikipedia articles:

// first check if there is an article whose title matches "topic" exactly
Article article = wikipedia.getArticleByTitle(topic);

// otherwise retrieve the most likely mapping, folding the case of the query
if (article == null) {
    CaseFolder cs = new CaseFolder();
    article = wikipedia.getMostLikelyArticle(topic, cs);
}

The script for generating such graphs is included as part of the Maui software.

Saturday, June 27, 2009

How often do taggers agree with each other?

... or rather: how often did the taggers of my thesis agree with each other?

Nine of my friends (all Computer Science graduates and IT professionals), who helped me by proofreading my thesis Human-competitive automatic topic indexing, each chose five tags that describe its main topics. Each of them was familiar with my work and read parts of the thesis and the abstract.

General impression

There was not a single tag on which all nine people agreed! Five of them picked tagging, although this is only one of the three tasks addressed in the thesis. There was a struggle with compound words like topic indexing (should it be just indexing and topics, or document topics?) and with adjectives (should they be used as separate tags, e.g. automatic, statistical, or as modifiers of existing tags, e.g. automatic tagging?).

One person picked controlled vocabularies, another controlled-vocabulary. When comparing the results, I treated these tags as the same thing; however, I didn't do so with other tags that also represented the same thing but were expressed slightly differently: topic indexing and topic assignment. In general, everyone agreed on the general topics but expressed them differently.

Two topic clouds (same tags, but different layout) show all tags assigned by everyone to the thesis:

Tag cloud 1

algorithm automatic automatic tagging automatic topic indexing
artificial intelligence computer science
controlled vocabularies
document categorization document topics
domain-specific knowledge
encyclopedic knowledge human competitive
human indexing
indexing indexing methods kea
keyphrase extraction
machine learning natural language processing
tag hierarchies taxonomies term assignment
topic indexing
topic assignment topics semantic
supervised learning
statistical wikipedia

Tag cloud 2

tagging wikipedia indexing machine learning
topic indexing controlled vocabularies keyphrase extraction
computer science automatic topic indexing
automatic automatic tagging auto-tagging artificial intelligence
document categorization document topics domain-specific knowledge
encyclopedic knowledge human competitive human indexing
indexing methods kea natural language processing tag hierarchies
taxonomies term assignment topic assignment topics semantic
supervised learning statistical

Consistency of taggers of my thesis

Consistency analysis is a traditional way of assessing indexing quality (more on this below). I applied this metric to evaluate the tags assigned to my thesis, and here are the results:

A - 22.5
D - 16.1
G - 20.3
J - 2.5
K - 27.8
N - 15.0
S - 5.4
T1 - 8.3
T2 - 29.6

The average consistency in this group is 16.4%, with the best tagger achieving nearly 30%. These results are based on only one document and should therefore be taken only as a guideline.
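Inter-indexer consistency between two topic sets can be computed, for example, with Rolling's measure: 2C / (A + B), where C is the number of topics the two indexers share and A and B are the sizes of their topic sets. The sketch below uses this common measure for illustration; the exact computation behind the numbers above may differ in detail.

```java
public class Consistency {
    // Rolling's inter-indexer consistency, as a percentage:
    // 200 * |A ∩ B| / (|A| + |B|).
    public static double rolling(java.util.Set<String> a,
                                 java.util.Set<String> b) {
        java.util.Set<String> common = new java.util.HashSet<>(a);
        common.retainAll(b);
        return 200.0 * common.size() / (a.size() + b.size());
    }
}
```

Two taggers who each assign five tags and share three of them are thus 60% consistent; identical sets give 100%, disjoint sets 0%.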

About indexing consistency

In traditional libraries, inter-indexer consistency analysis is used to assess how well professionals assign subject headings to library holdings. The higher the consistency, the better the topic-based search in the catalog. This is a logical consequence: if a librarian agrees with a colleague, they are also likely to agree with the patron.

Admittedly, tagging is slightly different. Taggers, who assign tags for their own use, choose them based on personal preferences that might not be of use to others. But since tags are widely used by others, their quality is as important as that of subject headings in libraries.

In one of the experiments, reported earlier, I analyzed the consistency of taggers on CiteULike and used it to evaluate the automatic tags produced by Maui. The consistency of the taggers varied from 0 to 91%, with an average of 18.5%. Thus, my friends performed about as well as an average CiteULike tagger.

Friday, June 26, 2009

Document keywords from Wikipedia

Maui can be used to identify the main topics in a document and express them in the form of well-formed titles of Wikipedia articles.

To create a training model, I used a collection of 20 computer science papers indexed by graduate students in an indexing experiment for my AAAI paper "Topic indexing with Wikipedia".

Then I applied this model to generate ten keywords and keyphrases for the Introduction chapter of my PhD Thesis. I am very pleased with the results:

The best thing about keyphrase extraction with Wikipedia is that it links each term to a concept -- the Wikipedia article explaining this term. Thus, the resulting keyphrase sets will be consistent: they will all refer to "Dissertation" and not "PhD thesis" or "Doctoral thesis", regardless of which term is actually used in the document to express this concept.

Each topic can be directly linked to its Wikipedia article by adding "http://en.wikipedia.org/wiki/" in front of the term, as I did in the above list.
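As a tiny sketch of this linking step (the class name is mine): note that Wikipedia replaces spaces in article titles with underscores, so the term needs that substitution before the prefix is added.

```java
public class WikiLink {
    // Build a Wikipedia URL from an article title; spaces in titles
    // become underscores in the URL.
    public static String wikiUrl(String title) {
        return "http://en.wikipedia.org/wiki/" + title.replace(' ', '_');
    }
}
```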

How to use Maui

Here are some usage instructions that are also published on Maui's wiki page.

Preparing the data

After Maui is installed, there are two ways of using it: from the command line and from the Java code. Either way, the input data is required first. The data directory in Maui's download package contains some examples of input data.

1. Formatting the document files.
Each document has to be stored individually in text form, in a file with the extension .txt. Maui takes as input the name of the directory containing such files. If a model needs to be created first, the same directory should contain the main topics assigned manually to each document.

2. Formatting the topic files.
The topic sets need to be stored individually in text form, one topic per line, in a file with the same name as the document text, but with the extension .key.

3. Maui's Output.
If Maui is used to generate main topics for new documents, it will create a .key file for each document in the input directory. If topics are generated but .key files already exist, the existing topics are used as the gold standard for evaluating the automatically extracted ones.

Command line usage

Maui can be used directly from the command line. The general command is:
java maui.main.MauiModelBuilder
(or maui.main.MauiTopicExtractor)
-l directory (directory with the data)
-m model (model file)
-v vocabulary (vocabulary name)
-f {skos|text} (vocabulary format)
-w database@server (wikipedia location)
Which class is used depends on the current mode of topic indexing. MauiModelBuilder is used when a topic indexing model is created from documents with existing topics. MauiTopicExtractor is used, once a model has been created, to assign topics to new documents.

Examples with experimental data are supplied in the Maui package. The following commands refer to the directories with this data. They correspond to different topic indexing tasks:

1. Automatic tagging and keyphrase extraction - when topics are extracted from document text itself.
MauiModelBuilder -l data/automatic_tagging/train/
-m tagging_model
MauiTopicExtractor -l data/automatic_tagging/test/
-m tagging_model
2. Term assignment - when topics are taken from a controlled vocabulary in SKOS format.
MauiModelBuilder -l data/term_assignment/train/
-m assignment_model
-v agrovoc
-f skos
MauiTopicExtractor -l data/term_assignment/test/
-m assignment_model
-v agrovoc
-f skos
3. Topic indexing with Wikipedia - when topics are Wikipedia article titles. Note that in this case WikipediaMiner needs to be installed and running first.

MauiModelBuilder -l data/wikipedia_indexing/train/
-m indexing_model
-v wikipedia
-w enwiki@localhost
MauiTopicExtractor -l data/wikipedia_indexing/test/
-m indexing_model
-v wikipedia
-w enwiki@localhost
For terminology extraction, use the command line argument -n set to a high value to extract all possible candidate topics in the document.

Thursday, June 4, 2009

Maui will be presented at EMNLP'09

The paper Human-competitive tagging using automatic keyphrase extraction, co-authored with Eibe Frank and Ian Witten, was accepted for the EMNLP conference (Conference on Empirical Methods in Natural Language Processing) in Singapore this August.

In this paper, we evaluate the tagging quality of users of the collaborative platform CiteULike using traditional methods. Inter-indexer consistency measures how well human taggers agree with their co-taggers on what tags should be assigned to a particular document. The higher the agreement, the higher the usefulness of the assigned tags.

The consistency of CiteULike's taggers resembles a power-law distribution, with a few users achieving high consistency values and a long tail of inconsistent taggers. On average, the taggers are 19% consistent with each other. Based on the consistency values, we identify a group of best-performing taggers and use them as a gold standard to evaluate an automatic tagging technique. There are 35 such taggers, who achieve an average consistency of 38%.

Next, we apply the keyphrase extraction algorithm Kea and the topic indexing algorithm Maui to the documents tagged by these users and compare the automatically assigned tags to those assigned by humans. Maui's consistency with all taggers is 24%, 5 percentage points higher than that of the humans. Maui's consistency with the best taggers is 35%: slightly lower than their consistency with each other, but still better than that of 17 of the 35 taggers.

The approach combines machine learning with statistical and linguistic analysis of the words and phrases that appear in the documents. Maui also uses the online encyclopedia Wikipedia as a knowledge base for computing semantic information about words.

Interested? Read the full paper: Human-competitive tagging using automatic keyphrase extraction

Saturday, May 30, 2009

What is Maui about?

The Maui topic indexing algorithm was created as a part of my PhD in Computer Science at the University of Waikato.
The title of the PhD thesis is

Human-competitive automatic topic indexing

Here is its abstract, which sums up what the algorithm is about:

Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document’s topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance.

Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples.

This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is “human-competitive” because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.

Tuesday, May 19, 2009

Maui - first file release!

Today I submitted the first version of Maui to SourceForge, and I am currently adding Maui to Google Code. In a later post I plan to write about what Maui does, as well as about my experience with hosting and versioning an open source project on both platforms. But for now, this is the first post on this blog.