Sunday, June 28, 2009

How to generate visualizations of topics with Dotty and GraphViz

In a previous post, I showed the Wikipedia article titles that Maui assigned to the introduction of my thesis. To create a nicer visualization of these topics than the less expressive tag clouds, I used WikipediaMiner and Dotty (via GraphViz for Mac).

I wrote a script that generates a .gv file with the following content:

graph G {
"Machine learning" -- "Keywords"[style=invis];
"Machine learning" -- "Natural language" [penwidth = "3"];
"Machine learning" -- "Knowledge" [penwidth = "2"];
...
}
The nodes are the names of the topics, and the links express semantic relatedness values computed using WikipediaMiner. The values are modified to reflect the line width in the generated graph:
  • style=invis if relatedness = 0;
  • otherwise penwidth is used.
The penwidth number is the original relatedness value (between 0 and 1) multiplied by 10 and converted to an integer.
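For illustration, here is a minimal sketch of how such an edge line could be produced (this is not the actual Maui script; the relatedness values are assumed to be precomputed, for example with WikipediaMiner):

// Sketch only: converts a relatedness value (0..1) into one GraphViz edge line,
// following the mapping described above (integer conversion done by truncation here).
public class DotEdge {

    static String edge(String from, String to, double relatedness) {
        int width = (int) (relatedness * 10); // scale 0..1 to a line width of 0..10
        if (width == 0) {
            // unrelated topics: keep the edge but make it invisible
            return "\"" + from + "\" -- \"" + to + "\" [style=invis];";
        }
        return "\"" + from + "\" -- \"" + to + "\" [penwidth = \"" + width + "\"];";
    }

    public static void main(String[] args) {
        System.out.println(edge("Machine learning", "Keywords", 0.0));
        System.out.println(edge("Machine learning", "Natural language", 0.34));
    }
}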

Update: Additionally, the top keyphrase returned by the algorithm is defined as the root of the graph (e.g. graph [root="Index (search engine)"] for the graph below). The font size also reflects the rank of the keyphrase as determined by the algorithm.
Click on the graph to see it in full resolution, or view the complete GraphViz script.

The beauty of GraphViz is that the generated graph can be exported to any format and scaled to any required size.

Furthermore, the visualization can be generated for any list of topics, as long as they can be mapped to titles of Wikipedia articles:

// first check if there is an article with the title "topic"
Article article = wikipedia.getArticleByTitle(topic);

// otherwise retrieve the most likely mapping
if (article == null) {
    CaseFolder cs = new CaseFolder();
    article = wikipedia.getMostLikelyArticle(topic, cs);
}

The script for generating such graphs is included as part of the Maui software.

Saturday, June 27, 2009

How often do taggers agree with each other?

... or rather, how often did the taggers of my thesis agree with each other?

Nine of my friends (all Computer Science graduates and IT professionals), who helped me by proofreading my thesis Human-competitive automatic topic indexing, chose five tags that describe its main topics. Each of them was familiar with my work and had read parts of the thesis and the abstract.

General impression

There was not a single tag on which all nine people agreed! Five of them picked tagging, although this is only one of the three tasks addressed in the thesis. There was a struggle with compound words like topic indexing (should it be just indexing and topics, or document topics?) and with adjectives (should they be used as separate tags, e.g. automatic, statistical, or as modifiers of existing tags, e.g. automatic tagging?).

One person picked controlled vocabularies, another controlled-vocabulary. When comparing the results, I treated these as the same tag; however, I did not do this for other tags that also represented the same thing but were expressed slightly differently, such as topic indexing and topic assignment. In general, everyone agreed on the main topics but expressed them differently.

Two tag clouds (the same tags, but with different layouts) show all the tags assigned by everyone to the thesis:

Tag cloud 1

algorithm automatic automatic tagging automatic topic indexing
auto-tagging
artificial intelligence computer science
controlled vocabularies
document categorization document topics
domain-specific knowledge
encyclopedic knowledge human competitive
human indexing
indexing indexing methods kea
keyphrase extraction
machine learning natural language processing
tagging
tag hierarchies taxonomies term assignment
topic indexing
topic assignment topics semantic
supervised learning
statistical wikipedia

Tag cloud 2

tagging wikipedia indexing machine learning
topic indexing controlled vocabularies keyphrase extraction
computer science automatic topic indexing
algorithm
automatic automatic tagging auto-tagging artificial intelligence
document categorization document topics domain-specific knowledge
encyclopedic knowledge human competitive human indexing
indexing methods kea natural language processing tag hierarchies
taxonomies term assignment topic assignment topics semantic
supervised learning statistical


Consistency of taggers of my thesis

Consistency analysis is a traditional way of assessing indexing quality (more on this below). I applied this metric to evaluate the tags assigned to my thesis; here are the results (consistency values in percent, one per tagger):

A - 22.5
D - 16.1
G - 20.3
J - 2.5
K - 27.8
N - 15.0
S - 5.4
T1 - 8.3
T2 - 29.6

The average consistency in this group is 16.4%, with the best tagger achieving nearly 30%.
These results are based on a single document and should therefore be treated only as a guideline.

About indexing consistency

In traditional libraries, inter-indexer consistency analysis is used to assess how well professionals assign subject headings to library holdings. The higher the consistency, the better topic-based search in the catalogue will work. This is a logical consequence: if a librarian agrees with a colleague, they are also likely to agree with the patron.
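A common inter-indexer consistency measure (Rolling's) is twice the number of topics two indexers share, divided by the total number of topics they assigned. A minimal sketch in Java, as an illustration only (not Maui's code):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: inter-indexer consistency as 2 * C / (A + B), where C is
// the number of topics two indexers have in common and A and B are the sizes
// of their respective topic sets.
public class IndexerConsistency {

    static double consistency(List<String> indexerA, List<String> indexerB) {
        Set<String> common = new HashSet<String>(indexerA);
        common.retainAll(new HashSet<String>(indexerB));
        return 2.0 * common.size() / (indexerA.size() + indexerB.size());
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("tagging", "wikipedia", "indexing");
        List<String> b = Arrays.asList("tagging", "topic indexing", "machine learning");
        System.out.println(consistency(a, b)); // 2 * 1 / (3 + 3) = 0.33...
    }
}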

Admittedly, tagging is slightly different. Taggers who assign tags for their own use choose them based on personal preferences that might not be of use to others. But since tags are widely used by others, their quality is as important as that of subject headings in libraries.

In an experiment reported earlier, I analyzed the consistency of taggers on CiteULike and used it to evaluate automatic tags produced by Maui. The consistency of those taggers varied from 0 to 91%, with an average of 18.5%. Thus, my friends performed about as well as an average CiteULike tagger.

Friday, June 26, 2009

Document keywords from Wikipedia

Maui can be used to identify the main topics in a document and express them in the form of well-formed titles of Wikipedia articles.

To create a training model, I used a collection of 20 computer science papers indexed by graduate students in an indexing experiment for my AAAI paper "Topic indexing with Wikipedia".

Then I applied this model to generate ten keywords and keyphrases for the Introduction chapter of my PhD Thesis. I am very pleased with the results:


The best thing about keyphrase extraction with Wikipedia is that it links each term to a concept: the Wikipedia article explaining that term. Thus, the resulting keyphrase sets will be consistent: they will all refer to "Dissertation" and not "PhD thesis" or "Doctoral thesis", regardless of which term is actually used in the document to express this concept.

Each topic can be directly linked to its Wikipedia article by adding "http://en.wikipedia.org/wiki/" in front of the term, as I did in the list above.
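Programmatically, this is a one-liner; here is a tiny hypothetical helper (the method is mine, not part of Maui), keeping in mind that Wikipedia uses underscores instead of spaces in article URLs:

// Illustration only: build the Wikipedia URL for a topic title.
// Titles containing other special characters would additionally need URL encoding.
static String wikipediaUrl(String title) {
    return "http://en.wikipedia.org/wiki/" + title.replace(' ', '_');
}

// e.g. wikipediaUrl("Machine learning") -> http://en.wikipedia.org/wiki/Machine_learning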

How to use Maui

Here are some usage instructions that are also published on Maui's wiki page.

Preparing the data

After Maui is installed, there are two ways of using it: from the command line and from Java code. Either way, the input data needs to be prepared first. The data directory in Maui's download package contains some examples of input data.

1. Formatting the document files.
Each document has to be stored individually, in plain text form, in a file with the extension .txt. Maui takes as input the name of the directory containing such files. If a model needs to be created first, the same directory should also contain the main topics assigned manually to each document.

2. Formatting the topic files.
The topic sets need to be stored individually in text form, one topic per line, in a file with the same name as the document text, but with the extension .key.

3. Maui's output.
If Maui is used to generate main topics for new documents, it will create a .key file for each document in the input directory. If topics are generated but .key files already exist, the existing topics are used as the gold standard for evaluating the automatically extracted ones.
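For example, a training directory might look like this (the file names are made up for illustration):

data/my_training_set/
    paper1.txt    (full text of the first document)
    paper1.key    (its manually assigned topics, one per line)
    paper2.txt
    paper2.key
    ...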

Command line usage

Maui can be used directly from the command line. The general command is:
java maui.main.MauiModelBuilder
(or maui.main.MauiTopicExtractor)
-l directory (directory with the data)
-m model (model file)
-v vocabulary (vocabulary name)
-f {skos|text} (vocabulary format)
-w database@server (wikipedia location)
Which class is used depends on the topic indexing mode. MauiModelBuilder is used when a topic indexing model is created from documents with existing topics. MauiTopicExtractor is used once a model has been created, to assign topics to new documents.

Examples with experimental data are supplied in the Maui package. The following commands refer to the directories with this data. They correspond to different topic indexing tasks:

1. Automatic tagging and keyphrase extraction - when topics are extracted from document text itself.
MauiModelBuilder -l data/automatic_tagging/train/
-m tagging_model
MauiTopicExtractor -l data/automatic_tagging/test/
-m tagging_model
2. Term assignment - when topics are taken from a controlled vocabulary in SKOS format.
MauiModelBuilder -l data/term_assignment/train/
-m assignment_model
-v agrovoc
-f skos
MauiTopicExtractor -l data/term_assignment/test/
-m assignment_model
-v agrovoc
-f skos
3. Topic indexing with Wikipedia - when topics are Wikipedia article titles. Note that in this case WikipediaMiner needs to be installed and running first.

MauiModelBuilder -l data/wikipedia_indexing/train/
-m indexing_model
-v wikipedia
-w enwiki@localhost
MauiTopicExtractor -l data/wikipedia_indexing/test/
-m indexing_model
-v wikipedia
-w enwiki@localhost
For terminology extraction, set the command-line argument -n to a high value to extract all possible candidate topics in the document.
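As mentioned above, Maui can also be used from Java code. A minimal sketch of one way to do this, assuming the two classes accept the same arguments through their main methods that the command line passes to them:

// Sketch only: running the automatic tagging example from Java by handing
// the same command-line arguments to Maui's entry points.
public class RunMaui {

    public static void main(String[] args) throws Exception {
        maui.main.MauiModelBuilder.main(new String[] {
                "-l", "data/automatic_tagging/train/", "-m", "tagging_model"});
        maui.main.MauiTopicExtractor.main(new String[] {
                "-l", "data/automatic_tagging/test/", "-m", "tagging_model"});
    }
}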

Thursday, June 4, 2009

Maui will be presented at EMNLP'09

The paper Human-competitive tagging using automatic keyphrase extraction, co-authored with Eibe Frank and Ian Witten, was accepted for the EMNLP conference (Conference on Empirical Methods in Natural Language Processing) in Singapore this August.

In this paper, we evaluate the tagging quality of the users of the collaborative platform CiteULike using traditional methods. Inter-indexer consistency measures how well human taggers agree with their co-taggers on what tags should be assigned to a particular document. The higher the agreement, the higher the usefulness of the assigned tags.

The consistency of CiteULike's taggers resembles a power-law distribution, with a few users achieving high consistency values and a long tail of inconsistent taggers. On average, the taggers are 19% consistent with each other. Based on the consistency values, we identify a group of best-performing taggers and use them as a gold standard to evaluate an automatic tagging technique. There are 35 such taggers, who achieve an average consistency of 38%.

Next, we apply the keyphrase extraction algorithm Kea and the topic indexing algorithm Maui to the documents tagged by these users and compare the automatically assigned tags to those assigned by humans. Maui's consistency with all taggers is 24%, 5 percentage points higher than that of the humans. Its consistency with the best taggers is 35%: slightly less than their consistency with each other, but still better than that of 17 of the 35 taggers.

The approach is a combination of machine learning with statistical and linguistic analysis of the words and phrases that appear in the documents. Maui also uses the online encyclopaedia Wikipedia as a knowledge base for computing semantic information about words.

Interested? Read the full paper: Human-competitive tagging using automatic keyphrase extraction