The paper Human-competitive tagging using automatic keyphrase extraction, co-authored with Eibe Frank and Ian Witten, was accepted for the EMNLP conference (Conference on Empirical Methods in Natural Language Processing) in Singapore this August.
In this paper, we evaluate the tagging quality of users of the collaborative platform CiteULike using traditional indexing metrics. Inter-indexer consistency measures how well human taggers agree with their co-taggers on which tags should be assigned to a particular document; the higher the agreement, the more useful the assigned tags.
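As a rough illustration of the idea, here is a minimal sketch of inter-indexer consistency for a pair of taggers, assuming Rolling's classic measure (twice the number of shared tags divided by the total number of tags both assigned); the actual evaluation setup in the paper may differ in detail:

```python
def consistency(tags_a, tags_b):
    """Rolling's inter-indexer consistency between two tag sets:
    2 * |A ∩ B| / (|A| + |B|). Returns a value between 0 and 1."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Two taggers who share two tags out of three and four assigned:
score = consistency({"nlp", "tagging", "keyphrases"},
                    {"nlp", "tagging", "folksonomy", "web"})
print(round(score, 3))  # → 0.571
```

Averaging this value over all co-tagger pairs for a document gives a per-tagger consistency score like the percentages quoted below.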
The consistency of CiteULike's taggers follows a power-law distribution: a few users achieve high consistency values, followed by a long tail of inconsistent taggers. On average, the taggers are 19% consistent with each other. Based on these consistency values, we identify a group of best-performing taggers and use them as a gold standard for evaluating an automatic tagging technique. There are 35 such taggers, who achieve an average consistency of 38%.
Next, we apply the keyphrase extraction algorithm Kea and the topic indexing algorithm Maui to the documents tagged by these users and compare the automatically assigned tags with those assigned by humans. Maui's consistency with all taggers is 24%, 5 percentage points higher than that of the humans themselves. Its consistency with the best taggers is 35%: slightly below their consistency with each other, but still better than that of 17 of the 35 taggers.
The approach combines machine learning with statistical and linguistic analysis of the words and phrases that appear in the documents. Maui also uses the online encyclopaedia Wikipedia as a knowledge base for computing semantic information about words.
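To give a flavour of the statistical side of this, the toy sketch below extracts candidate n-grams from a document and ranks them by two features that Kea-style keyphrase extraction famously relies on, term frequency and position of first occurrence. It is purely illustrative: the real systems add TF×IDF, Wikipedia-derived semantic features, and a trained machine-learning model on top.

```python
import re
from collections import Counter

def rank_candidates(text, max_len=3):
    """Toy candidate ranking: extract all n-grams up to max_len words,
    then rank by frequency (higher is better) and first-occurrence
    position (earlier is better). Illustrative only, not Maui itself."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts, first_pos = Counter(), {}
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            if i + n > len(words):
                break
            phrase = " ".join(words[i:i + n])
            counts[phrase] += 1
            first_pos.setdefault(phrase, i / len(words))
    return sorted(counts, key=lambda p: (-counts[p], first_pos[p]))

ranked = rank_candidates(
    "Tagging quality can be measured by tagging consistency. "
    "Consistency compares tagging choices across taggers.")
print(ranked[0])  # → tagging
```

In Kea and Maui these per-candidate features feed a classifier that predicts which candidates a human indexer would choose as tags.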
Interested? Read the full paper: Human-competitive tagging using automatic keyphrase extraction