<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7311448782714063415</id><updated>2011-12-02T14:40:29.798+13:00</updated><title type='text'>Topic indexing blog</title><subtitle type='html'>For everything related to keyword extraction, keyphrase extraction, term assignment, automatic tagging, subject indexing, terminology extraction.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>17</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-7922600934465810133</id><published>2010-07-06T18:05:00.002+12:00</published><updated>2010-07-24T14:47:41.477+12:00</updated><title type='text'>Demo of term assignment &amp; keyphrase extraction with Maui</title><content type='html'>It's been a while since my last post on this blog, but in the meantime 
Maui wasn't put to rest.&lt;br /&gt;&lt;br /&gt;The most important news is that there is a new &lt;a href="http://maui-indexer.appspot.com/"&gt;demo of Maui on Google AppEngine&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The main purpose of this demo is to show how Maui assigns terms from controlled vocabularies to documents. (This task is &lt;a href="http://maui-indexer.blogspot.com/2009/07/what-do-subject-indexing-keyphrase.html"&gt;similar to text categorization&lt;/a&gt; with a large number of categories.) The documents can be in text, Microsoft Word, or PDF format, and two vocabularies are available to choose from: physics and agriculture.&lt;br /&gt;&lt;br /&gt;The demo also shows how Maui extracts keywords. In this case, Maui was trained on 180 Computer Science documents used in the SemEval-2010 keyphrase extraction track.&lt;br /&gt;&lt;br /&gt;More information on this demo can be found in my recent publication &lt;a href="http://www.medelyan.com/files/MAUIdemoforJCDL2010.pdf?attredirects=0"&gt;Subject Metadata Support Powered by Maui&lt;/a&gt;. It was co-authored with &lt;a href="http://www.cs.waikato.ac.nz/%7Eihw"&gt;Ian H. Witten&lt;/a&gt; and &lt;a href="http://twitter.com/vye"&gt;Vye Perrone&lt;/a&gt; and presented at the Joint Conference on Digital Libraries in Australia last month.&lt;br /&gt;&lt;br /&gt;Some technical notes: AppEngine has a few restrictions that prevent me from demonstrating Maui's full functionality. For example, very large vocabularies cannot be uploaded to appspot, although they can be used if the demo runs locally. 
Also, the way Wikipedia is used in Maui is not suitable for the AppEngine framework.</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/7922600934465810133/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2010/07/demo-of-term-assignment-keyphrase.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/7922600934465810133'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/7922600934465810133'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2010/07/demo-of-term-assignment-keyphrase.html' title='Demo of term assignment &amp; keyphrase extraction with Maui'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-2349043895798099399</id><published>2010-01-24T15:28:00.002+13:00</published><updated>2010-01-24T15:45:23.080+13:00</updated><title type='text'>New version of Maui available</title><content type='html'>This weekend I put together &lt;a href="http://code.google.com/p/maui-indexer/downloads/list"&gt;Maui 1.2&lt;/a&gt;, a new version that incorporates changes requested by several users, as well as some of their great ideas.&lt;br /&gt;&lt;br /&gt;Here is a list of changes in Maui 1.2:&lt;br 
/&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Input files&lt;/span&gt; are now read using the &lt;a href="http://commons.apache.org/io/"&gt;&lt;span class="il"&gt;Apache&lt;/span&gt; Commons IO package&lt;/a&gt;. This makes reading the data around 10 times faster and also saves many lines of code.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Vocabularies&lt;/span&gt; are now stored in gzip format (as &lt;span style="font-style: italic;"&gt;*.gz&lt;/span&gt;) and are read in using GZIPInputStream. This saves a lot of space, because the SKOS format tends to repeat the same characters over and over. In fact, the vocabularies are now so small that I could easily include them in the distribution. The SKOS files in &lt;span style="font-style: italic;"&gt;data/vocabularies&lt;/span&gt; were created by cutting out all information irrelevant to Maui from the original files and compressing them.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Stopwords&lt;/span&gt; are now initialized from a supplied file rather than a hard-coded list. Users can thus supply their own stopwords and blacklisted terms.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;I wrote a new class, &lt;span style="font-weight: bold;"&gt;MauiWrapper&lt;/span&gt;, that shows how to apply Maui to a single text file or a text string. Another new class, &lt;span style="font-weight: bold;"&gt;MauiWrapperFactory&lt;/span&gt;, shows how to use MauiWrapper with several vocabularies at the same time. 
These classes make it easy to create web services that use Maui to identify the main topics in text supplied by the web client.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Finally, I have also generated a few models in the &lt;span style="font-style: italic;"&gt;data/models&lt;/span&gt; directory for those who don't have their own training data.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;span class="gI"&gt;&lt;span class="go"&gt;Thanks to Florian Ilgenfritz at Txtr&lt;/span&gt;&lt;/span&gt; and &lt;span class="il"&gt;Jose&lt;/span&gt; R. Pérez-Agüera at UNC for their helpful suggestions!</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/2349043895798099399/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2010/01/new-version-of-maui-available.html#comment-form' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/2349043895798099399'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/2349043895798099399'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2010/01/new-version-of-maui-available.html' title='New version of Maui available'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-6965746672983304594</id><published>2009-12-01T16:19:00.012+13:00</published><updated>2009-12-03T16:43:25.968+13:00</updated><title type='text'>Analyzing text? Think beyond words!</title><content type='html'>In the &lt;a href="http://maui-indexer.blogspot.com/2009/11/on-tag-clouds-phrase-nets-stream-graphs.html"&gt;previous post&lt;/a&gt;, I discussed how words that appear in a document can be visualized using&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;word nets&lt;/span&gt;, where words are nodes and their co-occurrences are links;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;tag clouds&lt;/span&gt;, where font size indicates word frequency, but no relations are shown;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;phrase nets&lt;/span&gt;, which show how words are connected by patterns like "&lt;span style="font-style: italic;"&gt;X of Y&lt;/span&gt;";&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;stream graphs&lt;/span&gt;, which show how word usage changes over time.&lt;/li&gt;&lt;/ul&gt;All these approaches concentrate on words rather than meaningful entities (concepts, named entities, semantic relations).&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Using single words is problematic, but there are some solutions:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Problem 1: Ambiguity&lt;/span&gt;. Single words are limited in what they can express, e.g. &lt;span style="font-style: italic;"&gt;python&lt;/span&gt; can mean the animal or the programming language. 
Where relations are not visible, the meaning gets lost.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Possible solution:&lt;/span&gt; &lt;span&gt;Think in relations rather than single words.&lt;/span&gt; Semantic relations must be shown to disambiguate the meaning. They can be derived from corpora using co-occurrence statistics, or from vocabularies, thesauri, and ontologies.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Problem 2: Synonymy.&lt;/span&gt; Words that mean the same thing are often visualized as several entities (&lt;span style="font-style: italic;"&gt;US&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;U.S., usa&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;u.s.a.&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;unitedstates&lt;/span&gt;), which again neglects the meaning.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Possible solution:&lt;/span&gt; &lt;span&gt;Think in concepts rather than single words.&lt;/span&gt; The minimum solution for working with English is to strip plural and possessive endings. Look into &lt;a href="http://en.wikipedia.org/wiki/Stemming"&gt;stemming&lt;/a&gt; and implement normalization routines. The more advanced solution is to use synonyms. Controlled vocabularies are constructed manually and help link synonymous words. Synonymy can also be computed statistically.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Problem 3: Compounds.&lt;/span&gt; Many approaches tokenize text at white space, which corrupts the meaning of compound phrases and multi-word expressions. 
&lt;span style="font-style: italic;"&gt;New Zealand&lt;/span&gt; becomes &lt;span style="font-style: italic;"&gt;zealand&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;new&lt;/span&gt;; &lt;span style="font-style: italic;"&gt;data mining&lt;/span&gt; turns into &lt;span style="font-style: italic;"&gt;mining&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;data&lt;/span&gt;; &lt;span style="font-style: italic;"&gt;hot dog&lt;/span&gt; into &lt;span style="font-style: italic;"&gt;dog&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;hot&lt;/span&gt;.&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Possible solution:&lt;/span&gt; &lt;span&gt;Think in &lt;/span&gt;&lt;span style="font-style: italic;"&gt;n&lt;/span&gt;&lt;span&gt;-grams rather than single words. &lt;/span&gt;When splitting a text into elements, be careful where you split. If two- and three-word phrases (&lt;span style="font-style: italic;"&gt;2&lt;/span&gt;-grams and &lt;span style="font-style: italic;"&gt;3&lt;/span&gt;-grams) are included in the analysis, their scores need to be boosted, since a phrase cannot occur more often than the words it consists of and would otherwise be outranked by them. Alternatively, one can use vocabularies of named entities and domain-relevant phrases, or even named entity extraction algorithms.&lt;br /&gt;&lt;br /&gt;There are many more solutions than described here, and it is hard to judge which solution is ideal. Statistical methods rely on the availability of text collections, whereas vocabularies and thesauri are bound to be incomplete and are not always available. A universal tool that would solve the problems of single words with a minimum of requirements is yet to be implemented. 
If you know of one, let me know and I will post it here.</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/6965746672983304594/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/12/analyzing-text-think-beyond-words.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/6965746672983304594'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/6965746672983304594'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/12/analyzing-text-think-beyond-words.html' title='Analyzing text? Think beyond words!'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-1128690002181237698</id><published>2009-11-23T18:16:00.020+13:00</published><updated>2009-12-03T16:09:37.883+13:00</updated><title type='text'>Tag clouds, phrase nets, stream graphs &amp; co</title><content type='html'>Once the key concepts in a document are known, they can be combined into a more meaningful representation than just a list. 
This blog post describes different methods for visualizing a document's topics and their semantic relations.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;TextArc creates nets of words&lt;/span&gt;&lt;br /&gt;&lt;a href="http://textarc.org/"&gt;TextArc&lt;/a&gt; was invented in 2002. The constraints of visualization tools at that time perhaps explain its rather simplistic approach. It first extracts individual words that appear in a text and stems them. Then it computes the co-occurrences of word pairs and plots them in a very large circular graph, where important words appear in the center and less important ones lie outside the circle. Co-occurring words are placed next to each other. The graph is interactive: if a user clicks on a word, its connections to other words are activated; see the example below from Alice in Wonderland. However, without activating the words, it is pretty much impossible to read them, and the visual groupings are unclear.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://textarc.org/images/alice5.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 288px; height: 224px;" src="http://3.bp.blogspot.com/_dYCJl7JAgN4/Sww001LNw-I/AAAAAAAAG84/jVoSlsr_KP8/s320/Picture+1.png" alt="TextArc for Alice in Wonderland" id="BLOGGER_PHOTO_ID_5407755334797083618" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Tag clouds are widespread but far from perfect&lt;/span&gt;&lt;br /&gt;Tag clouds were first mentioned in the literature later: &lt;a href="http://scholar.google.co.nz/scholar?hl=en&amp;amp;q=textarc&amp;amp;um=1&amp;amp;ie=UTF-8&amp;amp;sa=N&amp;amp;tab=ws"&gt;according to Google Scholar, in 2003&lt;/a&gt;. Flickr, launched in 2004, successfully applied tag clouds to the problem of image search and categorization. 
The simplicity of this model proved infectious, and tag clouds spread quickly. However, tag clouds are &lt;a href="http://www.zeldman.com/daily/0505a.shtml"&gt;often criticized&lt;/a&gt; for their absence of structure and lack of meaning. There have been many attempts to improve tag clouds. &lt;a href="http://www.wordle.net/"&gt;Wordle&lt;/a&gt; creates typographically perfect images of tag clouds. Search for "tag clouds" on Google Scholar without the year restriction, and you will learn how to automatically cluster tags or add hierarchy to a folksonomy.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;PhraseNets show which topics are connected by a given relation&lt;/span&gt;&lt;br /&gt;&lt;a href="http://manyeyes.alphaworks.ibm.com/manyeyes/page/Phrase_Net.html"&gt;PhraseNet&lt;/a&gt; is this year's answer to TextArc and tag clouds. It concentrates on specific co-occurrence patterns, such as "&lt;span style="font-style: italic;"&gt;X begat Y&lt;/span&gt;", or a more generic one like "&lt;span style="font-style: italic;"&gt;X of Y&lt;/span&gt;". As in conventional tag clouds, frequent terms are shown in a larger font. Words matched to the position of &lt;span style="font-style: italic;"&gt;X&lt;/span&gt; are colored dark blue and those matched to &lt;span style="font-style: italic;"&gt;Y&lt;/span&gt; light blue. A customized graph layout algorithm prevents words from overlapping to ensure readability. Compared to conventional tag clouds, PhraseNet adds meaning to the graphs. 
The graphs below demonstrate the topical shift between the Old and the New Testaments in the Bible.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_dYCJl7JAgN4/Sws-GmBg9CI/AAAAAAAAG8s/AC9JUmd8qq0/s1600/oldtestament_phrasenet.bmp"&gt;&lt;img style="margin: 0pt 10px 10px 0pt; cursor: pointer; width: 320px; height: 237px;" src="http://4.bp.blogspot.com/_dYCJl7JAgN4/Sws-GmBg9CI/AAAAAAAAG8s/AC9JUmd8qq0/s320/oldtestament_phrasenet.bmp" alt="PhraseNet for the Old Testament" id="BLOGGER_PHOTO_ID_5407484060595450914" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_dYCJl7JAgN4/Sws2sekH4MI/AAAAAAAAG8c/WX2FXLmR7Wo/s1600/newtestament_phrasenet.bmp"&gt;&lt;img style="margin: 0px auto 10px; display: block; cursor: pointer; width: 320px; height: 237px;" src="http://2.bp.blogspot.com/_dYCJl7JAgN4/Sws2sekH4MI/AAAAAAAAG8c/WX2FXLmR7Wo/s320/newtestament_phrasenet.bmp" alt="PhraseNet for the New Testament" id="BLOGGER_PHOTO_ID_5407475915335131330" border="0" /&gt;&lt;/a&gt;An interesting finding of the PhraseNet researchers was that using a dependency parser does not produce more meaningful results than simple pattern matching. They used the Stanford Parser, which required over 24 hours to process 1 MB of text.&lt;br /&gt;&lt;br /&gt;Note that PhraseNets visualize relations rather than concepts. For example, PhraseNet can clearly describe the &lt;span style="font-style: italic;"&gt;possessive relation&lt;/span&gt; with one simple pattern "&lt;span style="font-style: italic;"&gt;X's Y&lt;/span&gt;", or the location relation with "&lt;span style="font-style: italic;"&gt;X at Y&lt;/span&gt;". 
However, it cannot help with tasks such as &lt;span style="font-style: italic;"&gt;visualize everything that relates to a given concept in a text&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;how are two concepts related in text&lt;/span&gt;. It also misses the notion of discourse in the text. The &lt;a href="http://researchweb.watson.ibm.com/visual/papers/phrase-net-rev5.pdf"&gt;full paper&lt;/a&gt; contains more examples, as well as a discussion of how the tool has been applied.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Topical differences in multiple texts can be visualized by combining the clouds&lt;/span&gt;&lt;br /&gt;Tag clouds can also be meaningfully combined within a single visualization, as proposed in the study "&lt;a href="http://blog.blprnt.com/blog/blprnt/two-sides-of-the-same-story-laskas-gladwell-on-cte-the-nfl"&gt;The two sides of the same story: Laskas &amp;amp; Gladwell on CTE &amp;amp; the NFL&lt;/a&gt;". This time the discourse is taken into account: the algorithm positions the words vertically by their average position in the text. Horizontally, the words are shifted to the left or the right side of the graph, depending on the text in which they occur more often. Words that are similarly frequent in both texts appear in the middle. The frequency is, as usual, expressed with varying font size.&lt;br /&gt;This tool is not particularly universal, but for comparing two similar texts it works well. Using additional dimensions, one could perhaps compare three or even four texts, but adding further dimensions would hurt the readability and usefulness of the tool. 
Instead, a different kind of graph can be used for &lt;a href="http://www.munterbund.de/visualisierung_textaehnlichkeiten/essay.php#Introduction"&gt;visual comparison of topics discussed in several texts&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Stream graphs demonstrate how topics evolve over time&lt;/span&gt;&lt;br /&gt;Stream graphs were originally designed to visualize how data changes over time, e.g. &lt;a href="http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html"&gt;movie revenues&lt;/a&gt; or &lt;a href="http://www.leebyron.com/what/lastfm/"&gt;musical taste&lt;/a&gt;. An approach called &lt;a href="http://infoviz.pnl.gov/research_themeriver.stm"&gt;ThemeRiver&lt;/a&gt; plots how important topics change over time in the underlying documents; its authors then compare &lt;a href="http://infoviz.pnl.gov/images/themeriver675.gif"&gt;how these changes correspond to political events of that time&lt;/a&gt;. There is also &lt;a href="http://www.neoformix.com/Projects/TwitterStreamGraphs/view.php"&gt;an interesting demo of stream graphs of the latest Twitter posts&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://neoformix.com/2009/ObamaUNSpeechStreamGraph.html"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 320px; height: 242px;" src="http://3.bp.blogspot.com/_dYCJl7JAgN4/Sww1ttKiGkI/AAAAAAAAG9A/A8QbMTJkoR0/s320/Picture+2.png" alt="StreamGraph of Obama's speech" id="BLOGGER_PHOTO_ID_5407756311899281986" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Time flow is similar to text flow, or discourse&lt;/span&gt;&lt;br /&gt;Alternatively, time can also be expressed as text flow in a single document. One can then use stream graphs to show which topics are mentioned at the beginning, at the end, or throughout the document. 
The above figure is an attempt to represent &lt;a href="http://neoformix.com/2009/ObamaUNSpeechStreamGraph.html"&gt;the main topics in Obama's speech and their discourse flow&lt;/a&gt;. Stream graphs provide a suitable framework for visualizing discourse, but their potential is still unexplored.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:130%;"&gt;Splitting text into single words is harmful&lt;/span&gt;&lt;br /&gt;The above tools use individual words as topic descriptors. A significant disadvantage is that some concepts are expressed as two or more words: &lt;span style="font-style: italic;"&gt;hot dog&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;New York&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;Google Wave&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;Auckland University of Technology&lt;/span&gt; etc. Split these items into words and each one conveys a completely different meaning.&lt;br /&gt;&lt;br /&gt;The following posts will discuss other disadvantages of single words and describe approaches that visualize concepts and phrases.</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/1128690002181237698/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/11/on-tag-clouds-phrase-nets-stream-graphs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/1128690002181237698'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/1128690002181237698'/><link rel='alternate' type='text/html' 
href='http://maui-indexer.blogspot.com/2009/11/on-tag-clouds-phrase-nets-stream-graphs.html' title='Tag clouds, phrase nets, stream graphs &amp; co'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_dYCJl7JAgN4/Sww001LNw-I/AAAAAAAAG84/jVoSlsr_KP8/s72-c/Picture+1.png' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-3461516725558862308</id><published>2009-10-19T19:50:00.008+13:00</published><updated>2009-10-20T11:49:28.580+13:00</updated><title type='text'>How researchers and software engineers can make use of data in Wikipedia</title><content type='html'>Wikipedia is not just an online encyclopedia where people can look up, quickly and free of charge, definitions written by other people. It is also a very powerful resource for developing useful tools for analyzing written language. Just &lt;a href="http://scholar.google.com/scholar?q=wikipedia&amp;amp;hl=en&amp;amp;btnG=Search"&gt;search for Wikipedia on Google Scholar&lt;/a&gt; and you will find over 140,000 research papers! This post is about practical applications of the data stored in Wikipedia.&lt;br /&gt;&lt;br /&gt;At workshops on Wikipedia-related research held at prestigious AI conferences, e.g. &lt;a href="http://lit.csci.unt.edu/%7Ewikiai08/index.php/Main_Page"&gt;WikiAI at IJCAI&lt;/a&gt; and &lt;a href="http://www.ukp.tu-darmstadt.de/acl-ijcnlp-2009-workshop/"&gt;People's Web meets NLP at ACL&lt;/a&gt;, I have learned about pretty amazing things one can implement using Wikipedia data. 
They range from computing semantic relatedness between words at a level comparable to humans to converting folksonomies of tags into ontologies. In general, the organizers distinguish between how Wikipedia can help AI in solving language-related tasks and how AI can help Wikipedia to improve its quality and fight vandalism.&lt;br /&gt;Let's concentrate on the former.&lt;br /&gt;&lt;br /&gt;A major barrier here is that Wikipedia is huge (and getting bigger!). There are tools available for processing the Wikipedia dumps, but the best one I have found so far is the open source Java library &lt;a href="http://wikipedia-miner.sourceforge.net/"&gt;Wikipedia Miner&lt;/a&gt;. And not just because it was developed at the University of Waikato, where I studied. The reasons are its easy installation, an intuitive object-oriented model for accessing Wikipedia data, additional tools for computing similarity between any two English phrases and for wikifying any text (i.e. linking its phrases to concepts explained in Wikipedia), and even ready-made web services. Check out the online demos:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a style="font-weight: bold;" href="http://wdm.cs.waikato.ac.nz:8080/service?task=search"&gt;Search&lt;/a&gt;. If you search for a word like &lt;a href="http://wdm.cs.waikato.ac.nz:8080/service?task=search&amp;amp;term=palm"&gt;&lt;span style="font-style: italic;"&gt;palm&lt;/span&gt;&lt;/a&gt;, it will list all the possible meanings, starting with the most likely one (&lt;a href="http://wdm.cs.waikato.ac.nz:8080/service?task=search&amp;amp;id=45715&amp;amp;term=Arecaceae"&gt;Arecaceae&lt;/a&gt; - 65% likelihood), followed by all other meanings, like &lt;span style="font-style: italic;"&gt;Palm (PDA)&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;hand palm&lt;/span&gt;. Clicking on a meaning shows the words used to refer to it, e.g. 
redirects like &lt;span style="font-style: italic;"&gt;Palmtree&lt;/span&gt; and anchors like &lt;span style="font-style: italic;"&gt;palm wood&lt;/span&gt;, as well as translations (language links) and broader terms (categories).&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://wdm.cs.waikato.ac.nz:8080/service?task=compare"&gt;&lt;span style="font-weight: bold;"&gt;Compare&lt;/span&gt;&lt;/a&gt;. Here you can work out, for example, that &lt;a href="http://wdm.cs.waikato.ac.nz:8080/service?task=compare&amp;amp;details=true&amp;amp;term1=wine&amp;amp;term2=vodka"&gt;&lt;span style="font-style: italic;"&gt;vodka&lt;/span&gt; is more related to &lt;span style="font-style: italic;"&gt;wine&lt;/span&gt;&lt;/a&gt; &lt;a href="http://wdm.cs.waikato.ac.nz:8080/service?task=compare&amp;amp;details=true&amp;amp;term1=water&amp;amp;term2=vodka"&gt;than to &lt;span style="font-style: italic;"&gt;water&lt;/span&gt;&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://wdm.cs.waikato.ac.nz:8080/service?task=wikify"&gt;&lt;span style="font-weight: bold;"&gt;Wikify&lt;/span&gt;&lt;/a&gt;. This feature helps find out what concepts are discussed in any text or on any website. It is particularly practical for texts with many named entities, such as news articles, but not only for those. Here is &lt;a href="http://wdm.cs.waikato.ac.nz:8080/wikifier/index.jsp?source=http%3A%2F%2Fmaui-indexer.blogspot.com%2F&amp;amp;linkColor=rgb%280%2C0%2C255%29&amp;amp;minProbability=0.1"&gt;this blog wikified with Wikipedia Miner&lt;/a&gt; (the links are added in blue and at the highest possible density).&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Note&lt;/span&gt;: all this information can be easily accessed from your application when using the Wikipedia Miner library. 
I've done it in my &lt;a href="http://code.google.com/p/maui-indexer/"&gt;topic indexing tool Maui&lt;/a&gt; and can tell you that it's relatively easy.&lt;br /&gt;&lt;br /&gt;Many other people are already using Wikipedia Miner (100 downloads in the last 6 months). It has also been referenced as a research tool in &lt;a href="http://scholar.google.co.nz/scholar?start=0&amp;amp;q=%22wikipedia+miner%22&amp;amp;hl=en"&gt;various published projects&lt;/a&gt;, including adding structure to search queries, finding similar videos, extending ontologies, creating language learning tools and many more.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-3461516725558862308?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/3461516725558862308/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/10/how-researchers-and-software-engineers.html#comment-form' title='8 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/3461516725558862308'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/3461516725558862308'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/10/how-researchers-and-software-engineers.html' title='How researchers and software engineers can make use of data in Wikipedia'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>8</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-2637077122772446303</id><published>2009-10-18T14:56:00.002+13:00</published><updated>2009-10-18T15:15:49.389+13:00</updated><title type='text'>Data sets for keyphrase extraction and topic indexing</title><content type='html'>A few months ago I made available &lt;a href="http://code.google.com/p/maui-indexer/downloads/list"&gt;several collections&lt;/a&gt; that researchers can use to develop and evaluate their systems for tasks related to topic indexing. But I didn't blog about it! In the meantime people still found and downloaded the data. Here is a summary of the collections in order of their popularity.&lt;br /&gt;&lt;br /&gt;The most popular collection is &lt;span style="font-weight: bold;"&gt;Wiki-20&lt;/span&gt;. It contains 20 computer science documents, each with main topics assigned independently by 15 teams of graduate students. So, each document has 15 sets of Wikipedia articles that represent the main topics in it, according to the students.&lt;br /&gt;This data set was downloaded 19 times. &lt;br /&gt;&lt;br /&gt;There is also a corpus that can be used for keyphrase extraction and tagging with 180 documents indexed by human taggers. It has been harvested from the data on &lt;a href="http://www.citeulike.org/"&gt;CiteULike&lt;/a&gt; and includes only high quality tags. This data set (&lt;span style="font-weight: bold;"&gt;CiteULike-180&lt;/span&gt;) was downloaded 15 times.&lt;br /&gt;&lt;br /&gt;There are two corpora for keyphrase indexing, i.e. 
assignment of terms from controlled vocabularies:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;FAO-780&lt;/span&gt; with 780 documents indexed by just one indexer (9 downloads).&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;FAO-30&lt;/span&gt; with 30 documents indexed independently by 6 professional indexers each (so far 8 downloads).&lt;/li&gt;&lt;/ol&gt;This data was provided by the Food and Agriculture Organization of the United Nations. Agrovoc serves as the vocabulary, but any other &lt;a href="http://www.w3.org/2004/02/skos/"&gt;vocabulary in SKOS format&lt;/a&gt; can be used in a similar way.&lt;br /&gt;&lt;br /&gt;These download counts suggest that traditional term assignment is not a popular research topic any more. Tagging and, particularly, indexing with terms from Wikipedia are researched more actively. Great to see that Wikipedia gets the attention it deserves and that my idea of using it as a controlled vocabulary was picked up by others.&lt;br /&gt;&lt;br /&gt;More information is also available on the &lt;a href="http://code.google.com/p/maui-indexer/wiki/MultiplyIndexedData"&gt;wiki page explaining the multiply indexed data sets&lt;/a&gt; and on a page listing other &lt;a href="http://code.google.com/p/maui-indexer/wiki/Resources"&gt;resources for topic indexing&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-2637077122772446303?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/2637077122772446303/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/10/data-sets-for-keyphrase-extraction-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/7311448782714063415/posts/default/2637077122772446303'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/2637077122772446303'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/10/data-sets-for-keyphrase-extraction-and.html' title='Data sets for keyphrase extraction and topic indexing'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-3863926984441837801</id><published>2009-07-29T12:21:00.002+12:00</published><updated>2009-07-29T12:31:35.965+12:00</updated><title type='text'>Wiki pages about topic indexing</title><content type='html'>Here is a list of updates on Maui's google code page, which will hopefully make it easier for others to use Maui and experiment with its data sets:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/maui-indexer/downloads/list"&gt;Download Maui&lt;/a&gt; page now lists the new version (1.1) and several corpora that can be used for creating topic indexing models and testing.&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/maui-indexer/wiki/Installation"&gt;A step-by-step installation guide&lt;/a&gt; shows how to download, install and use Maui.&lt;/li&gt;&lt;li&gt;There is also &lt;a href="http://code.google.com/p/maui-indexer/wiki/Examples"&gt;a full page of examples of automatically generated topics&lt;/a&gt;.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/maui-indexer/wiki/MultiplyIndexedData"&gt;Multiply indexed data&lt;/a&gt; wiki page explains three data sets with topics assigned to the 
same document by multiple people. These data sets are very useful for evaluation. This page also uses a simple example to explain how to measure inter-indexer consistency.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/maui-indexer/wiki/Resources"&gt;Resources for keyphrase extraction and term assignment&lt;/a&gt; lists further useful data sets.&lt;/li&gt;&lt;/ul&gt;Let me know if something is missing.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-3863926984441837801?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/3863926984441837801/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/wiki-pages-about-topic-indexing.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/3863926984441837801'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/3863926984441837801'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/wiki-pages-about-topic-indexing.html' title='Wiki pages about topic indexing'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-7976785700276288410</id><published>2009-07-16T00:08:00.005+12:00</published><updated>2009-07-16T10:43:06.534+12:00</updated><title type='text'>Updated release and French term assignment 
example</title><content type='html'>Turns out I had two pending issues on Google Code, where I host the &lt;a href="http://maui-indexer.googlecode.com/"&gt;Maui algorithm&lt;/a&gt;. By default, the project owner does not get a notification!&lt;br /&gt;&lt;br /&gt;So today I went ahead and fixed one of the requests: to have everything in a jar file.&lt;br /&gt;I've also updated the example files and added Javadoc documentation. Soon I will publish detailed installation instructions (in addition to the &lt;a href="http://maui-indexer.blogspot.com/2009/06/how-to-use-maui.html"&gt;usage instructions&lt;/a&gt;), but for now here is just one command line example. It shows how to create a topic indexing model and apply it to a new document, using term assignment with French documents as the example. &lt;a href="http://code.google.com/p/maui-indexer/downloads/list"&gt;Download the latest release of Maui (1.1)&lt;/a&gt; and then try this:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-family:courier new;" &gt;java -Xmx1024m -classpath maui-1.0.jar  maui.main.FrenchExample&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;If the Java classpath is not yet linked to Maui's libraries, add this after &lt;span style="font-family:courier new;"&gt;maui-1.0.jar&lt;/span&gt;:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;font-family:courier new;" &gt;":lib/weka.jar:lib/wikipediaminer1.1.jar:lib/trove.jar:lib/jena.jar:lib/icu4j_3_4.jar:lib/iri.jar:lib/xercesImpl.jar:lib/snowball.jar:lib/mysql-connector-java-3.1.13-bin.jar:lib/maxent-2.4.0.jar:lib/commons-logging.jar"&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;The output should be something like:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-- Building the model... 
&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;--- Loading the vocabulary...&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;--- Building the Vocabulary index from the SKOS file...&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;...&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;-- Reading the input documents... &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;...&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;--- Computing candidates...&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;...&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;--- Building classifier&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;...&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;-- Extracting keyphrases... &lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;-- Keyphrases and feature values:&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;http://www.fao.org/aos/agrovoc#c_4830,'Produit laitier',0,0,0.003276,0.00335,...,False&lt;/span&gt;&lt;span style="font-family:courier new;"&gt;&lt;br /&gt;http://www.fao.org/aos/agrovoc#c_7848,Commerce,0,0,0.003348,...,2,True&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;http://www.fao.org/aos/agrovoc#c_3919,'Commerce international',...,True&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;http://www.fao.org/aos/agrovoc#c_4826,Lait,...,False&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;http://www.fao.org/aos/agrovoc#c_8288,Volume,...,False&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;http://www.fao.org/aos/agrovoc#c_25201,Usine,...,False&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier 
new;"&gt;http://www.fao.org/aos/agrovoc#c_714,Australie,...,False&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;http://www.fao.org/aos/agrovoc#c_8323,'Besoin en eau',...,False&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-- 2.0 correct&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;-- Evaluation results based on 1 document:&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Avg. number of correct keyphrases per document: 2 +/- 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Precision: 25 +/- 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;Recall: 13.33 +/- 0&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:courier new;"&gt;F-Measure: 17.39&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For each test document (in this case, just one is used), Maui outputs its Agrovoc ID, name of the concept (e.g. &lt;span style="font-style: italic;"&gt;Besoin en eau&lt;/span&gt;), some feature values and True or False, depending on whether this term as been assigned to this document by a human indexer. Based on these values evaluation is performed. Because the directory already contains a &lt;span style="font-style: italic;"&gt;.key&lt;/span&gt; file, Maui does not override it, otherwise it would create one with automatically generated topics.&lt;br /&gt;&lt;br /&gt;This is of course just a demonstration: after training on just two documents and testing on a third one. But at least it shows (I hope!) 
how simple Maui's usage can be.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-7976785700276288410?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/7976785700276288410/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/updated-release-and-french-term.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/7976785700276288410'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/7976785700276288410'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/updated-release-and-french-term.html' title='Updated release and French term assignment example'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-7349938960699377205</id><published>2009-07-13T16:57:00.006+12:00</published><updated>2009-07-14T22:06:20.424+12:00</updated><title type='text'>Useful web resources related to automatic topic indexing</title><content type='html'>&lt;span style="font-weight: bold;"&gt;Tools (in alphabetical order):&lt;br /&gt;&lt;/span&gt;for keyword and keyphrase extraction, tagging, autotagging, terminology extraction, term assignment, text classification, text categorization, topical metadata extraction and topic indexing with Wikipedia&lt;ul&gt;&lt;li&gt;&lt;a 
href="http://invenio-demo.cern.ch/help/admin/bibclassify-admin-guide"&gt;Bibclassify&lt;/a&gt; - A module in CDS Invenio (CERN’s document server software) for automatic assignment of terms from SKOS vocabularies, developed on the High Energy Physics vocabulary. Developed in the collaboration between CERN and DESY. There is also a &lt;a href="http://cdsware.cern.ch/tmp/bibclassify/hacking.html"&gt;hacking guide&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.extractor.com/"&gt;Extractor&lt;/a&gt; - Commercial software for keyword extraction in different languages. There is also a &lt;a href="http://www.extractorlive.com/on_line_demo.html"&gt;demo&lt;/a&gt;. Developed at the National Research Council of Canada.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.nzdl.org/Kea"&gt;Keyphrase extraction algorithm Kea&lt;/a&gt;. Can be used for both automatic keyphrase extraction and term assignment with controlled vocabularies. Developed at the University of Waikato.&lt;br /&gt;&lt;br /&gt;&lt;span style="text-decoration: underline;"&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://maui-indexer.googlecode.com/"&gt;Multi-purpose topic indexing algorithm Maui&lt;/a&gt;. Suitable for automatic term assignment, subject indexing, keyword extraction, keyphrase extraction, indexing with Wikipedia, autotagging, terminology extraction. Developed at the University of Waikato. 
Maui is also available on &lt;a href="http://maui-indexer.sourceforge.net/"&gt;sourceforge&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.nactem.ac.uk/software/termine/"&gt;TerMine&lt;/a&gt; - a term extraction tool developed at the &lt;a href="http://www.nactem.ac.uk/index.php"&gt;National Centre for Text Mining&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://pypi.python.org/pypi/topia.termextract/1.1.0"&gt;Topia term extractor&lt;/a&gt; - Part-of-speech and frequency based term extraction tool implemented in python. Here is a &lt;a href="http://fivefilters.org/term-extraction/"&gt;term extraction demo&lt;/a&gt; based on this tool implemented by &lt;a href="http://fivefilters.org/"&gt;FiveFilters.org&lt;br /&gt;&lt;br /&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.alchemyapi.com/api/keyword/"&gt;Orchestr8 Keyword Extraction&lt;/a&gt; - An API based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.wikifyer.com/"&gt;Wikifier&lt;/a&gt; – An online demo of detecting Wikipedia articles in text developed at the Language and Information Technologies research group at the University of North Texas.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://wikipedia-miner.sourceforge.net/"&gt;Wikipedia Miner&lt;/a&gt; – An API for accessing Wikipedia data, which also provides a tool for mapping any document to a set of relevant Wikipedia articles, similar to indexing with Wikipedia. Developed at the University of Waikato. 
&lt;a href="http://wdm.cs.waikato.ac.nz:8080/service?task=wikify"&gt;Demo 1&lt;/a&gt; and &lt;a href="http://wdm.cs.waikato.ac.nz:8080/wikifier/"&gt;demo 2&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://seokeywordanalysis.com/seotools/"&gt;SEO keyword extraction&lt;/a&gt; - An online keyword and keyphrase extraction tool for search engine optimization&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.oclc.org/research/software/scorpion/default.htm"&gt;Scorpion&lt;/a&gt; – OCLC’s tool for automatic classification of documents.&lt;br /&gt;&lt;br /&gt;&lt;span style="text-decoration: underline;"&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://tagthe.net/"&gt;Tagthe.net&lt;/a&gt; – A demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://developer.yahoo.com/search/content/V1/termExtraction.html"&gt;Yahoo term extraction&lt;/a&gt; - Web-service based content analysis via term extraction, includes a demo.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;Vocabularies in SKOS format and test data&lt;/span&gt;:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://id.loc.gov/authorities/search/"&gt;Library of Congress Subject Headings LSCH&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://thesauri.cs.vu.nl/eswc06/mesh/rdf/meshdata.rdf"&gt;Medical Subject Headings thesaurus MeSH&lt;/a&gt;&lt;/li&gt;&lt;li&gt;FAO’s agricultural thesaurus Agrovoc: &lt;a href="http://www.fao.org/agrovoc/"&gt;general info&lt;/a&gt; and &lt;a href="http://aims.fao.org/en/website/Download/sub"&gt;download site&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="http://aims.fao.org/en/website/Knowledge-Organization-Systems-%28KOS%29/sub"&gt;List of other SKOS thesauri at FAO&lt;/a&gt; &lt;/li&gt;&lt;li&gt;&lt;a 
href="http://invenio-demo.cern.ch/help/hacking/bibclassify-hep-taxonomy"&gt;DESY’s High Energy Physics HEP thesaurus&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://esw.w3.org/topic/SkosDev/DataZone"&gt;W3C’s list of SKOS thesauri&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/maui-indexer/wiki/MultiplyIndexedData"&gt;Maui’s datasets&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://aye.comp.nus.edu.sg/downloads/keyphraseCorpus/"&gt;Keyphrase extraction data set&lt;br /&gt;&lt;br /&gt;&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;More resources&lt;/span&gt;:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://ii.nlm.nih.gov/"&gt;NLM Indexing Initiative&lt;/a&gt; – Website about National Library of Medicine’s project on automatic indexing using MeSH terms. Research details, evaluation and examples.&lt;/li&gt;&lt;li&gt;&lt;a href="http://dublincore.org/tools/"&gt;Dublin Core tools&lt;/a&gt; – A list of tools for automatic extraction of Dublin Core metadata&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.asindexing.org/site/software.shtml"&gt;ASI resources&lt;/a&gt; – List of commercial software for back-of-the-book indexing by American Society of Indexing &lt;/li&gt;&lt;li&gt;&lt;a href="http://www.anzsi.org/site/software.asp"&gt;ANZSI resources&lt;/a&gt; – List of software tools provided by the Australian and New Zealand Society of Indexing&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-7349938960699377205?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/7349938960699377205/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/useful-web-resources-related-to.html#comment-form' title='4 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/7349938960699377205'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/7349938960699377205'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/useful-web-resources-related-to.html' title='Useful web resources related to automatic topic indexing'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-6091378973501237304</id><published>2009-07-08T14:37:00.005+12:00</published><updated>2009-07-16T10:43:48.659+12:00</updated><title type='text'>What do subject indexing, keyphrase extraction and autotagging have in common? Terminology clarification</title><content type='html'>There has been a lot of confusion about tasks related to topic indexing. Here is an overview of these tasks, terms used to refer to them and what they stand for.&lt;br /&gt;&lt;ol&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Text categorization&lt;/span&gt;    (or: text classification) -  Very few general categories, like &lt;span style="font-style: italic;"&gt;Politics&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;News&lt;/span&gt;, are assigned usually from a relatively small vocabulary.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Term assignment &lt;/span&gt;(or: subject indexing) -  Document's main topics are expressed using terms from a large vocabulary, e.g. 
a domain-specific thesaurus.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Keyphrase extraction&lt;/span&gt; (or: keyword extraction, key term extraction) - Document's main topics are expressed using the most prominent words and phrases in a document.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Terminology extraction&lt;/span&gt; (similar to back-of-the-book indexing) - All domain relevant words and phrases are extracted from a document.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Full-text indexing&lt;/span&gt; (or: full indexing, free text indexing) - All words and phrases, sometimes excluding the stopwords, are extracted from a document.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Keyphrase indexing&lt;/span&gt;  (or: keyphrase assignment) - A general term, which refers to both term assignment and keyphrase extraction.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Tagging&lt;/span&gt; (or: collaborative tagging, social tagging and when performed automatically: autotagging, automatic tagging) - The user defines as many topics as desired. Any word or phrase can serve as a tag. 
Prevalently applied on collaborative websites.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-weight: bold;"&gt;Clustering&lt;/span&gt; is related to topic indexing in that it identifies groups of documents on the same topic; however, these groups are unlabeled.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-6091378973501237304?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/6091378973501237304/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/what-do-subject-indexing-keyphrase.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/6091378973501237304'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/6091378973501237304'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/07/what-do-subject-indexing-keyphrase.html' title='What do subject indexing, keyphrase extraction and autotagging have in common? 
Terminology clarification'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-6825098983157669994</id><published>2009-06-28T11:49:00.002+12:00</published><updated>2009-07-02T00:16:29.048+12:00</updated><title type='text'>How to generate visualizations of topics with Dotty and GraphViz</title><content type='html'>In a previous post, I have shown &lt;a href="http://maui-indexer.blogspot.com/2009/06/document-keywords-from-wikipedia.html"&gt;Wikipedia article titles that Maui assigned to the introduction of my thesis&lt;/a&gt;. To create a nice visualization of these topics, instead of the less expressive tag clouds, I used &lt;a href="http://wikipedia-miner.sourceforge.net/"&gt;WikipediaMiner&lt;/a&gt; and Dotty (via &lt;a href="http://www.graphviz.org/"&gt;GraphViz&lt;/a&gt; for Mac).&lt;br /&gt;&lt;br /&gt;I have written a script that generates a &lt;span style="font-style: italic;"&gt;.gv&lt;/span&gt; file with the following content:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;graph G {&lt;br /&gt;"Machine learning" -- "Keywords"[style=invis];&lt;br /&gt;"Machine learning" -- "Natural language" [penwidth = "3"];&lt;br /&gt;"Machine learning" -- "Knowledge" [penwidth = "2"];&lt;br /&gt;...&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;The nodes are the names of the topics and the links are expressed using semantic relatedness values computed using WikipediaMiner. 
The values are modified to reflect the line width in the generated graph:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;span style="font-style: italic;"&gt;style=invis&lt;/span&gt; if relatedness = 0;&lt;/li&gt;&lt;li&gt;otherwise &lt;span style="font-style: italic;"&gt;penwidth&lt;/span&gt; is used. &lt;/li&gt;&lt;/ul&gt;The number is the original relatedness value (between 0 and 1) multiplied by 10 and made into an integer.&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(204, 0, 0);"&gt;&lt;span style="font-weight: bold;"&gt;Update:&lt;/span&gt; Additionally the top keyphrase returned by the algorithm is defined as the root of the graph (e.g. &lt;span style="font-family:courier new;"&gt;graph [root="Index (search engine)"] &lt;/span&gt;for the graph below). Also the font size reflects the rank of the keyphrase as determined by the algorithm.&lt;/span&gt;&lt;br /&gt;&lt;pre&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.cs.waikato.ac.nz/%7Eolena/introduction.gif"&gt;&lt;img style="margin: 0px auto 10px; display: block; text-align: left; cursor: pointer; width: 423px; height: 328px;" src="http://www.cs.waikato.ac.nz/%7Eolena/introduction.gif" alt="" border="0" /&gt;&lt;/a&gt;&lt;/pre&gt; Click on the graph to see the &lt;a href="http://www.cs.waikato.ac.nz/%7Eolena/introduction.gif"&gt;graph in full resolution&lt;/a&gt;. 
Or view the complete &lt;a href="http://www.cs.waikato.ac.nz/%7Eolena/introduction.gv.txt"&gt;GraphViz script&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The beauty of GraphViz is that the generated graph can be exported into any format and expanded to any required size.&lt;br /&gt;&lt;br /&gt;Furthermore, the visualization can be generated for any list of topics, as long as they can be mapped to titles of Wikipedia articles:&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;// first check if there is an article with the title "topic"&lt;br /&gt;Article article = wikipedia.getArticleByTitle(topic);&lt;br /&gt;&lt;br /&gt;// otherwise retrieve the most likely mapping&lt;br /&gt;if (article == null) {&lt;br /&gt;CaseFolder cs = new CaseFolder();&lt;br /&gt;article = wikipedia.getMostLikelyArticle(topic,cs);&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The script for generating such graphs is included as a part of &lt;a href="http://maui-indexer.googlecode.com/"&gt;Maui software&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-6825098983157669994?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/6825098983157669994/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/how-to-generate-visualizations-of.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/6825098983157669994'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/6825098983157669994'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/how-to-generate-visualizations-of.html' title='How to generate visualizations of topics with Dotty and 
GraphViz'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-5694341646778585613</id><published>2009-06-27T16:01:00.000+12:00</published><updated>2009-06-27T17:43:19.353+12:00</updated><title type='text'>How often taggers agree with each other?</title><content type='html'>... or rather, how often did the taggers of my thesis agree with each other?&lt;br /&gt;&lt;br /&gt;Nine of my friends (all Computer Science graduates and IT professionals), who all helped me proofread my thesis &lt;span style="font-style: italic;"&gt;Human-competitive automatic topic indexing&lt;/span&gt;, each chose five tags that describe its main topics. Each of them was familiar with my work and had read parts of the thesis and the &lt;a href="http://www.medelyan.com/news/topic-indexing"&gt;abstract&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;General impression&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There was no tag on which all nine people agreed! Five of them picked &lt;span style="font-style: italic;"&gt;tagging&lt;/span&gt;, although this is only one of the three tasks that are addressed in the thesis. They struggled with compound words like &lt;span style="font-style: italic;"&gt;topic indexing&lt;/span&gt; (should it be just &lt;span style="font-style: italic;"&gt;indexing&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;topics&lt;/span&gt; or &lt;span style="font-style: italic;"&gt;document topics&lt;/span&gt;?) and with adjectives (should they be used as separate tags, e.g. 
&lt;span style="font-style: italic;"&gt;automatic&lt;/span&gt;, &lt;span style="font-style: italic;"&gt;statistical&lt;/span&gt;, or as modifiers of existing tags, e.g. &lt;span style="font-style: italic;"&gt;automatic tagging&lt;/span&gt;?).&lt;br /&gt;&lt;br /&gt;One person picked &lt;span style="font-style: italic;"&gt;controlled vocabularies&lt;/span&gt;, another &lt;span style="font-style: italic;"&gt;controlled-vocabulary&lt;/span&gt;. When comparing the results, I treated these two tags as the same. However, I did not do this for other tags that also represented the same thing but were expressed slightly differently, such as &lt;span style="font-style: italic;"&gt;topic indexing&lt;/span&gt; and &lt;span style="font-style: italic;"&gt;topic assignment&lt;/span&gt;. Overall, everyone agreed on the general topics but expressed them differently.&lt;br /&gt;&lt;br /&gt;Two tag clouds (same tags, different layouts) show all the tags assigned to the thesis:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;Tag cloud 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;div style="text-align: center;"&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;algorithm       automatic&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;automatic tagging&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;   &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;&lt;span style="font-size:100%;"&gt;automatic topic indexing&lt;/span&gt;&lt;br /&gt;auto-tagging&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;artificial intelligence&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span 
style="color: rgb(51, 51, 255);font-size:85%;" &gt;&lt;span style="font-size:100%;"&gt;computer science &lt;/span&gt;&lt;br /&gt;&lt;span style="font-size:100%;"&gt;controlled vocabularies&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;document categorization&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;document topics&lt;br /&gt;domain-specific knowledge&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;    encyclopedic knowledge&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;   human competitive&lt;br /&gt;human indexing&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;&lt;span style="font-size:180%;"&gt;indexing&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;indexing  methods&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;kea&lt;br /&gt;&lt;span style="font-size:100%;"&gt;keyphrase extraction&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;&lt;span style="font-size:130%;"&gt;machine learning&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;natural language processing&lt;br /&gt;&lt;span 
style="font-weight: bold;font-size:180%;" &gt;tagging&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;tag hierarchies&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;taxonomies&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;term assignment&lt;br /&gt;&lt;span style="font-size:100%;"&gt;topic indexing&lt;/span&gt;&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;topic assignment&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;topics        semantic&lt;br /&gt;supervised learning&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;statistical&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;       &lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;&lt;span style="font-size:180%;"&gt;wikipedia&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;Tag cloud 2&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="color: rgb(51, 51, 255); font-weight: bold;font-size:180%;" &gt;tagging&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:180%;" &gt;   wikipedia     indexing&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:130%;" &gt;    machine learning&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:100%;" &gt;&lt;br /&gt;topic indexing controlled vocabularies   keyphrase extraction&lt;br /&gt;computer science     automatic topic 
indexing&lt;/span&gt;&lt;span style="color: rgb(51, 51, 255);font-size:85%;" &gt;&lt;span style="color: rgb(51, 102, 255);"&gt; &lt;/span&gt; algorithm&lt;br /&gt;automatic       automatic tagging       auto-tagging           artificial intelligence&lt;br /&gt;document categorization       document topics       domain-specific knowledge&lt;br /&gt;encyclopedic knowledge       human competitive           human indexing&lt;br /&gt;indexing methods       kea       natural language processing       tag hierarchies&lt;br /&gt;taxonomies       term assignment    topic assignment       topics           semantic&lt;br /&gt;supervised learning           statistical&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;Consistency of taggers of my thesis &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Consistency analysis is a traditional way of assessing indexing quality (more on this below). I applied this metric to evaluate tags assigned to my thesis and here are the results:&lt;br /&gt;&lt;br /&gt;A  -  &lt;span style="font-style: italic;"&gt;22.5&lt;/span&gt;&lt;br /&gt;D  -  &lt;span style="font-style: italic;"&gt;16.1&lt;/span&gt;&lt;br /&gt;G  -  &lt;span style="font-style: italic;"&gt;20.3&lt;/span&gt;&lt;br /&gt;J  -  &lt;span style="font-style: italic;"&gt;2.5&lt;/span&gt;&lt;br /&gt;K  -  &lt;span style="font-style: italic;"&gt;27.8&lt;/span&gt;&lt;br /&gt;N  -  &lt;span style="font-style: italic;"&gt;15.0&lt;/span&gt;&lt;br /&gt;S  -  &lt;span style="font-style: italic;"&gt;5.4&lt;/span&gt;&lt;br /&gt;T1  -  &lt;span style="font-style: italic;"&gt;8.3&lt;/span&gt;&lt;br /&gt;T2  -  &lt;span style="font-style: italic;"&gt;29.6&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;The average consistency in this group is 16.4%, with the best tagger achieving nearly 30%.&lt;br /&gt;These results were based on only one document and therefore are only a guideline.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;About indexing consistency&lt;br 
/&gt;&lt;/span&gt;&lt;br /&gt;In traditional libraries, inter-indexer consistency analysis is used to assess how well professionals assign subject headings to library holdings. The higher the consistency, the better topic-based search in the catalog will be. This follows logically: if a librarian agrees with a colleague, he or she is also likely to agree with the patron.&lt;br /&gt;&lt;br /&gt;Admittedly, tagging is slightly different. Taggers, who assign tags for their own use, choose them based on personal preferences that might not be of use to others. But since tags are widely used by others, their quality is as important as that of subject headings in libraries.&lt;br /&gt;&lt;br /&gt;In one of the experiments &lt;a href="http://maui-indexer.blogspot.com/2009/06/maui-will-be-presented-at-emnlp09.html"&gt;reported earlier&lt;/a&gt;, I analyzed the consistency of taggers on &lt;a href="http://www.citeulike.org/"&gt;CiteULike&lt;/a&gt; and used it to evaluate automatic tags produced by &lt;a href="http://maui-indexer.googlecode.com/"&gt;Maui&lt;/a&gt;. The consistency of taggers varied from 0 to 91%, with an average of 18.5%. 
Thus, my friends performed about as well as an average CiteULike tagger.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-5694341646778585613?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/5694341646778585613/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/how-often-taggers-agree-with-each-other.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/5694341646778585613'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/5694341646778585613'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/how-often-taggers-agree-with-each-other.html' title='How often taggers agree with each other?'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-8812820765991491633</id><published>2009-06-26T23:01:00.000+12:00</published><updated>2009-06-26T23:16:39.932+12:00</updated><title type='text'>Document keywords from Wikipedia</title><content type='html'>Maui can be used for identifying the main topics in a document and expressing them in the form of well-formed titles of Wikipedia articles.&lt;br /&gt;&lt;br /&gt;To create a training model, I used a collection of &lt;a href="http://www.cs.waikato.ac.nz/%7Eolena/wikipedia.html"&gt;20 computer science papers indexed by 
graduate students&lt;/a&gt; in an indexing experiment for my AAAI paper "&lt;a href="http://www.cs.waikato.ac.nz/%7Eolena/publications/TopicIndexingWithWikipediaWIKIAI08.pdf"&gt;Topic indexing with Wikipedia&lt;/a&gt;".&lt;br /&gt;&lt;br /&gt;Then I applied this model to generate ten keywords and keyphrases for the Introduction chapter of my PhD thesis. I am very pleased with the results:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Index%20%28search%20engine%29"&gt;Index (search engine)&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Natural%20language"&gt;Natural language&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Keywords"&gt;Keywords&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Algorithm"&gt;Algorithm&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Index%20%28information%20technology%29"&gt;Index (information technology)&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Machine%20learning"&gt;Machine learning&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Training%20set"&gt;Training set&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Information%20retrieval"&gt;Information retrieval&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Dissertation"&gt;Dissertation&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://en.wikipedia.org/wiki/Knowledge"&gt;Knowledge&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;br /&gt;The best thing about keyphrase extraction with Wikipedia is that it links each term to a concept: the Wikipedia article that explains the term. 
Thus, the resulting keyphrase sets will be consistent: they will all refer to "Dissertation" and not "PhD thesis" or "Doctoral thesis", regardless of which term is actually used in the document to express this concept.&lt;br /&gt;&lt;br /&gt;Each topic can be directly linked to its Wikipedia article by adding "http://en.wikipedia.org/wiki/" in front of the term, as I did in the list above.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-8812820765991491633?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/8812820765991491633/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/document-keywords-from-wikipedia.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/8812820765991491633'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/8812820765991491633'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/document-keywords-from-wikipedia.html' title='Document keywords from Wikipedia'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-8105546156461040756</id><published>2009-06-26T21:11:00.000+12:00</published><updated>2009-06-26T21:30:40.304+12:00</updated><title type='text'>How to use Maui</title><content type='html'>Here are some usage instructions that 
are also published on &lt;a href="http://code.google.com/p/maui-indexer/wiki/Usage"&gt;Maui's wiki page&lt;/a&gt;.&lt;br /&gt;&lt;h2&gt;Preparing the data&lt;/h2&gt;Once Maui is installed, it can be used in two ways: from the command line or from Java code. Either way, the input data must be prepared first. The &lt;span style="font-style: italic;"&gt;data&lt;/span&gt; directory in Maui's download package contains some examples of input data.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;1. Formatting the document files.&lt;/span&gt;&lt;br /&gt;Each document has to be stored individually in text form in a file with the extension &lt;span style="font-style: italic;"&gt;.txt&lt;/span&gt;. Maui takes as input the name of the directory containing these files. If a model needs to be created first, the same directory must also contain the main topics assigned manually to each document.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;2. Formatting the topic files.&lt;/span&gt;&lt;br /&gt;Each topic set has to be stored in its own text file, one topic per line, with the same name as the document file but with the extension &lt;span style="font-style: italic;"&gt;.key&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;3. Maui's output.&lt;/span&gt;&lt;br /&gt;If Maui is used to generate main topics for new documents, it will create a &lt;span style="font-style: italic;"&gt;.key&lt;/span&gt; file for each document in the input directory. If &lt;span style="font-style: italic;"&gt;.key&lt;/span&gt; files already exist when topics are generated, the existing topics are used as a gold standard for evaluating the automatically extracted ones.&lt;br /&gt;&lt;h2&gt;Command line usage&lt;/h2&gt;Maui can be used directly from the command line. 
The general command is:&lt;br /&gt;&lt;pre&gt;java maui.main.MauiModelBuilder&lt;br /&gt;(or maui.main.MauiTopicExtractor)&lt;br /&gt;             -l directory   (directory with the data)&lt;br /&gt;             -m model       (model file)&lt;br /&gt;             -v vocabulary  (vocabulary name)&lt;br /&gt;             -f {skos|text} (vocabulary format)&lt;br /&gt;             -w database@server (wikipedia location)&lt;/pre&gt;Which class is used depends on the current mode of topic indexing.&lt;span style="font-weight: bold;"&gt; MauiModelBuilder&lt;/span&gt; is used to create a topic indexing model from documents with existing topics.&lt;span style="font-weight: bold;"&gt; MauiTopicExtractor&lt;/span&gt; is used once a model has been created, to assign topics to new documents.&lt;br /&gt;&lt;br /&gt;Examples with experimental data are supplied in the Maui package. The following commands refer to the directories with this data. They correspond to different topic indexing tasks:&lt;br /&gt;&lt;br /&gt;1. &lt;span style="font-weight: bold;"&gt;Automatic tagging&lt;/span&gt; and &lt;span style="font-weight: bold;"&gt;keyphrase extraction&lt;/span&gt; - when topics are extracted from the document text itself.&lt;br /&gt;&lt;pre&gt;MauiModelBuilder -l data/automatic_tagging/train/&lt;br /&gt;                -m tagging_model&lt;/pre&gt; &lt;pre&gt;MauiTopicExtractor -l data/automatic_tagging/test/&lt;br /&gt;                  -m tagging_model&lt;/pre&gt;2. 
&lt;span style="font-weight: bold;"&gt;Term assignment&lt;/span&gt; - when topics are taken from a controlled vocabulary in &lt;a href="http://esw.w3.org/topic/SkosDev/DataZone%20"&gt;SKOS&lt;/a&gt; format.&lt;br /&gt;&lt;pre&gt;MauiModelBuilder -l data/term_assignment/train/&lt;br /&gt;                -m assignment_model&lt;br /&gt;                -v agrovoc&lt;br /&gt;                -f skos&lt;/pre&gt; &lt;pre&gt;MauiTopicExtractor -l data/term_assignment/test/&lt;br /&gt;                  -m assignment_model&lt;br /&gt;                  -v agrovoc&lt;br /&gt;                  -f skos&lt;/pre&gt;3. &lt;span style="font-weight: bold;"&gt;Topic indexing with Wikipedia&lt;/span&gt; - when topics are Wikipedia article titles. Note that in this case &lt;a href="http://wikipedia-miner.sourceforge.net/"&gt;WikipediaMiner&lt;/a&gt; needs to be installed and running first.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;MauiModelBuilder -l data/wikipedia_indexing/train/&lt;br /&gt;                -m indexing_model&lt;br /&gt;                -v wikipedia&lt;br /&gt;                -w enwiki@localhost&lt;/pre&gt; &lt;pre&gt;MauiTopicExtractor -l data/wikipedia_indexing/test/&lt;br /&gt;                  -m indexing_model&lt;br /&gt;                  -v wikipedia&lt;br /&gt;                  -w enwiki@localhost&lt;/pre&gt;For &lt;span style="font-weight: bold;"&gt;terminology extraction&lt;/span&gt;, set the command line argument &lt;span style="font-style: italic;"&gt;-n&lt;/span&gt; to a high value to extract all possible candidate topics in the document.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-8105546156461040756?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/8105546156461040756/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://maui-indexer.blogspot.com/2009/06/how-to-use-maui.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/8105546156461040756'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/8105546156461040756'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/how-to-use-maui.html' title='How to use Maui'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-4519186005515089797</id><published>2009-06-04T21:35:00.000+12:00</published><updated>2009-06-27T16:08:59.731+12:00</updated><title type='text'>Maui will be presented at EMNLP'09</title><content type='html'>The paper &lt;b&gt;Human-competitive tagging using automatic keyphrase extraction&lt;/b&gt;, co-authored with &lt;a href="http://www.cs.waikato.ac.nz/%7Eeibe" rel="nofollow"&gt;Eibe Frank&lt;/a&gt; and &lt;a href="http://www.cs.waikato.ac.nz/%7Eihw" rel="nofollow"&gt;Ian Witten&lt;/a&gt;, was accepted for the &lt;a href="http://conferences.inf.ed.ac.uk/emnlp09/" rel="nofollow"&gt;EMNLP conference&lt;/a&gt; (Conference on Empirical Methods in Natural Language Processing) in Singapore this August.&lt;br /&gt;&lt;br /&gt;In this paper, we evaluate the tagging quality of the users of the collaborative platform &lt;a href="http://www.citeulike.org/" rel="nofollow"&gt;CiteULike&lt;/a&gt; using traditional methods. The inter-indexer consistency compares how well human taggers agree with their co-taggers on what tags should be assigned to a particular document. 
The higher the agreement, the higher the usefulness of the assigned tags.&lt;br /&gt;&lt;br /&gt;The consistency of CiteULike's taggers resembles a power-law distribution, with a few users achieving high consistency values and a long tail of inconsistent taggers. On average, the taggers are 19% consistent with each other. Based on the consistency values, we identify a group of best-performing taggers and use them as a gold standard to evaluate an automatic tagging technique. There are 35 such taggers, who achieve an average consistency of 38%.&lt;br /&gt;&lt;br /&gt;Next, we apply the &lt;a href="http://www.nzdl.org/Kea" rel="nofollow"&gt;keyphrase extraction algorithm Kea&lt;/a&gt; and the &lt;a href="http://code.google.com/p/maui-indexer/" rel="nofollow"&gt;topic indexing algorithm Maui&lt;/a&gt; to the documents tagged by these users and compare the automatically assigned tags to those assigned by humans. Maui's consistency with all taggers is 24%, 5 percentage points higher than that of humans. Maui's consistency with the best taggers is 35%, slightly less than their consistency with each other (38%), but still better than that of 17 of the 35 taggers.&lt;br /&gt;&lt;br /&gt;The approach is a combination of machine learning and statistical and linguistic analysis of words and phrases that appear in the documents. Maui also uses the online encyclopaedia Wikipedia as a knowledge base for computing semantic information about words.&lt;br /&gt;&lt;br /&gt;Interested? 
Read the full paper: &lt;a href="http://www.cs.waikato.ac.nz/~olena/publications/emnlp2009_maui.pdf"&gt;Human-competitive tagging using automatic keyphrase extraction&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-4519186005515089797?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/4519186005515089797/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/maui-will-be-presented-at-emnlp09.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/4519186005515089797'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/4519186005515089797'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/06/maui-will-be-presented-at-emnlp09.html' title='Maui will be presented at EMNLP&apos;09'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-5105588576306787383</id><published>2009-05-30T10:12:00.000+12:00</published><updated>2009-05-30T10:19:48.360+12:00</updated><title type='text'>What is Maui about?</title><content type='html'>The Maui topic indexing algorithm was created as a part of my PhD in Computer Science at the University of Waikato.&lt;br /&gt;The title of the thesis (currently in progress) is &lt;b&gt;&lt;br /&gt;&lt;br /&gt;&lt;span 
style="font-size:130%;"&gt;Human-competitive automatic topic indexing&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Here is its abstract, which sums up what the algorithm is about:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-family:georgia;" &gt;&lt;/span&gt;&lt;blockquote&gt;&lt;span style="font-weight: bold;font-family:georgia;" &gt;Topic indexing&lt;/span&gt;&lt;span style="font-family:georgia;"&gt; is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document’s topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt; Three kinds of indexing are investigated: &lt;/span&gt;&lt;span style="font-weight: bold;font-family:georgia;" &gt;term assignment&lt;/span&gt;&lt;span style="font-family:georgia;"&gt;, a task commonly performed by librarians, who select topics from a controlled vocabulary; &lt;/span&gt;&lt;span style="font-weight: bold;font-family:georgia;" &gt;tagging&lt;/span&gt;&lt;span style="font-family:georgia;"&gt;, a popular activity of web users, who choose topics freely; and a new method of &lt;/span&gt;&lt;span style="font-weight: bold;font-family:georgia;" &gt;keyphrase extraction&lt;/span&gt;&lt;span style="font-family:georgia;"&gt;, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. 
They are combined using a machine learning algorithm that models human indexing behavior from examples.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-family:georgia;"&gt; This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is “human-competitive” because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.&lt;/span&gt;&lt;/blockquote&gt;&lt;span style="font-family:georgia;"&gt;&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-5105588576306787383?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/5105588576306787383/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/05/what-is-maui-about.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/5105588576306787383'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/5105588576306787383'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/05/what-is-maui-about.html' title='What is Maui about?'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' height='32' 
src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7311448782714063415.post-2150507074911583382</id><published>2009-05-19T16:30:00.000+12:00</published><updated>2009-05-19T16:34:49.958+12:00</updated><title type='text'>Maui - first file release!</title><content type='html'>Today I submitted the first version of Maui to &lt;a href="http://maui-indexer.sourceforge.net/"&gt;SourceForge&lt;/a&gt; and am currently adding it to &lt;a href="http://code.google.com/p/maui-indexer/"&gt;Google Code&lt;/a&gt;. In a later post I am planning to write about what Maui does, as well as about my experience with hosting and versioning an open source project on both platforms. But for now, this is the first post on this blog.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7311448782714063415-2150507074911583382?l=maui-indexer.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://maui-indexer.blogspot.com/feeds/2150507074911583382/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://maui-indexer.blogspot.com/2009/05/maui-first-file-release.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/2150507074911583382'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7311448782714063415/posts/default/2150507074911583382'/><link rel='alternate' type='text/html' href='http://maui-indexer.blogspot.com/2009/05/maui-first-file-release.html' title='Maui - first file release!'/><author><name>Alyona</name><uri>http://www.blogger.com/profile/06542955006622791521</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='28' 
height='32' src='http://4.bp.blogspot.com/_dYCJl7JAgN4/SiBf-vIniBI/AAAAAAAAGRs/4rTxTVrfTwE/S220/Picture+3.png'/></author><thr:total>0</thr:total></entry></feed>
