At workshops held at prestigious AI conferences on Wikipedia-related research, e.g. WikiAI at IJCAI and People's Web meets NLP at ACL, I have learned about pretty amazing things one can implement using Wikipedia data. From computing semantic relatedness between words at a level comparable to humans to converting folksonomies of tags into ontologies. In general, the organizers make a differentiation between how Wikipedia can help AI in solving language-related tasks and how AI can help Wikipedia to improve its quality and fight vandalism.
Let's concentrate on the former.
A huge barrier here is that Wikipedia is huge (and gets bigger!). There are tools available for processing the Wikipedia dumps, but the best working I found so far is the open source Java library Wikipedia Miner. Not just because it was developed at the University of Waikato, where I studied. The reasons are its easy installation, an intuitive object-oriented model for accessing Wikipedia data, as well as additional tools for computing similarity between any two English phrases, wikifying any text (i.e. link its phrases to concepts explained in Wikipedia) and even implemented web services. Check out the online demos:
- Search. If you search for a word like palm, it will list all the possible meanings, starting with the most likely one (Arecaceae - 65% likelihood) and all other meanings, like Palm (PDA) and hand palm. Clicking on a meaning shows the words used to refer to it, e.g. redirects like Palmtree and anchors like palm wood, as well as translations (language links) and broader terms (categories).
- Compare. Here you can calculate things like vodka is more related to wine than to water.
- Wikify. This feature helps finding out what concepts are discussed in any text or on any website. Very practical particularly for texts with many named entities, e.g. news articles, but not only. Here is this blog wikified with Wikipedia Miner (the links are added in blue and at the highest possible density).
Many other people already are using Wikipedia Miner (100 downloads in the last 6 month). It has also been referenced as research tool in various published projects, including adding structure to search queries, finding similar videos, extending ontologies, creating language learning tools and many more.