Wednesday, July 8, 2009

What do subject indexing, keyphrase extraction and autotagging have in common? Terminology clarification

There has been a lot of confusion about tasks related to topic indexing. Here is an overview of these tasks, the terms used to refer to them, and what they stand for.
  1. Text categorization (or: text classification) - A few general categories, such as Politics or News, are assigned, usually from a relatively small vocabulary.

  2. Term assignment (or: subject indexing) - A document's main topics are expressed using terms from a large vocabulary, e.g. a domain-specific thesaurus.

  3. Keyphrase extraction (or: keyword extraction, key term extraction) - A document's main topics are expressed using the most prominent words and phrases in the document.

  4. Terminology extraction (similar to back-of-the-book indexing) - All domain-relevant words and phrases are extracted from a document.

  5. Full-text indexing (or: full indexing, free text indexing) - All words and phrases, sometimes excluding stopwords, are extracted from a document.

  6. Keyphrase indexing (or: keyphrase assignment) - A general term that refers to both term assignment and keyphrase extraction.

  7. Tagging (or: collaborative tagging, social tagging; when performed automatically: autotagging, automatic tagging) - The user defines as many topics as desired. Any word or phrase can serve as a tag. Mostly used on collaborative websites.

  8. Clustering is related to topic indexing in that it identifies groups of documents on the same topic; however, these groups are unlabeled.
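
To make the distinction between three of the tasks above more concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions, not how any particular indexing system works: the stopword list, the `VOCABULARY` mapping, and the function names are all hypothetical, and plain frequency counting stands in for the more elaborate ranking that real keyphrase extractors use.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to", "for", "from", "on"}

# Hypothetical controlled vocabulary: surface form -> preferred term.
VOCABULARY = {
    "keyphrase extraction": "Keyphrase extraction",
    "tagging": "Social tagging",
    "thesaurus": "Thesauri",
}


def tokenize(text):
    """Lowercase the text and split it into words (hyphenated words stay whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def full_text_index(text):
    """Full-text indexing: keep every distinct word except stopwords."""
    return sorted({t for t in tokenize(text) if t not in STOPWORDS})


def extract_keyphrases(text, max_len=2, top_n=3):
    """Keyphrase extraction: the most frequent n-grams found in the text itself.

    Candidates may not start or end with a stopword. Raw frequency is a
    stand-in for a real ranking model.
    """
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return [phrase for phrase, _ in counts.most_common(top_n)]


def assign_terms(text):
    """Term assignment: map the text onto preferred terms of the vocabulary.

    Matching is on whole words, so "autotagging" does not match "tagging".
    """
    normalized = " " + " ".join(tokenize(text)) + " "
    return sorted({
        preferred
        for surface, preferred in VOCABULARY.items()
        if " " + surface + " " in normalized
    })


doc = ("Keyphrase extraction selects phrases from the document. "
       "Keyphrase extraction is related to tagging.")
print(full_text_index(doc))
print(extract_keyphrases(doc))
print(assign_terms(doc))
```

The point of the sketch is the source of the output: `extract_keyphrases` can only return phrases that occur in the text, whereas `assign_terms` returns preferred terms from the vocabulary, which may be worded differently from the text.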


  1. I would like some opinions regarding keyphrase extraction.
    (1) Is a keyphrase more useful with prepositions and other parts of speech included, or without them?

    In search we do not use these additional parts, but with them the keyphrase looks more natural.

    (2) What is the maximum suitable or acceptable length of a keyphrase, so that it serves all purposes, i.e. searching, indexing, etc.?

  2. Hi! Sorry for the late reply.
    Regarding 1) There are keyphrases which sound better with a preposition, but they are rare.
    E.g. "donation to charity", "price of anarchy".
    Sometimes the noun itself contains a preposition, e.g. "peer-to-peer system".

    2) I would say for English it's 4 words, but it can differ depending on the domain.

  3. What is some good clustering software? Thanks