Content Taxonomy
by Jose Saura

Introduction
We use the term Content Taxonomy to refer to the methods that help us classify data. Scientists have been classifying organisms for a long time and we can learn quite a bit from their methods.
Something that is clear is that there is no single method for classifying things. In the past, scientists relied on anatomical features: for example, how many leaves a plant has, whether the leaves are serrated or smooth, or whether an animal has vertebrae. Today they rely on DNA analysis. As technology and our knowledge of the things we classify evolve, so does the taxonomy.
So we need a system that allows us to support multiple, independent taxonomies and that allows us to modify the taxonomy.
When a scientist finds a plant in the middle of the forest and wants to classify it, she will go through a set of rules, or key, that helps identify the right categories. The key might be expressed as a series of questions: depending on how one question is answered, a new set of questions needs to be answered, and this goes on recursively until a final category is found.
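The recursive walk through a key can be sketched in a few lines of Python. This is only an illustration: the Question type, the identify function, the questions and the group names are all made up for the example and are not part of any real identification key.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Question:
    """A step in the key: a question plus where each answer leads."""
    text: str
    branches: Dict[str, "Node"]  # answer -> next question or final category


# A node is either another question or a final category name.
Node = Union[Question, str]


def identify(node: Node, answers: Dict[str, str]) -> str:
    """Follow the key recursively until a final category is reached."""
    if isinstance(node, str):
        return node
    answer = answers[node.text]
    return identify(node.branches[answer], answers)


# Made-up key: the questions and group names are placeholders.
key = Question(
    "Are the leaves serrated or smooth?",
    {
        "serrated": "Group A",
        "smooth": Question(
            "Does the plant flower?",
            {"yes": "Group B", "no": "Group C"},
        ),
    },
)

category = identify(key, {
    "Are the leaves serrated or smooth?": "smooth",
    "Does the plant flower?": "yes",
})
print(category)
```

Each answer either ends the search or selects the next question, which is the same shape as walking from the root of a taxonomy tree down to a leaf.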
The taxonomy can be represented as a tree where each node is a category that gets progressively more specific as we move towards the leaf nodes.
Animalia (Animals)
  + Chordata (Chordates)
    + Vertebrata (Vertebrates)
      + Mammalia (Mammals)
        + Theria
          + …
            + Delphinidae (dolphins, whales...)
      + Reptilia (reptiles)
        + Testudines (terrapins, tortoises, turtles)
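As a rough sketch, a tree like this can be held in memory as nodes that know their parent and their children. The Category class and its methods below are hypothetical names chosen for the illustration, and the levels elided with "…" in the outline are simply left out.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional


@dataclass(eq=False)
class Category:
    """One node of a taxonomy tree (hypothetical structure)."""
    name: str
    parent: Optional["Category"] = field(default=None, repr=False)
    children: Dict[str, "Category"] = field(default_factory=dict)

    def add(self, name: str) -> "Category":
        """Return the child with this name, creating it if needed."""
        if name not in self.children:
            self.children[name] = Category(name, parent=self)
        return self.children[name]

    def path(self) -> str:
        """Backslash-separated path from the root down to this node."""
        parts: List[str] = []
        node: Optional[Category] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "\\".join(reversed(parts))

    def leaves(self) -> Iterator["Category"]:
        """Yield every leaf at or below this node."""
        if not self.children:
            yield self
        for child in self.children.values():
            yield from child.leaves()


root = Category("Animalia")
vertebrata = root.add("Chordata").add("Vertebrata")
vertebrata.add("Mammalia").add("Theria")
testudines = vertebrata.add("Reptilia").add("Testudines")

print(testudines.path())
print([leaf.name for leaf in root.leaves()])
```

Categories get more specific with every step away from the root, and the path of a node is just the chain of its ancestors.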
When dealing with content we can use a similar method; that is, we associate a leaf node of a content taxonomy tree with a particular content element. One key difference is that, depending on how granular our taxonomy is, it might make sense to classify something as belonging to more than one group. For example, an article about Microsoft and IBM will need to be classified under both Microsoft and IBM. We might have a taxonomy that looks like this:
+ Business
  + US
    + Corporations
      + IBM
      + Microsoft
So in our example we will associate our article with both Business\US\Corporations\IBM and also with Business\US\Corporations\Microsoft.
The system also needs to let us retrieve content tagged with a specific category (such as Business\US\Corporations\Microsoft), with a set of categories, or with a category ancestor. For example, Business\US\Corporations should identify all content tagged with a leaf under this group.
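One simple way to picture tagging and retrieval is an index from category path to content identifiers, where an ancestor query is a prefix match on the path. The sketch below assumes that approach. The TagIndex class, its method names and the article identifiers are invented for the example, and a real index would not scan every path on each query.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

SEP = "\\"


class TagIndex:
    """Maps category paths to the content tagged with them (illustrative only)."""

    def __init__(self) -> None:
        self._by_category: Dict[str, Set[str]] = defaultdict(set)

    def tag(self, content_id: str, paths: Iterable[str]) -> None:
        """Associate one content element with one or more leaf categories."""
        for path in paths:
            self._by_category[path].add(content_id)

    def find(self, category: str) -> Set[str]:
        """Content tagged with `category` itself or with any leaf below it."""
        prefix = category + SEP  # the separator keeps "Corp" from matching "Corporations"
        hits: Set[str] = set()
        for path, ids in self._by_category.items():
            if path == category or path.startswith(prefix):
                hits |= ids
        return hits

    def find_any(self, categories: Iterable[str]) -> Set[str]:
        """Content matching at least one category in a set."""
        hits: Set[str] = set()
        for category in categories:
            hits |= self.find(category)
        return hits


index = TagIndex()
index.tag("article-1", [
    "Business\\US\\Corporations\\IBM",
    "Business\\US\\Corporations\\Microsoft",
])
index.tag("article-2", ["Business\\US\\Corporations\\IBM"])

print(index.find("Business\\US\\Corporations\\Microsoft"))
print(index.find("Business\\US\\Corporations"))
```

The first query returns only the article tagged with Microsoft, while the ancestor query returns both articles.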
As mentioned before, we need to be able to somehow modify the categorization. If at some point a company moves its headquarters overseas then we need to be able to change the category.
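One possible way to keep such changes cheap, shown here only as an assumption and not as a description of the actual system, is to tag content with a stable category identifier instead of a full path. Moving a category then changes a single parent link and leaves the tagged content untouched. The identifiers, the "Overseas" branch and the company name ExampleCo are all made up.

```python
from typing import Dict, List, Optional, Set, Tuple

# category id -> (name, parent id). Ids, "Overseas" and "ExampleCo" are invented.
categories: Dict[int, Tuple[str, Optional[int]]] = {
    1: ("Business", None),
    2: ("US", 1),
    3: ("Corporations", 2),
    4: ("ExampleCo", 3),
    5: ("Overseas", 1),
    6: ("Corporations", 5),
}

# content id -> category ids; content never stores the full path.
content_tags: Dict[str, Set[int]] = {"article-1": {4}}


def path_of(category_id: int) -> str:
    """Rebuild the current path of a category from its parent links."""
    parts: List[str] = []
    current: Optional[int] = category_id
    while current is not None:
        name, parent = categories[current]
        parts.append(name)
        current = parent
    return "\\".join(reversed(parts))


def move(category_id: int, new_parent_id: int) -> None:
    """Re-parent a category, refusing moves that would create a cycle."""
    current: Optional[int] = new_parent_id
    while current is not None:
        if current == category_id:
            raise ValueError("cannot move a category under itself")
        current = categories[current][1]
    name, _ = categories[category_id]
    categories[category_id] = (name, new_parent_id)


before = path_of(4)
move(4, 6)
after = path_of(4)
print(before)
print(after)
```

After the move, article-1 is still tagged with category 4, but that category now resolves to a path under the Overseas branch.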
The most tedious and complex part of the content classification process is finding the right leaf node(s) for a piece of content. Although humans are great at this task, it would be too expensive to manually classify large volumes of data.
There are algorithms available to analyze and extract entities from content, and these algorithms are often tailored to a specific problem or domain. Rather than using one specific algorithm, we need to be able to plug in different classifiers based on specific needs. This will allow us to concentrate first on creating the right infrastructure to support content classification without relying on a specific technique.
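A pluggable design can be sketched as a small interface that every classifier implements, so the infrastructure never depends on one technique. The Classifier protocol, the KeywordClassifier class and the classify_all function below are hypothetical names, and keyword matching stands in for whatever real algorithm would be plugged in.

```python
from typing import Dict, Iterable, Protocol, Set


class Classifier(Protocol):
    """Anything that maps a piece of text to a set of category paths."""

    def classify(self, text: str) -> Set[str]:
        ...


class KeywordClassifier:
    """Toy classifier: a category applies if its keyword appears in the text."""

    def __init__(self, keywords: Dict[str, str]) -> None:
        # keyword (lowercased) -> category path
        self._keywords = {word.lower(): path for word, path in keywords.items()}

    def classify(self, text: str) -> Set[str]:
        lowered = text.lower()
        return {path for word, path in self._keywords.items() if word in lowered}


def classify_all(text: str, classifiers: Iterable[Classifier]) -> Set[str]:
    """Run every plugged-in classifier and merge the categories they select."""
    categories: Set[str] = set()
    for classifier in classifiers:
        categories |= classifier.classify(text)
    return categories


companies = KeywordClassifier({
    "IBM": "Business\\US\\Corporations\\IBM",
    "Microsoft": "Business\\US\\Corporations\\Microsoft",
})

found = classify_all("Microsoft and IBM announce a partnership", [companies])
print(sorted(found))
```

Swapping in a different technique only requires another object with a classify method; the code that calls classify_all does not change.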
Requirements
Based on what we covered so far, we can say that the categorization system should allow us to:
- Represent several hierarchical groups of categories.
- Modify the hierarchy at any given time.
- Associate multiple categories from one or more hierarchical groups to content.
- Retrieve content based on an expression that might include one or more categories or ancestors of categories. Example: Query(\art\painting AND (\people\artist\Picasso OR \people\artist\Van Gogh))
- Utilize multiple independent content classifiers.
- Allow for manual and/or automatic classification.
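To make the query requirement concrete, here is a minimal sketch of how an expression such as the one in the example could be evaluated against the categories attached to a piece of content. The Cat, And and Or types and the matches function are invented for the illustration, the expression is built by hand instead of being parsed, and the leaf \art\painting\cubism is a made-up category.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass(frozen=True)
class Cat:
    """A category or category ancestor, given by its path."""
    path: str


@dataclass(frozen=True)
class And:
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Or:
    left: "Expr"
    right: "Expr"


Expr = Union[Cat, And, Or]


def matches(expr: Expr, tags: Set[str]) -> bool:
    """True if the content's category paths satisfy the expression."""
    if isinstance(expr, Cat):
        prefix = expr.path + "\\"
        # A tag matches the category itself or any descendant of it.
        return any(tag == expr.path or tag.startswith(prefix) for tag in tags)
    if isinstance(expr, And):
        return matches(expr.left, tags) and matches(expr.right, tags)
    if isinstance(expr, Or):
        return matches(expr.left, tags) or matches(expr.right, tags)
    raise TypeError(f"unsupported expression: {expr!r}")


query = And(
    Cat("\\art\\painting"),
    Or(Cat("\\people\\artist\\Picasso"), Cat("\\people\\artist\\Van Gogh")),
)

# "\art\painting\cubism" is a hypothetical leaf under \art\painting.
picasso_article = {"\\art\\painting\\cubism", "\\people\\artist\\Picasso"}
painting_only = {"\\art\\painting"}

print(matches(query, picasso_article))
print(matches(query, painting_only))
```

The first article satisfies both sides of the AND, one of them through an ancestor match; the second fails because neither artist is attached to it.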
Implementation
How it works in egooge:
Please refer to the diagram above that depicts the major components of the classification system.
When a document is added to the store it might or might not already contain categories selected by an editor. The document storage service sends a notification to the crawler that a document has changed.
The crawler asks the rendering pipeline to transform the content using a custom, indexing-specific XSL transform.
The XSL transform is designed specifically for this document type. It knows how to extract the relevant parts of the content and which custom classifiers to invoke via XSL extension functions.
The classifiers select specific categories from the taxonomy tree based on business rules specific to the content type and classifier.
The resulting XML document adheres to a specific schema understood by the content index and contains, among other elements, the content categories from the taxonomy tree.
The index receives the document and adds it, at which point it becomes available to the query service.
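The flow above can be approximated in a few lines of Python. This is a simplified stand-in and not the actual implementation: the real system uses an XSL transform with extension functions, whereas this sketch builds the index document directly with the standard library. The element names (indexDocument, body, categories, category), the Index class and the function names are all assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterable, List, Set

# A classifier takes the document text and returns category paths.
Classifier = Callable[[str], Set[str]]


def render_for_index(doc_id: str, body: str, editor_categories: Iterable[str],
                     classifiers: Iterable[Classifier]) -> ET.Element:
    """Stand-in for the transform: merge manual and automatic categories."""
    categories = set(editor_categories)
    for classify in classifiers:
        categories |= classify(body)
    root = ET.Element("indexDocument", {"id": doc_id})
    ET.SubElement(root, "body").text = body
    container = ET.SubElement(root, "categories")
    for path in sorted(categories):
        ET.SubElement(container, "category").text = path
    return root


class Index:
    """Toy content index keyed by document id."""

    def __init__(self) -> None:
        self._docs: Dict[str, Set[str]] = {}

    def add(self, xml_doc: ET.Element) -> None:
        doc_id = xml_doc.get("id", "")
        self._docs[doc_id] = {c.text or "" for c in xml_doc.iter("category")}

    def query(self, category: str) -> List[str]:
        return sorted(d for d, cats in self._docs.items() if category in cats)


def on_document_changed(doc_id: str, body: str, editor_categories: Iterable[str],
                        classifiers: Iterable[Classifier], index: Index) -> None:
    """Crawler entry point, called when the store reports a change."""
    index.add(render_for_index(doc_id, body, editor_categories, classifiers))


def mentions_microsoft(text: str) -> Set[str]:
    """Hypothetical classifier with a single hard-coded business rule."""
    return {"\\things\\companies\\us\\Microsoft"} if "Microsoft" in text else set()


index = Index()
on_document_changed("doc-1", "Microsoft ships a new release", [],
                    [mentions_microsoft], index)
on_document_changed("doc-2", "Quarterly results", ["\\things\\companies\\us\\Microsoft"],
                    [], index)
print(index.query("\\things\\companies\\us\\Microsoft"))
```

Here doc-1 is categorized automatically by the classifier and doc-2 carries a category chosen by an editor; both end up in the index under the same category.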
You can browse a small example of a taxonomy tree by launching the content editor and then opening the folder named taxonomy. Each node is a document whose schema is category.xsd. The schema in this example allows for the definition of synonyms and a link to an optional document that represents the entity; for example, a topic page about Microsoft could be linked from the entity \things\companies\us\Microsoft.
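As an illustration of what such a category document carries, the sketch below models a node with synonyms and an optional entity link, and shows how synonyms could help a classifier map different spellings to the same category. The field names, the synonym values and the topic page location are assumptions; they are not taken from category.xsd.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class CategoryDocument:
    """In-memory view of one category node (hypothetical field names)."""
    path: str
    synonyms: List[str] = field(default_factory=list)
    entity_document: Optional[str] = None  # optional link, e.g. to a topic page


def build_synonym_lookup(categories: Iterable[CategoryDocument]) -> Dict[str, str]:
    """Map every known name of a category (lowercased) to its path."""
    lookup: Dict[str, str] = {}
    for category in categories:
        own_name = category.path.rsplit("\\", 1)[-1]
        for name in [own_name, *category.synonyms]:
            lookup[name.lower()] = category.path
    return lookup


microsoft = CategoryDocument(
    path="\\things\\companies\\us\\Microsoft",
    synonyms=["MSFT", "Microsoft Corporation"],   # example values only
    entity_document="topics/microsoft.xml",       # made-up location
)

lookup = build_synonym_lookup([microsoft])
print(lookup["msft"])
```

With a lookup like this, content that mentions "MSFT" and content that mentions "Microsoft Corporation" can both be related to the same category document.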
By establishing a relationship between content documents and these category documents in the taxonomy we can effectively categorize any form of content.