
STRUCTURED DATA - EVERYTHING IN ITS PLACE

The Joys of Taxonomy

Anyone who has ever used a card catalog or online library terminal is familiar with structured data. Rather than indexing the full text of every book, article, and document in a large collection, works are assigned keywords by an archivist, who also categorizes them within a fixed hierarchy. A search for the keywords Khazar empire, for example, might yield several titles under the category Khazars - Ukraine - Kiev - History, while a search for beet farming might return entries under Vegetables - Postharvest Diseases and Injuries - Handbooks, Manuals, etc. The Library of Congress is a good example of this kind of comprehensive classification - each work is assigned keywords from a rigidly controlled vocabulary, then given a unique identifier and placed into one or more categories to facilitate later searching.
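To make the idea concrete, here is a minimal sketch of what one such record might look like in Python. The field names and sample values are invented for illustration; they don't reflect the Library of Congress's actual schema.

    # A minimal sketch of a structured catalog record; field names and
    # sample values are invented, not a real cataloging schema.
    from dataclasses import dataclass, field

    @dataclass
    class CatalogRecord:
        identifier: str                       # unique ID assigned at indexing time
        title: str
        keywords: set[str] = field(default_factory=set)      # controlled vocabulary
        categories: list[str] = field(default_factory=list)  # hierarchical paths

    record = CatalogRecord(
        identifier="rec-0001",
        title="A History of the Khazar Empire",
        keywords={"Khazars", "Khazar empire"},
        categories=["Khazars - Ukraine - Kiev - History"],
    )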

While most library collections do not offer full-text search (since so few works in print are available in electronic form), there is no reason why a structured database can't also include full-text search. Many early web search engines, including Yahoo, took just such an approach, with human archivists reviewing each page and assigning it to one or more categories before including it in the search engine's document collection.

The advantage of structured data is that it lets users refine their search using concepts rather than just individual keywords or phrases. If we are more interested in politics than mountaineering, it is very helpful to be able to limit a search for Geneva summit to the category Politics - International - 20th Century, rather than Switzerland - Geography. And once we get our result, we can use its classifications to browse within a category or sub-category for other results that may be conceptually similar, such as the Reykjavik summit or the SALT II talks, even if they don't contain the keyword Geneva.
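A toy sketch of how such category-constrained searching and browsing might work; the collection, field names, and category paths here are all invented for illustration.

    # Category-constrained search over a toy collection of records.
    collection = [
        {"title": "The Geneva Summit of 1985",
         "categories": ["Politics - International - 20th Century"]},
        {"title": "Geneva: A Walking Guide",
         "categories": ["Switzerland - Geography"]},
        {"title": "Reykjavik: Reagan and Gorbachev",
         "categories": ["Politics - International - 20th Century"]},
    ]

    def search(records, keyword, branch):
        """Keyword search restricted to one branch of the taxonomy."""
        return [r for r in records
                if keyword.lower() in r["title"].lower()
                and any(c.startswith(branch) for c in r["categories"])]

    def browse(records, branch):
        """Everything filed under a branch, keyword match or not."""
        return [r for r in records
                if any(c.startswith(branch) for c in r["categories"])]

    # Limiting 'Geneva' to the politics branch excludes the walking guide...
    print([r["title"] for r in search(collection, "geneva", "Politics")])
    # ...and browsing that branch surfaces the Reykjavik summit, which
    # never mentions Geneva at all.
    print([r["title"] for r in browse(collection, "Politics")])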

You Say Vegetables::Tomato, I Say Fruits::Tomato

Assigning descriptors and classifiers to a text gives us one important advantage: searches can return relevant documents that don't necessarily contain a verbatim match to the query. Fully described data sets also give us a view of the 'big picture' - by examining the structure of categories and sub-categories (or taxonomy), we can form a rough image of the scope and distribution of the document collection as a whole.

But there are serious drawbacks to this approach to categorizing data. For starters, there are the problems inherent in any kind of taxonomy. The world is a fuzzy place that sometimes resists categorization, and putting names to things can constrain the ways in which we view them. Is a tomato a fruit or a vegetable? The answer depends on whether you are a botanist or a cook. Serbian and Croatian are mutually intelligible, but have different writing systems and are spoken by different populations that take a dim view of one another. Are they two different languages? Russian and Polish have two words for 'blue', where English has one. Which is right? Classifying something inevitably colors the way in which we see it.

Moreover, what happens if I need to combine two document collections indexed in different ways? If I have a large set of articles about Indian dialects indexed by language family, and another large set indexed by geographic region, I either need to choose one taxonomy over the other, or combine the two into a third. In either case I will be re-indexing a lot of the data. There are many efforts underway to mitigate this problem - ranging from standards-based approaches like Dublin Core to rarefied research into ontological taxonomies (finding a sort of One True Path to classifying data). Nevertheless, the underlying problem is a thorny one.
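A sketch of what that re-indexing might look like in the simplest case, where the two taxonomies can be related by a hand-built mapping table. Every category name below is hypothetical, and the point is the leftover pile: records whose categories have no mapping still need a human to re-read them.

    # Folding one taxonomy into another via a hand-built mapping.
    family_to_region = {
        "Indo-Aryan": "North India",
        "Dravidian": "South India",
        # ...every remaining category must be mapped by hand
    }

    def reindex(records, mapping):
        """Translate categories through the mapping; return the leftovers."""
        unmapped = []
        for r in records:
            if all(c in mapping for c in r["categories"]):
                r["categories"] = [mapping[c] for c in r["categories"]]
            else:
                unmapped.append(r)  # no automatic answer for these
        return unmapped

    articles = [
        {"title": "Hindi dialect survey", "categories": ["Indo-Aryan"]},
        {"title": "Notes on Munda verbs", "categories": ["Munda"]},
    ]
    leftovers = reindex(articles, family_to_region)
    print([r["title"] for r in leftovers])  # ['Notes on Munda verbs']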

One common-sense solution is to classify things in multiple ways - assigning a variety of categories, keywords, and descriptors to every document we want to index. But this runs into the problem of limited resources. Having an expert archivist review and classify every document in a collection is an expensive undertaking, and it grows more expensive and time-consuming as the taxonomy and keyword vocabulary expand. Making changes is costlier still: if we want to augment or alter our taxonomy (as has actually happened with several large tagged linguistic corpora), there is no recourse but to start over from the beginning. And if any document gets misclassified, it may never be seen again.

Simple schemas may not be descriptive enough to be useful, and complex schemas require many thousands of hours of expert archivist time to design, implement, and maintain. Adding documents to a collection requires more expert time. For large collections, the effort becomes prohibitive.

Better Living Through Matrix Algebra

So far the choice seems pretty stark - either we live with amorphous data that we can only search by keyword, or we adopt a regimented approach that requires enormous quantities of expensive, skilled human time, filters results through the lens of implicit and explicit assumptions about how the data should be organized, and is a chore to maintain. The situation cries out for a middle ground - some way to at least partially organize complex data without human intervention, in a way that will still be meaningful to human users. Fortunately for us, techniques exist to do just that.
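As a preview of where this is headed, here is a minimal sketch of the raw material such techniques start from: a term-document count matrix built from the text alone, with no archivist in the loop. The documents and tokenization are deliberately simplistic.

    # A term-document count matrix built with no human classification.
    docs = [
        "the geneva summit on arms control",
        "postharvest diseases of beet crops",
        "arms control after the reykjavik summit",
    ]

    tokens = [d.split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})

    # matrix[i][j] counts how often vocab word i occurs in document j
    matrix = [[t.count(w) for t in tokens] for w in vocab]

    for word, row in zip(vocab, matrix):
        print(f"{word:12s} {row}")
    # The shared rows for 'arms', 'control', and 'summit' already group
    # documents 0 and 2 together - structure recovered from the text
    # alone, without any human classifier.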


