STRUCTURED DATA - EVERYTHING IN ITS PLACE
The Joys of Taxonomy
Anyone who has ever used a card catalog or online library terminal
is familiar with structured data. Rather than indexing the full
text of every book, article, and document in a large collection,
works are assigned keywords by an archivist, who also categorizes
them within a fixed hierarchy. A search for the keywords 'Khazar
empire', for example, might yield several titles under the
category 'Khazars - Ukraine - Kiev - History',
while a search for 'beet farming'
might return entries under 'Vegetables -
Postharvest Diseases and Injuries - Handbooks, Manuals, etc.'
The Library of Congress is a good example of this kind of comprehensive
classification - each work is assigned keywords from a rigidly constrained
vocabulary, then given a unique identifier and placed into one or
more categories to facilitate later searching.
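As a rough illustration of that cataloging step, here is a minimal Python sketch. It is not how the Library of Congress or any real catalog works internally: the tiny VOCABULARY set, the catalog_work function, and the record layout are all hypothetical, invented only to show the three ingredients named above (a constrained vocabulary, a unique identifier, and one or more categories).

```python
import itertools

# Hypothetical controlled vocabulary; a real one would be far larger.
VOCABULARY = {"khazars", "history", "vegetables", "postharvest diseases"}

# Hypothetical source of unique identifiers: 1, 2, 3, ...
_next_id = itertools.count(1)

def catalog_work(title, keywords, categories):
    """Build a catalog record, rejecting keywords outside the vocabulary."""
    unknown = sorted(set(keywords) - VOCABULARY)
    if unknown:
        raise ValueError(f"keywords not in vocabulary: {unknown}")
    if not categories:
        raise ValueError("a work must be placed in at least one category")
    return {
        "id": next(_next_id),
        "title": title,
        "keywords": sorted(set(keywords)),
        "categories": list(categories),
    }
```

The archivist's judgment is exactly the part this sketch leaves out: deciding which keywords and categories to pass in is the expensive step.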
While most library collections do not feature full-text search
(since so few works in print are available in electronic form),
there is no reason why structured databases can't also include a
full-text search. Many early web search engines, including Yahoo,
used just such an approach, with human archivists reviewing each
page and assigning it to one or more categories before including
it in the search engine's document collection.
The advantage of structured data is that it allows users to refine
their search using concepts rather than just individual keywords
or phrases. If we are more interested in politics than mountaineering,
it is very helpful to be able to limit a search for 'Geneva
summit' to the category 'Politics - International - 20th
Century', rather than 'Switzerland - Geography'.
And once we get our result, we can use the classifiers to browse
within a category or sub-category for other results that may be
conceptually similar, such as 'Reykjavik
summit' or 'SALT II talks',
even if they don't contain the keyword Geneva.
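The two operations just described, refining a keyword search by category and then browsing within that category, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Record type, the sample catalog (including the made-up Alpine title), and the search and browse helpers are hypothetical, and the category paths simply reuse the examples from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    """A hypothetical catalog entry: a title plus archivist-assigned categories."""
    title: str
    categories: tuple  # each category is a tuple of path components

POLITICS = ("Politics", "International", "20th Century")
GEOGRAPHY = ("Switzerland", "Geography")

# Made-up sample collection, for illustration only.
CATALOG = [
    Record("Geneva summit", (POLITICS,)),
    Record("Reykjavik summit", (POLITICS,)),
    Record("SALT II talks", (POLITICS,)),
    Record("Geneva summit routes in the Alps", (GEOGRAPHY,)),
]

def in_category(record, prefix):
    """True if any of the record's categories starts with the given path prefix."""
    return any(cat[:len(prefix)] == prefix for cat in record.categories)

def search(catalog, query, category=None):
    """Match every query word against the title, optionally within one category."""
    words = query.lower().split()
    hits = [r for r in catalog if all(w in r.title.lower() for w in words)]
    if category is not None:
        hits = [r for r in hits if in_category(r, category)]
    return hits

def browse(catalog, category):
    """Every record filed under a category, regardless of its keywords."""
    return [r for r in catalog if in_category(r, category)]
```

Calling search(CATALOG, "Geneva summit") returns both the political and the mountaineering entry, while passing POLITICS as the category keeps only the first. Calling browse(CATALOG, POLITICS) then surfaces the Reykjavik and SALT II records, which a keyword match on Geneva alone would never find.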
You Say Vegetables::Tomato, I Say Fruits::Tomato
We can see how assigning descriptors and classifiers to a text
gives us one important advantage, by returning relevant documents
that don't necessarily contain a verbatim match to our search query.
Fully described data sets also give us a view of the 'big picture'
- by examining the structure of categories and sub-categories (or
taxonomy), we can form a rough image
of the scope and distribution of the document collection as a whole.
But there are serious drawbacks to this approach to categorizing
data. For starters, there are the problems inherent in any kind
of taxonomy. The world is a fuzzy place that sometimes resists categorization,
and putting names to things can constrain the ways in which we view
them. Is a tomato a fruit or a vegetable? The answer depends on
whether you are a botanist or a cook. Serbian and Croatian are mutually
intelligible, but have different writing systems and are spoken
by different populations with a dim view of one another. Are they
two different languages? Russian and Polish have two words for 'blue',
where English has one. Which is right? Classifying something inevitably
colors the way in which we see it.
Moreover, what happens if I need to combine two document collections
indexed in different ways? If I have a large set of articles about
Indian dialects indexed by language family, and another large set indexed
by geographic region, I either need to choose one taxonomy over
the other, or combine the two into a third. In either case I will
be re-indexing a lot of the data. There are many efforts underway
to mitigate this problem - ranging from standards-based approaches
like Dublin Core to rarefied
research into ontological taxonomies
(finding a sort of One True Path to classifying data). Nevertheless,
the underlying problem is a thorny one.
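To make the re-indexing cost concrete, here is a small Python sketch of the second option, combining the two schemes into a third. Everything in it is hypothetical: the article titles, the category labels, and the merge and needs_reindexing functions are invented for illustration. What it shows is that each collection lacks the facet the other one records, so every merged entry is left with a gap that someone has to fill by hand.

```python
# Collection A is indexed by language family only (hypothetical sample data).
by_family = {
    "Article A1": "Indo-Aryan",
    "Article A2": "Dravidian",
}

# Collection B is indexed by geographic region only (hypothetical sample data).
by_region = {
    "Article B1": "South India",
    "Article B2": "North India",
}

def merge(by_family, by_region):
    """Combine both collections under a two-facet scheme (family, region).

    A facet that the source collection never recorded is left as None.
    """
    merged = {}
    for title, family in by_family.items():
        merged.setdefault(title, {"family": None, "region": None})["family"] = family
    for title, region in by_region.items():
        merged.setdefault(title, {"family": None, "region": None})["region"] = region
    return merged

def needs_reindexing(merged):
    """Titles whose record is missing at least one facet."""
    return sorted(t for t, facets in merged.items() if None in facets.values())

combined = merge(by_family, by_region)
todo = needs_reindexing(combined)
```

With these sample collections, every title ends up in the todo list. Only an article that happened to appear in both collections would arrive with both facets already filled in.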
One common-sense solution is to classify things in multiple ways
- assigning a variety of categories, keywords, and descriptors to
every document we want to index. But this runs us into the problem
of limited resources. Having an expert archivist review and classify
every document in a collection is an expensive undertaking, and
it grows more expensive and time-consuming as we expand our taxonomy
and keyword vocabulary. What's more, making changes becomes more
expensive. Remember that if we want to augment or change our taxonomy
(as has actually happened with several large tagged linguistic corpora),
there is no recourse except to start from the beginning. And if
any document gets misclassified, it may never be seen again.
Simple schemas may not be descriptive enough to be useful, and
complex schemas require many thousands of hours of expert archivist
time to design, implement, and maintain. Adding documents to a collection
requires more expert time. For large collections, the effort becomes prohibitive.
Better Living Through Matrix Algebra
So far the choice seems pretty stark - either we live with amorphous
data that we can only search by keyword, or we adopt a regimented
approach that requires enormous quantities of expensive skilled
user time, filters results through the lens of implicit and explicit
assumptions about how the data should be organized, and is a chore
to maintain. The situation cries out for a middle ground, some way
to at least partially organize complex data without human intervention
in a way that will be meaningful to human users. Fortunately for
us, techniques exist to do just that.