THE TERM-DOCUMENT MATRIX
Doing the Numbers
As we mentioned in our discussion of LSI, the term-document
matrix is a large grid representing every document and content
word in a collection. We have looked in detail at how a document
is converted from its original form into a flat list of content
words. We prepare a master word list by generating a similar set
of words for every document in our collection, and discarding any
content words that appear either in every document (such words won't
let us discriminate between documents) or in only one document (such
words tell us nothing about relationships across documents). With
this master word list in hand, we are ready to build our TDM.
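As a concrete illustration, here is a minimal sketch of this filtering
step in Python. It assumes each document has already been reduced to a
flat list of content words, as described above; the function name and
data layout are our own, chosen purely for illustration.

from collections import Counter

def build_master_word_list(documents):
    # documents: one list of content words per document
    num_docs = len(documents)
    doc_freq = Counter()              # number of documents containing each word
    for words in documents:
        doc_freq.update(set(words))
    # keep words that appear in more than one document, but not in all of them
    return sorted(w for w, df in doc_freq.items() if 1 < df < num_docs)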
We generate our TDM by arranging our list of all content words
along the vertical axis, and a similar list of all documents along
the horizontal axis. These need not be in any particular order,
as long as we keep track of which column and row corresponds to
which keyword and document. For clarity we will show the keywords
as an alphabetized list.
We fill in the TDM by going through every document and marking
the grid square for all the content words that appear in it. Because
any one document will contain only a tiny subset of our content
word vocabulary, our matrix is very sparse (that is, it consists
almost entirely of zeroes).
Here is a fragment of the actual term-document matrix from our wire
stories database:
Document: a b c d e f g h i j k l m n o p q r { 3000 more columns }
aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
amotd 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
aaliyah 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
aarp 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 ...
ab 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
...
zywicki 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
We can easily see if a given word appears in a given document by
looking at the intersection of the appropriate row and column. In
this sample matrix, we have used a one to mark each document/keyword
pair where the keyword appears in the document. With such a binary
scheme, all we can tell about any given document/keyword combination
is whether the keyword appears in the document.
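A minimal sketch of building and consulting such a binary grid, using
the master word list from the earlier sketch (the names and layout are
again our own):

def build_binary_tdm(master_words, documents):
    # rows are content words, columns are documents; 1 marks an occurrence
    row = {word: i for i, word in enumerate(master_words)}
    tdm = [[0] * len(documents) for _ in master_words]
    for col, words in enumerate(documents):
        for word in words:
            if word in row:           # ignore words not on the master list
                tdm[row[word]][col] = 1
    return tdm, row

Looking up a word/document pair is then a matter of reading
tdm[row[word]][column].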
This approach will give acceptable results, but we can improve them
significantly by applying a kind of linguistic favoritism called
term weighting to the value we use for each non-zero term/document
pair.
Not All Words Are Created Equal
Term weighting is a formalization of two common-sense insights:
- Content words that appear several times in a document are probably
more meaningful than content words that appear just once.
- Infrequently used words are likely to be more interesting than
common words.
The first of these insights applies to individual documents, and
we refer to it as local weighting.
Words that appear multiple times in a document are given a greater
local weight than words that appear once. We use a formula called
logarithmic local weighting to generate
our actual value.
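The exact formula varies between implementations; a common form of
logarithmic local weighting, which we use here purely as an
illustration, dampens the raw count by taking a logarithm:

import math

def local_weight(term_count):
    # logarithmic local weighting: zero for absent terms, damped growth otherwise
    return math.log(term_count + 1) if term_count > 0 else 0.0

With this formula, a word that occurs ten times in a document gets
roughly twice the local weight of a word that occurs twice, rather
than five times as much.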
The second insight applies to the set of all documents in our collection,
and is called global term weighting.
There are many global weighting schemes; all of them reflect the
fact that words that appear in a small handful of documents are
likely to be more significant than words that are distributed widely
across our document collection. Our own indexing system uses a scheme
called inverse document frequency to
calculate global weights.
By way of illustration, here are some sample words from our collection,
with the number of documents they appear in, and their corresponding
global weights.
word documents global weight
unit 833 1.44
cost 295 2.47
project 169 3.03
tackle 40 4.47
wrestler 7 6.22
You can see that a word like wrestler,
which appears in only seven documents, is considered twice as significant
as a word like project, which appears
in over a hundred.
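One common form of inverse document frequency is the natural logarithm
of the total number of documents divided by the number of documents
containing the word. Assuming a collection of roughly 3,500 documents
(our guess, not a figure taken from the collection itself), this form
reproduces the sample weights above quite closely:

import math

def global_weight(doc_count, num_docs=3500):
    # inverse document frequency: the rarer the word, the larger the weight
    return math.log(num_docs / doc_count)

# global_weight(7)   -> about 6.2   (wrestler)
# global_weight(833) -> about 1.4   (unit)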
There is a third and final step to weighting, called normalization.
This is a scaling step designed to keep large documents with many
keywords from overwhelming smaller documents in our result set.
It is similar to handicapping in golf - smaller documents are given
more importance, and larger documents are penalized, so that every
document has equal significance.
These three values multiplied together - local weight, global weight,
and normalization factor - determine the actual numerical value
that appears in each non-zero position of our term/document matrix.
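Putting the three factors together for a single document (one column
of the matrix) might look like the sketch below. We use cosine, or
unit-length, normalization as the scaling step; that is one common
choice, though other normalization schemes exist.

import math

def weighted_column(term_counts, doc_freqs, num_docs):
    # term_counts: raw counts for one document, keyed by word
    # doc_freqs: number of documents each word appears in
    weights = {
        word: math.log(count + 1) * math.log(num_docs / doc_freqs[word])
        for word, count in term_counts.items() if count > 0
    }
    # normalization: scale the column so long documents don't dominate
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {word: w / norm for word, w in weights.items()}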
Although this step may appear language-specific, note that we are
only looking at word frequencies within our collection. Unlike the
stop list or stemmer, we don't need any outside source of linguistic
information to calculate the various weights. While weighting isn't
critical to understanding or implementing LSI, it does lead to much
better results, as it takes into account the relative importance
of potential search terms.
The Moment of Truth
With the weighting step done, we have done everything we need to
construct a finished term-document matrix. The final step will be
to run the SVD algorithm itself. Notice that this critical step
will be purely mathematical - although we know that the matrix and
its contents are a shorthand for certain linguistic features of
our collection, the algorithm doesn't know anything about what the
numbers mean. This is why we say LSI is language-agnostic - as long
as you can perform the steps needed to generate a term-document
matrix from your data collection, it can be in any language or format
whatsoever.
You may be wondering what the large matrix of numbers we have created
has to do with the term vectors and many-dimensional spaces we discussed
in our earlier explanation of how LSI works. In fact, our matrix
is a convenient way to represent vectors in a high-dimensional space.
While we have been thinking of it as a lookup grid that shows us
which terms appear in which documents, we can also think of it in
spatial terms. In this interpretation, every column is a long list
of coordinates that gives us the exact position of one document
in a many-dimensional term space. When we applied term weighting
to our matrix in the previous step, we nudged those coordinates
around to make the document's position more accurate.
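To make the spatial reading concrete: if we treat two columns as
vectors, the cosine of the angle between them is one standard measure
of how close the two documents sit in term space (this anticipates the
similarity scoring discussed below):

import math

def cosine_similarity(col_a, col_b):
    # col_a, col_b: two columns of the matrix, as lists of floats
    dot = sum(a * b for a, b in zip(col_a, col_b))
    norm_a = math.sqrt(sum(a * a for a in col_a))
    norm_b = math.sqrt(sum(b * b for b in col_b))
    return dot / (norm_a * norm_b)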
As the name suggests, singular value decomposition breaks our matrix
down into a set of smaller components. The algorithm alters one
of these components (this is where the number of dimensions gets
reduced), and then recombines them into a matrix of the same shape
as our original, so we can again use it as a lookup grid. The matrix
we get back is an approximation of the term-document matrix we provided
as input, and it looks quite different from the original:
Document:     a       b       c       d       e       f       g       h       i       j       k
aa       -0.0006 -0.0006  0.0002  0.0003  0.0001  0.0000  0.0000 -0.0001  0.0007  0.0001  0.0004 ...
amotd    -0.0112 -0.0112 -0.0027 -0.0008 -0.0014  0.0001 -0.0010  0.0004 -0.0010 -0.0015  0.0012 ...
aaliyah  -0.0044 -0.0044 -0.0031 -0.0008 -0.0019  0.0027  0.0004  0.0014 -0.0004 -0.0016  0.0012 ...
aarp      0.0007  0.0007  0.0004  0.0008 -0.0001 -0.0003  0.0005  0.0004  0.0001  0.0025  0.0000 ...
ab       -0.0038 -0.0038  0.0027  0.0024  0.0036 -0.0022  0.0013 -0.0041  0.0010  0.0019  0.0026 ...
...
zywicki  -0.0057  0.0020  0.0039 -0.0078 -0.0018  0.0017  0.0043 -0.0014  0.0050 -0.0020 -0.0011 ...
Notice two interesting features in the processed data:
- The matrix contains far fewer zero values. Each document has
a similarity value for most content words.
- Some of the similarity values are negative. In our original
TDM, this would correspond to a document with fewer than zero
occurrences of a word, an impossibility. In the processed matrix,
a negative value indicates a very large semantic distance between
a term and a document.
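The decompose, truncate, and recombine step described above can be
sketched with numpy's dense SVD routine; the number of retained
dimensions k is illustrative, and production LSI systems generally use
sparse, truncated SVD implementations instead:

import numpy as np

def reduce_rank(tdm, k=100):
    # break the matrix into components: tdm = u * diag(s) * vt
    u, s, vt = np.linalg.svd(tdm, full_matrices=False)
    s[k:] = 0.0                      # discard the smaller singular values
    # recombine into a matrix of the same shape as the original
    return u @ np.diag(s) @ vt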
This finished matrix is what we use to actually search our collection.
Given one or more terms in a search query, we look up the values
for each search term/document combination, calculate a cumulative
score for every document, and rank the documents by that score,
which is a measure of their similarity to the search query. In practice,
we will probably assign an empirically determined threshold value
to serve as a cutoff between relevant and irrelevant documents, so
that the query does not return every document in our collection.
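A sketch of that scoring step, assuming the processed matrix is held
as a numpy array and that we keep a word-to-row mapping like the one
built earlier; the threshold value here is purely illustrative:

import numpy as np

def rank_documents(query_terms, matrix, row_index, threshold=0.01):
    rows = [row_index[t] for t in query_terms if t in row_index]
    scores = matrix[rows].sum(axis=0)        # cumulative score per document (column)
    ranked = np.argsort(scores)[::-1]        # best-scoring documents first
    return [(int(d), float(scores[d])) for d in ranked if scores[d] > threshold]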
The Big Picture
Now that we have looked at the details of latent semantic indexing,
it is instructive to step back and examine some real-life applications
of LSI. Many of these go far beyond plain search, and can assume
some surprising and novel guises. Nevertheless, the underlying techniques
will be the same as the ones we have outlined here.