
THE TERM-DOCUMENT MATRIX

Doing the Numbers

As we mentioned in our discussion of LSI, the term-document matrix is a large grid representing every document and content word in a collection. We have looked in detail at how a document is converted from its original form into a flat list of content words. We prepare a master word list by generating a similar set of words for every document in our collection, and discarding any content words that either appear in every document (such words won't let us discriminate between documents) or in only one document (such words tell us nothing about relationships across documents). With this master word list in hand, we are ready to build our TDM.

We generate our TDM by arranging our list of all content words along the vertical axis, and a similar list of all documents along the horizontal axis. These need not be in any particular order, as long as we keep track of which column and row corresponds to which keyword and document. For clarity we will show the keywords as an alphabetized list.

We fill in the TDM by going through every document and marking the grid square for all the content words that appear in it. Because any one document will contain only a tiny subset of our content word vocabulary, our matrix is very sparse (that is, it consists almost entirely of zeroes).
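To make this concrete, here is a minimal sketch in Python of how such a grid might be built, assuming each document has already been reduced to a flat list of content words as described earlier. The function name and data structures are illustrative rather than drawn from any particular indexer, and a real system would store the grid in a sparse matrix format instead of keeping all those zeroes explicitly.

    import numpy as np

    def build_binary_tdm(docs):
        # docs: one flat list of content words per document.
        n_docs = len(docs)
        # Master word list: count how many documents each word appears in,
        # then discard words found in every document or in only one.
        doc_freq = {}
        for words in docs:
            for w in set(words):
                doc_freq[w] = doc_freq.get(w, 0) + 1
        vocab = sorted(w for w, df in doc_freq.items() if 1 < df < n_docs)
        row = {w: i for i, w in enumerate(vocab)}
        # Rows are keywords, columns are documents; mark a 1 wherever a
        # keyword appears in a document.  Most entries remain zero.
        tdm = np.zeros((len(vocab), n_docs), dtype=np.int8)
        for col, words in enumerate(docs):
            for w in set(words):
                if w in row:
                    tdm[row[w], col] = 1
        return vocab, tdm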

Here is a fragment of the actual term-document matrix from our wire stories database:


Document:  a  b  c  d  e  f  g  h  i  j  k  l  m  n  o  p  q  r   { 3000 more columns }

aa         0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   ...
amotd      0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   ...
aaliyah    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   ...
aarp       0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0   ...
ab         0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   ...
...
zywicki    0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0   ...

We can easily see if a given word appears in a given document by looking at the intersection of the appropriate row and column. In this sample matrix, we have used a one to mark each document/keyword pair where the keyword appears in the document. With such a binary scheme, all we can tell about any given document/keyword combination is whether the keyword appears in the document.

This approach will give acceptable results, but we can do significantly better by applying a kind of linguistic favoritism called term weighting to the value we use for each non-zero term/document pair.

Not all Words are Created Equal

Term weighting is a formalization of two common-sense insights:

  1. Content words that appear several times in a document are probably more meaningful than content words that appear just once.

  2. Infrequently used words are likely to be more interesting than common words.

The first of these insights applies to individual documents, and we refer to it as local weighting. Words that appear multiple times in a document are given a greater local weight than words that appear once. We use a formula called logarithmic local weighting to generate our actual value.

The second insight applies to the set of all documents in our collection, and is called global term weighting. There are many global weighting schemes; all of them reflect the fact that words that appear in a small handful of documents are likely to be more significant than words that are distributed widely across our document collection. Our own indexing system uses a scheme called inverse document frequency to calculate global weights.
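As a rough sketch, here is one common formulation of these two weights in Python, using natural logarithms. The exact formulas vary from one indexing system to another, and the collection size in the final comment is a hypothetical figure, not the actual size of our database.

    import math

    def local_weight(term_count):
        # Logarithmic local weighting: repeated occurrences of a word
        # within a document add weight, but with diminishing returns.
        return math.log(1 + term_count)

    def global_weight(doc_count, total_docs):
        # Inverse document frequency: the fewer documents a word
        # appears in, the larger its global weight.
        return math.log(total_docs / doc_count)

    # For a hypothetical collection of about 3,500 documents, a word
    # found in only 7 of them would score log(3500 / 7), roughly 6.2.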

By way of illustration, here are some sample words from our collection, with the number of documents they appear in, and their corresponding global weights.

word        count     global weight

unit          833          1.44
cost          295          2.47
project       169          3.03
tackle         40          4.47
wrestler        7          6.22

You can see that a word like wrestler, which appears in only seven documents, is considered twice as significant as a word like project, which appears in over a hundred.

There is a third and final step to weighting, called normalization. This is a scaling step designed to keep large documents with many keywords from overwhelming smaller documents in our result set. It is similar to handicapping in golf: smaller documents are given a boost and larger documents are scaled back, so that no document dominates the results simply because of its length.

These three values multiplied together - local weight, global weight, and normalization factor - determine the actual numerical value that appears in each non-zero position of our term/document matrix.
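Putting the pieces together, the weighting step might be sketched as follows. Cosine (unit-length) normalization is shown here as one common choice of scaling factor; that choice is an assumption, since the text above does not name a specific normalization formula.

    import math

    def weight_document(term_counts, doc_counts, total_docs):
        # term_counts: {word: occurrences in this document}
        # doc_counts:  {word: number of documents containing the word}
        # Local weight x global weight for every content word.
        raw = {w: math.log(1 + tf) * math.log(total_docs / doc_counts[w])
               for w, tf in term_counts.items()}
        # Normalization: scale the document's column to unit length so
        # that long, keyword-rich documents don't overwhelm short ones.
        length = math.sqrt(sum(v * v for v in raw.values()))
        return {w: v / length for w, v in raw.items()}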

Although this step may appear language-specific, note that we are only looking at word frequencies within our collection. Unlike the stop list or stemmer, we don't need any outside source of linguistic information to calculate the various weights. While weighting isn't critical to understanding or implementing LSI, it does lead to much better results, as it takes into account the relative importance of potential search terms.

The Moment of Truth

With the weighting step complete, we have everything we need to construct a finished term-document matrix. The final step will be to run the SVD algorithm itself. Notice that this critical step is purely mathematical - although we know that the matrix and its contents are a shorthand for certain linguistic features of our collection, the algorithm knows nothing about what the numbers mean. This is why we say LSI is language-agnostic - as long as you can perform the steps needed to generate a term-document matrix from your data collection, the collection can be in any language or format whatsoever.

You may be wondering what the large matrix of numbers we have created has to do with the term vectors and many-dimensional spaces we discussed in our earlier explanation of how LSI works. In fact, our matrix is a convenient way to represent vectors in a high-dimensional space. While we have been thinking of it as a lookup grid that shows us which terms appear in which documents, we can also think of it in spatial terms. In this interpretation, every column is a long list of coordinates that gives us the exact position of one document in a many-dimensional term space. When we applied term weighting to our matrix in the previous step, we nudged those coordinates around to make the document's position more accurate.
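To make the spatial reading concrete, here is a trivial helper (the name is illustrative) that pulls one document's coordinates out of the matrix built earlier:

    import numpy as np

    def document_vector(tdm, col):
        # The column holding document `col` is that document's set of
        # coordinates in term space: one coordinate per word in the
        # master word list.
        return np.asarray(tdm)[:, col]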

As the name suggests, singular value decomposition breaks our matrix down into a set of smaller components. The algorithm alters one of these components (this is where the number of dimensions gets reduced), and then recombines them into a matrix of the same shape as our original, so we can again use it as a lookup grid. The matrix we get back is an approximation of the term-document matrix we provided as input, and looks much different from the original:
Document:      a        b        c        d        e        f        g        h        i        j        k

aa        -0.0006  -0.0006   0.0002   0.0003   0.0001   0.0000   0.0000  -0.0001   0.0007   0.0001   0.0004   ...
amotd     -0.0112  -0.0112  -0.0027  -0.0008  -0.0014   0.0001  -0.0010   0.0004  -0.0010  -0.0015   0.0012   ...
aaliyah   -0.0044  -0.0044  -0.0031  -0.0008  -0.0019   0.0027   0.0004   0.0014  -0.0004  -0.0016   0.0012   ...
aarp       0.0007   0.0007   0.0004   0.0008  -0.0001  -0.0003   0.0005   0.0004   0.0001   0.0025   0.0000   ...
ab        -0.0038  -0.0038   0.0027   0.0024   0.0036  -0.0022   0.0013  -0.0041   0.0010   0.0019   0.0026   ...
...
zywicki   -0.0057   0.0020   0.0039  -0.0078  -0.0018   0.0017   0.0043  -0.0014   0.0050  -0.0020  -0.0011   ...


Notice two interesting features in the processed data:

  • The matrix contains far fewer zero values. Each document has a similarity value for most content words.

  • Some of the similarity values are negative. In our original TDM, this would correspond to a document with fewer than zero occurrences of a word, an impossibility. In the processed matrix, a negative value indicates a very large semantic distance between a term and a document.
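The decompose-alter-recombine step described above can be sketched in a few lines of Python. Here numpy's general-purpose SVD routine stands in for whatever large-scale SVD package a production indexer would use, and the number of dimensions to keep, k, is an assumption that would be tuned empirically.

    import numpy as np

    def reduce_rank(weighted_tdm, k=100):
        # Decompose the weighted term-document matrix into three
        # components: U, the singular values s, and Vt.
        U, s, Vt = np.linalg.svd(weighted_tdm, full_matrices=False)
        # Alter one component: keep only the k largest singular values
        # and zero out the rest -- this is where dimensions are reduced.
        s[k:] = 0.0
        # Recombine the components into a matrix of the original shape.
        return (U * s) @ Vt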

This finished matrix is what we use to actually search our collection. Given one or more terms in a search query, we look up the values for each search term/document combination, calculate a cumulative score for every document, and rank the documents by that score, which is a measure of their similarity to the search query. In practice, we will probably assign an empirically-determined threshold value to serve as a cutoff between relevant and irrelevant documents, so that the query does not return every document in our collection.
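Taken literally, the scoring procedure just described might look something like this in Python. The names are illustrative, and the default threshold of zero is a placeholder for the empirically determined cutoff mentioned above.

    import numpy as np

    def search(reduced_tdm, row_index, query_terms, threshold=0.0):
        # reduced_tdm: the rank-reduced term-document matrix
        # row_index:   {word: row number in the matrix}
        rows = [row_index[t] for t in query_terms if t in row_index]
        if not rows:
            return []
        # Cumulative score per document: sum the value found at every
        # search-term/document intersection.
        scores = reduced_tdm[rows, :].sum(axis=0)
        # Rank documents by score and drop those below the cutoff.
        ranked = np.argsort(scores)[::-1]
        return [(int(col), float(scores[col]))
                for col in ranked if scores[col] > threshold]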

The Big Picture

Now that we have looked at the details of latent semantic indexing, it is instructive to step back and examine some real-life applications of LSI. Many of these go far beyond plain search, and can assume some surprising and novel guises. Nevertheless, the underlying techniques will be the same as the ones we have outlined here.

