<
previous
next >
LSI EXAMPLE - INDEXING A DOCUMENT
Putting Theory into Practice
While a discussion of the mathematics behind singular value decomposition
is beyond the scope of our article, it's worthwhile to follow the
process of creating a term-document matrix in some detail, to get
a feel for what goes on behind the scenes. Here we will process
a sample wire story to demonstrate how real-life texts get converted
into the numerical representation we use as input for our SVD algorithm.
The first step in the chain is obtaining a set of documents in
electronic form. This can be the hardest thing about LSI - there
are all too many interesting collections not yet available online.
In our experimental database, we download wire stories from an online
newspaper with an AP news feed. A script downloads each day's news
stories to a local disk, where they are stored as text files.
Let's imagine we have downloaded the following sample wire story,
and want to incorporate it in our collection:
O'Neill Criticizes Europe on Grants
PITTSBURGH (AP)
Treasury Secretary Paul O'Neill expressed
irritation Wednesday that European countries have refused
to go along with a U.S. proposal to boost the amount of direct
grants rich nations offer poor countries.
The Bush administration is pushing a plan to increase the
amount of direct grants the World Bank provides the poorest
nations to 50 percent of assistance, reducing use of loans
to these nations.
|
The first thing we do is strip all formatting from the article,
including capitalization, punctuation, and extraneous markup (like
the dateline). LSI pays no attention to word order, formatting,
or capitalization, so can safely discard that information. Our cleaned-up
wire story looks like this:
o'neill criticizes europe on grants treasury
secretary paul o'neill expressed irritation wednesday that
european countries have refused to go along with a us proposal
to boost the amount of direct grants rich nations offer poor
countries the bush administration is pushing a plan to increase
the amount of direct grants the world bank provides the poorest
nations to 50 percent of assistance reducing use of loans
to these nations
|
The next thing we want to do is pick out the content words in our
article. These are the words we consider semantically significant
- everything else is clutter. We do this by applying a stop
list of commonly used English words that don't carry semantic
meaning. Using a stop list greatly reduces the amount of noise in
our collection, as well as eliminating a large number of words that
would make the computation more difficult. Creating a stop list
is something of an art - they depend very much on the nature of
the data collection. You can see our full wire stories stop
list here.
Here is our sample story with stop-list words highlighted:
o'neill criticizes europe
on grants treasury secretary
paul o'neill expressed irritation wednesday
that european countries have
refused to go along with a
US proposal to boost the
amount of direct grants rich nations offer
poor countries the bush administration
is pushing a
plan to increase the
amount of direct grants the
world bank provides the poorest
nations to 50 percent of assistance
reducing use of loans to
these nations
|
Removing these stop words leaves us with an abbreviated version
of the article containing content words only:
o'neill criticizes europe grants treasury
secretary paul o'neill expressed irritation european countries
refused US proposal boost direct grants rich nations poor
countries bush administration pushing plan increase amount
direct grants world bank poorest nations assistance loans
nations
|
However, one more important step remains before our document is
ready for indexing. Notice how many of our content words are plural
nouns (grants, nations)
and inflected verbs (pushing, refused).
It doesn't seem very useful to have each inflected form of a content
word be listed seperately in our master word list - with all the
possible variants, the list would soon grow unwieldy. More troubling
is that LSI might not recognize that the different variant forms
were actually the same word in disguise. We solve this problem by
using a stemmer.
Stemming
While LSI itself knows nothing about language (we saw how it deals
exclusively with a mathematical vector space), some of the preparatory
work needed to get documents ready for indexing is very language-specific.
We have already seen the need for a stop list, which will vary entirely
from language to language and to a lesser extent from document collection
to document collection. Stemming is similarly language-specific,
derived from the morphology of the language. For English documents,
we use an algorithm called the Porter stemmer
to remove common endings from words, leaving behind an invariant
root form. Here are some examples of words before and after stemming:
information -> inform
presidency -> presid
presiding -> presid
happiness -> happi
happily -> happi
discouragement -> discourag
battles -> battl
And here is our sample story as it appears to the stemmer:
o'neill criticizes
europe grants treasury
secretary paul o'neill expressed
irritation european
countries refused
US proposal boost direct grants
rich nations poor countries
bush administration pushing
plan increase amount direct grants
world bank poorest nations
assistance loans
nations
|
Note that at this point we have reduced the original natural-language
news story to a series of word stems. All of the information carried
by punctuation, grammar, and style is gone - all that remains is
word order, and we will be doing away with even that by transforming
our text into a word list. It is striking that so much of the meaning
of text passages inheres in the number and choice of content words,
and relatively little in the way they are arranged. This is very
counterintuitive, considering how important grammar and writing
style are to human perceptions of writing.
Having stripped, pruned, and stemmed our text, we are left with
a flat list of words:
administrat
amount
assist
bank
boost
bush
countri (2)
direct
europ
express
grant (2)
increas
irritat
loan
nation (3)
o'neill
paul
plan
poor (2)
propos
push
refus
rich
secretar
treasuri
US
world
This is the information we will use to generate our term-document
matrix, along with a similar word list for every document in our
collection.
<
previous
next > |