INTRODUCTION - THE NEED FOR SMARTER SEARCH ENGINES

As of early 2002, there were just over two billion web pages listed
              in the Google search engine index, widely taken to be the most comprehensive. 
              No one knows how many more web pages there are on the Internet, 
              or the total number of documents available over the public network, 
              but there is no question that the number is enormous and growing 
              quickly. Every one of those web pages has come into existence within 
              the past ten years. There are web sites covering every conceivable 
              topic at every level of detail and expertise, and information ranging 
              from numerical tables to personal diaries to public discussions. 
              Never before have so many people had access to so much diverse information. 
             Even as the early publicity surrounding the Internet has died down, 
              the network itself has continued to expand at a fantastic rate, 
              to the point where the quantity of information available over public 
              networks is starting to exceed our ability to search it. Search 
              engines have been in existence for many decades, but until recently 
              they have been specialized tools for use by experts, designed to 
              search modest, static, well-indexed, well-defined data collections. 
Today's search engines have to cope with rapidly changing, heterogeneous
              data collections that are orders of magnitude larger than ever before. 
              They also have to remain simple enough for average and novice users 
              to use. While computer hardware has kept up with these demands - 
              we can still search the web in the blink of an eye - our search 
              algorithms have not. As any Web user knows, getting reliable, relevant 
results for an online search is often difficult.

For all their problems, online search engines have come a long
              way. Sites like Google are pioneering the use of sophisticated techniques 
              to help distinguish content from drivel, and the arms race between 
              search engines and the marketers who want to manipulate them has 
              spurred innovation. But the challenge of finding relevant content 
              online remains. Because of the sheer number of documents available, 
              we can find interesting and relevant results for any search query 
              at all. The problem is that those results are likely to be hidden 
              in a mass of semi-relevant and irrelevant information, with no easy 
way to distinguish the good from the bad.

Precision, Ranking, and Recall - the Holy Trinity

In talking about search engines and how to improve them, it helps
              to remember what distinguishes a useful search from a fruitless 
one. There are generally three things we want from a truly useful
search engine:
             
- We want it to give us all of the relevant information available on our topic.
- We want it to give us only information that is relevant to our search.
- We want the information ordered in some meaningful way, so that we see the most relevant results first.

The first of these criteria - getting all of the relevant information
              available - is called recall. Without 
              good recall, we have no guarantee that valid, interesting results 
              won't be left out of our result set. We want the rate of false negatives 
              - relevant results that we never see - to be as low as possible. 
             The second criterion - the proportion of documents in our result 
set that are relevant to our search - is called precision.
              With too little precision, our useful results get diluted by irrelevancies, 
              and we are left with the task of sifting through a large set of 
              documents to find what we want. High precision means the lowest 
possible rate of false positives.
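To make these two measures concrete, here is a small sketch in Python; the documents, result set, and relevance judgments are invented purely for the example:

    # Suppose exactly four documents in our collection are truly
    # relevant to the query (the ideal answer set)...
    relevant = {"doc1", "doc2", "doc3", "doc4"}

    # ...and suppose the engine actually returns these five.
    retrieved = {"doc1", "doc2", "doc5", "doc6", "doc7"}

    hits = relevant & retrieved  # true positives: doc1, doc2

    # Recall: what fraction of the relevant documents did we find?
    recall = len(hits) / len(relevant)      # 2/4 = 0.50

    # Precision: what fraction of what we found is actually relevant?
    precision = len(hits) / len(retrieved)  # 2/5 = 0.40

Note that returning every document in the collection would drive recall to a perfect 1.0 while ruining precision, which is exactly the tradeoff taken up next.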
There is an inevitable tradeoff between precision and recall. Search results generally lie on a continuum of relevancy, so there is no
              distinct place where relevant results stop and extraneous ones begin. 
              The wider we cast our net, the less precise our result set becomes. 
              This is why the third criterion, ranking, 
              is so important. Ranking has to do with whether the result set is 
              ordered in a way that matches our intuitive understanding of what 
              is more and what is less relevant. Of course the concept of 'relevance' 
              depends heavily on our own immediate needs, our interests, and the 
              context of our search. In an ideal world, search engines would learn 
              our individual preferences so well that they could fine-tune any 
search we made based on our past expressed interests and peccadilloes.
              In the real world, a useful ranking is anything that does a reasonable 
job distinguishing between strong and weak results.
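Extending the same sort of toy example, the sketch below shows ranking by score, and how casting a progressively wider net trades precision for recall (the scores and relevance judgments are again invented):

    # Hypothetical relevance scores an engine might assign to six
    # documents; suppose doc1, doc2, and doc3 are the truly relevant ones.
    scores = {"doc1": 0.91, "doc2": 0.83, "doc4": 0.55,
              "doc3": 0.47, "doc5": 0.30, "doc6": 0.12}
    relevant = {"doc1", "doc2", "doc3"}

    # Ranking: present results in descending order of score.
    ranked = sorted(scores, key=scores.get, reverse=True)

    # Keep more and more of the ranked list and watch the two
    # measures move in opposite directions.
    for cutoff in (2, 4, 6):
        kept = set(ranked[:cutoff])
        hits = len(kept & relevant)
        print(cutoff, hits / cutoff, hits / len(relevant))

    # cutoff=2: precision 1.00, recall 0.67
    # cutoff=4: precision 0.75, recall 1.00
    # cutoff=6: precision 0.50, recall 1.00

Recall can only rise as the net widens, while precision decays; a good ranking keeps the strong results at the top, so that precision stays as high as possible wherever we draw the line.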
The Platonic Search Engine

Building on these three criteria of precision, ranking and recall, it is not hard to envision what an ideal search engine might be
              like: 
             
               
- Scope: The ideal engine would be able to search every document on the Internet.
- Speed: Results would be available immediately.
- Currency: All the information would be kept completely up-to-date.
- Recall: We could always find every document relevant to our query.
- Precision: There would be no irrelevant documents in our result set.
- Ranking: The most relevant results would come first, and the ones furthest afield would come last.

Of course, our mundane search engines have a way to go before reaching
the Platonic ideal. What will it take to bridge the gap?

For the first three items in the list - scope, speed, and currency
              - it's possible to make major improvements by throwing resources 
              at the problem. Search engines can always be made more comprehensive 
              by adding content, they can always be made faster with better hardware 
              and programming, and they can always be made more current through 
frequent updates and regular purging of outdated information.

Improving our trinity of precision, ranking and recall, however,
              requires more than brute force. In the following pages, we will 
describe one promising approach, called latent
semantic indexing, which lets us make improvements in all
three categories. LSI was first developed at Bellcore in the late
1980s, and remains an object of active research, but it is surprisingly
little-known outside the information retrieval community. But before
              we can talk about LSI, we need to talk a little more about how search 
              engines do what they do. 