Phrase Based Indexing, Phrase Based Information Retrieval

The latest Google news making the rounds is a series of patent applications, filed with the U.S. Patent Office not too long ago, that form a related "cluster" on phrase-based indexing and retrieval. They're listed here (with the exception of one, which will be added), along with an important one that was added after the original set and addresses search indexing through the use of a primary and a secondary index. It also gives a somewhat clearer, simpler overview of phrase-based concepts.

I'll also be adding references to a couple of other related and preceding technologies that can help to give a clearer picture of the older foundational principles that apply, along with some personal notes. Other suggested reading is the page with references to papers on keyword co-occurrence, a term used repeatedly in these patent applications.

Patent Applications

The latest application is listed first; the others follow in logical sequence.

Multiple index based information retrieval system
United States Patent Application 20060106792
Published: May 18, 2006
Filed: January 25, 2005
Inventor: Anna Lynn Patterson

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. The document index is partitioned into multiple indexes, including a primary index and a secondary index. The primary index stores phrase posting lists with relevance rank ordered documents. The secondary index stores excess documents from the posting lists in document order.
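
To make the primary/secondary split a little more concrete, here is a minimal sketch in Python. It is not taken from the application: the function name build_indexes, the primary_limit cap and the sample relevance scores are all hypothetical, chosen only to illustrate a posting list whose top entries are kept in rank order while the overflow is kept in document order.

```python
def build_indexes(postings, primary_limit):
    """Split each phrase's posting list into a primary and a secondary part.

    postings: dict mapping phrase -> list of (doc_id, relevance_score)
    primary_limit: hypothetical cap on entries kept in the primary index
    """
    primary, secondary = {}, {}
    for phrase, docs in postings.items():
        # Rank by descending relevance; ties are broken by doc_id.
        ranked = sorted(docs, key=lambda d: (-d[1], d[0]))
        # The primary index keeps the top documents in relevance rank order.
        primary[phrase] = [doc_id for doc_id, _ in ranked[:primary_limit]]
        # Excess documents go to the secondary index in document (id) order.
        secondary[phrase] = sorted(doc_id for doc_id, _ in ranked[primary_limit:])
    return primary, secondary


# Made-up posting list for a single phrase.
postings = {
    "phrase based indexing": [(7, 0.2), (3, 0.9), (12, 0.5), (5, 0.1)],
}
primary, secondary = build_indexes(postings, primary_limit=2)
print(primary)    # top-ranked documents, best first
print(secondary)  # remaining documents, sorted by document id
```

With a cap of two, the two highest-scoring documents land in the primary list and the other two fall through to the secondary list, sorted by id rather than by score.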


Phrase identification in an information retrieval system

United States Patent Application 20060018551
Filed: July 26, 2004
Inventor: Anna Lynn Patterson

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.
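
One simple way to picture a phrase that "predicts the presence of" another is to compare how often two phrases actually appear together with how often they would be expected to if they were independent. The sketch below does that with a plain co-occurrence ratio. This is an illustrative assumption, not the measure defined in the application; the function name prediction_score and the tiny document set are made up.

```python
def prediction_score(docs, a, b):
    """Ratio of actual to expected co-occurrence of phrases a and b.

    docs: list of sets, each holding the phrases found in one document.
    A score well above 1.0 suggests a and b appear together more often
    than chance alone would explain.
    """
    n = len(docs)
    has_a = sum(1 for d in docs if a in d)
    has_b = sum(1 for d in docs if b in d)
    both = sum(1 for d in docs if a in d and b in d)
    if has_a == 0 or has_b == 0:
        return 0.0
    # Expected number of co-occurrences if the phrases were independent.
    expected = has_a * has_b / n
    return both / expected


# Made-up corpus: each set lists the phrases seen in one document.
docs = [
    {"white house", "president"},
    {"white house", "president"},
    {"president", "election"},
    {"recipe", "oven"},
]
print(prediction_score(docs, "white house", "president"))
print(prediction_score(docs, "white house", "oven"))
```

In this toy corpus "white house" and "president" score above 1.0, while "white house" and "oven" never co-occur and score zero.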


Phrase Based Indexing in an Information Retrieval System

United States Patent Application 20060020607
Filed: July 26, 2004
Inventor: Anna Lynn Patterson

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.


Phrase-based searching in an information retrieval system

United States Patent Application 20060031195
Filed: July 26, 2004
Inventor: Anna Lynn Patterson

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.


Phrase-based generation of document descriptions

United States Patent Application 20060020571
Filed: July 26, 2004
Inventor: Anna Lynn Patterson

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index.


Detecting spam documents in a phrase based information retrieval system
United States Patent Application 20060294155
Published: December 28, 2006
Filed: June 28, 2006
Inventor: Anna Lynn Patterson

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are then indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.
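
The idea of flagging a document by the number of related phrases it contains can be sketched as a simple count against a cutoff. Everything specific here is an assumption for illustration: the function names, the substring matching, the sample phrase list and the max_expected threshold are hypothetical, and the application may define "related phrases" and the cutoff quite differently.

```python
def count_related_phrases(text, related_phrases):
    """Count how many of the given related phrases occur in the text."""
    lowered = text.lower()
    # Naive case-insensitive substring match; a real system would tokenize.
    return sum(1 for phrase in related_phrases if phrase.lower() in lowered)


def looks_like_spam(text, related_phrases, max_expected):
    """Flag a document holding more related phrases than max_expected.

    max_expected is a hypothetical cutoff standing in for whatever a
    normal, non-spam document would be expected to contain.
    """
    return count_related_phrases(text, related_phrases) > max_expected


# Made-up set of phrases related to one another.
related = ["white house", "president", "oval office", "west wing", "press secretary"]

normal = "The president spoke at the White House today."
stuffed = "white house president oval office west wing press secretary cheap tickets"

print(looks_like_spam(normal, related, max_expected=3))
print(looks_like_spam(stuffed, related, max_expected=3))
```

The ordinary sentence mentions only two of the related phrases, while the keyword-stuffed one contains all five and crosses the cutoff.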


Efficient Phrase Based Document Indexing for Document Clustering
Examines clustering by phrases rather than by individual words.

Other Resources

Phrase Based Information Retrieval and Spam Detection
Bill Slawski's article at SEO by the Sea

Using WordNet in a Knowledge-Based Approach to Information Retrieval (PDF)
Abstract and references: Citeseer

More to come. :-)