Duplicate Content: Definitions and Detecting Duplicates

This portion of the collection consists of academic papers on defining and detecting duplicate content on the internet, including papers relevant to or authored by principals and staff at Google. The papers discuss methods of detection, including the factors considered in determining whether documents closely resemble one another. Also included are patents issued by the United States Patent Office, as well as patent applications that have not yet been granted.

Finding near-replica documents on the web
Stanford paper authored by N. Shivakumar and H. Garcia-Molina

"We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archivers and in the presentation of search results, among others."

Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content
Authored by Andrei Broder and Krishna Bharat at Compaq.

Hence, we define two hosts to be mirrors if:

A high percentage of paths (that is, the portions of the URL after the hostname) are valid on both web sites, and

These common paths link to documents that have similar content.

Therefore, hosts that replicate content but rename paths are not considered mirrors under our definition.
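The two-part definition above can be illustrated with a small sketch. This is not the authors' algorithm, only a minimal reading of the definition under stated assumptions: each host is represented as a hypothetical dictionary mapping URL paths to page text, "similar content" is approximated by word-set overlap, path sharing is measured against the smaller host, and both thresholds are invented for illustration.

```python
def word_jaccard(text_a, text_b):
    """Jaccard similarity of two texts' word sets (a stand-in similarity measure)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def are_mirrors(host_a, host_b, min_path_share=0.8, min_similarity=0.9):
    """host_a and host_b map URL paths to page text.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    common = set(host_a) & set(host_b)
    if not common:
        return False
    # Condition 1: a high percentage of paths are valid on both hosts.
    # Measuring against the smaller host is an assumption of this sketch.
    if len(common) / min(len(host_a), len(host_b)) < min_path_share:
        return False
    # Condition 2: the common paths lead to documents with similar content.
    average = sum(word_jaccard(host_a[p], host_b[p]) for p in common) / len(common)
    return average >= min_similarity


site = {
    "/": "welcome to the example archive",
    "/about": "about the archive team",
    "/docs": "read the full documentation here",
}
mirror = dict(site)  # same paths, same content
renamed = {"/home": site["/"], "/who": site["/about"], "/manual": site["/docs"]}

print(are_mirrors(site, mirror))   # same paths and content
print(are_mirrors(site, renamed))  # same content, renamed paths
```

The second call returns False because no paths are shared, which matches the closing remark of the quoted definition: hosts that replicate content under renamed paths are not counted as mirrors.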

U. S. Patents

*** Detecting duplicate and near-duplicate files ***
U. S. Patent with Google, Inc. as assignee; inventors William Pugh and Monika Henzinger

*** Detecting query-specific duplicate documents ***
This patent, applied for on Oct. 6, 2000 and granted to Google by the U. S. Patent Office on Sept. 2, 2003, uses query-relevant information for similarity comparisons, in some cases relying on snippets extracted from the documents rather than on the entire documents.
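To make the idea of comparing query-relevant snippets concrete, here is a rough sketch. It is not the patented method: the function names, the fixed word window around each query term, and the word-set overlap measure are all assumptions chosen for illustration.

```python
import re


def query_snippet(text, query, window=2):
    """Return the words within `window` positions of any query term.

    The fixed window is an assumption of this sketch, not taken from the patent.
    """
    words = re.findall(r"\w+", text.lower())
    terms = set(re.findall(r"\w+", query.lower()))
    keep = set()
    for i, word in enumerate(words):
        if word in terms:
            keep.update(range(max(0, i - window), min(len(words), i + window + 1)))
    return [words[i] for i in sorted(keep)]


def snippet_similarity(text_a, text_b, query, window=2):
    """Word-set overlap (Jaccard) between the query snippets of two documents."""
    a = set(query_snippet(text_a, query, window))
    b = set(query_snippet(text_b, query, window))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


doc_a = "Site A news portal. The new hybrid engine cuts fuel use by half. Copyright Site A."
doc_b = "Welcome to Site B. The new hybrid engine cuts fuel use by half. Contact Site B today."

print(snippet_similarity(doc_a, doc_b, "hybrid engine"))  # → 1.0
```

The two pages differ in their surrounding boilerplate, so a whole-document comparison would score them as only partly similar, yet their snippets for the query "hybrid engine" are identical. For that query they would be treated as duplicates in this sketch.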

Valuable reference:

An article on duplicate content by Dr. E. Garcia, entitled "Search Engine Patents On Duplicated Content and Re-Ranking Methods," which is a synopsis of the presentation he delivered at the 2003 Search Engine Strategies conference in San Jose, CA.

Method for clustering closely resembling data objects
Andrei Broder, et al.
Assignee: Digital Equipment Corporation
Date: Sept. 12, 2000

The method identifies elements that indicate close resemblance between documents; when two documents share more than a given number of these elements, they are estimated to be nearly identical.
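As an illustration of counting shared elements, the sketch below uses overlapping word sequences ("shingles") as the elements. The choice of shingles, the shingle length `k`, and the threshold value are assumptions made for this example and should not be read as the patent's actual parameters or procedure.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shared_elements(text_a, text_b, k=3):
    """Number of shingles that appear in both texts."""
    return len(shingles(text_a, k) & shingles(text_b, k))


def closely_resemble(text_a, text_b, k=3, threshold=5):
    """True when the texts share more than `threshold` shingles.

    `threshold` is a hypothetical value chosen for this example.
    """
    return shared_elements(text_a, text_b, k) > threshold


original = "the quick brown fox jumps over the lazy dog near the river bank"
edited = "the quick brown fox jumps over the lazy dog near the old mill"
unrelated = "stock markets closed higher today after a strong earnings report"

print(shared_elements(original, edited))      # → 9
print(closely_resemble(original, edited))     # → True
print(closely_resemble(original, unrelated))  # → False
```

Changing the last two words of the text removes only the shingles that touch those words, so the edited copy still shares most of its elements with the original.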

Interesting points made toward the end of the patent document are that the method can help detect plagiarized copies of original works, even when minor changes have been made to avoid detection, and that the frequency and degree of change can be tracked over time.

Technique for deleting duplicate records referenced in an index of a database
Author: Burrows, Assignee: Alta Vista Company - June, 2001
Note: The Alta Vista Search Engine now belongs to Yahoo, Inc.

Method for indexing duplicate database records using a full-record fingerprint

Method and apparatus for detecting and summarizing document similarity within large document sets

Method and apparatus for identifying the existence of differences between two files

Models and Algorithms for Duplicate Document Detection
Short paper authored by Daniel Lopresti at Lucent Technologies' Bell Labs.