Duplicate Content: Definitions and Detecting Duplicates
This portion of the collection consists of academic papers
based on research related to defining and detecting duplicate content
on the internet, including those papers that are relevant to or
authored by principals and/or staff at Google.
Methodologies of detection are discussed, including the factors
taken into consideration for determining whether documents closely
resemble one another. Also included are patents issued by the United
States Patent and Trademark Office, as well as patent applications
that have not yet been granted.
Finding near-replica documents on the web
Stanford paper authored by N. Shivakumar and H. Garcia-Molina
"We consider how to efficiently compute the overlap between
all pairs of web documents. This information can be used to improve
web crawlers, web archivers and in the presentation of search
results, among others."
Mirror, Mirror on the Web: A Study of Host Pairs with Replicated Content
Authored by Andrei Broder and Krishna Bharat at Compaq.
Hence, we define two hosts to be mirrors if:
- A high percentage of paths (that is, the portions of the
URL after the hostname) are valid on both web sites, and
- These common paths link to documents that have similar
content.
Therefore, hosts that replicate content but rename paths are
not considered mirrors under our definition.
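The two conditions in this definition can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word-level Jaccard comparison, and all three thresholds are assumptions chosen only for illustration, and each host is represented as a hypothetical dictionary mapping URL paths to page text.

```python
def word_jaccard(text_a, text_b):
    """Jaccard similarity of the word sets of two texts (1.0 if both are empty)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_like_mirror(host_a, host_b,
                      min_path_overlap=0.8,
                      min_page_similarity=0.9,
                      min_similar_fraction=0.9):
    """Return True if two hosts satisfy both conditions of the definition.

    host_a and host_b map URL paths to page text. All thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    shared = set(host_a) & set(host_b)
    if not shared:
        return False
    all_paths = set(host_a) | set(host_b)
    # Condition 1: a high percentage of paths are valid on both hosts.
    if len(shared) / len(all_paths) < min_path_overlap:
        return False
    # Condition 2: the common paths lead to documents with similar content.
    similar = sum(
        1 for path in shared
        if word_jaccard(host_a[path], host_b[path]) >= min_page_similarity
    )
    return similar / len(shared) >= min_similar_fraction


site = {"/index.html": "welcome to the docs", "/faq.html": "frequently asked questions"}
copy = dict(site)
renamed = {"/home.html": "welcome to the docs", "/questions.html": "frequently asked questions"}

print(looks_like_mirror(site, copy))     # same paths, same content
print(looks_like_mirror(site, renamed))  # same content, renamed paths
```

In this sketch the renamed host fails the first condition because it shares no paths with the original, which matches the remark that hosts replicating content under renamed paths are not counted as mirrors.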
U. S. Patents
*** Detecting duplicate and near-duplicate files ***
U. S. Patent with Google, Inc. as assignee; the inventors are
William Pugh and Monika Henzinger.
*** Detecting query-specific duplicate documents ***
This patent, applied for on Oct. 6, 2000 and granted to Google on
Sept. 2, 2003 by the U. S. Patent Office, uses query-relevant
information for similarity comparisons, in some cases relying on
snippets extracted from the documents rather than on the entire
documents.
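To make the snippet idea concrete, here is a rough sketch of comparing only the query-relevant parts of two documents. It is not the patented method: the `snippet` and `query_specific_duplicates` helpers, the fixed word window, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
def snippet(text, query, window=3):
    """Return the words within `window` positions of any query term."""
    words = text.lower().split()
    terms = set(query.lower().split())
    keep = set()
    for i, word in enumerate(words):
        if word in terms:
            keep.update(range(max(0, i - window), min(len(words), i + window + 1)))
    return [words[i] for i in sorted(keep)]


def query_specific_duplicates(doc_a, doc_b, query, threshold=0.8, window=3):
    """Compare two documents using only their query-relevant snippets.

    The threshold and window size are assumptions for this sketch.
    """
    a = set(snippet(doc_a, query, window))
    b = set(snippet(doc_b, query, window))
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold


doc_a = ("site one header menu links cheap flights to paris "
         "from london every day footer one copyright notice")
doc_b = ("another portal banner ads here cheap flights to paris "
         "from london every day contact us page info")

print(query_specific_duplicates(doc_a, doc_b, "paris"))
```

The two sample pages differ in their surrounding text, yet for the query "paris" their snippets are the same, so under these assumptions they would be treated as duplicates for that query only.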
Valuable reference:
An article by Dr. E. Garcia on duplicate
content, entitled "Search Engine Patents On Duplicated
Content and Re-Ranking Methods," which is a synopsis of
the presentation he delivered at the 2003 Search Engine Strategies
conference in San Jose, CA.
Method for clustering closely resembling data objects
Andrei Broder et al.
Assignee: Digital Equipment Corporation
Date: Sept. 12, 2000
The method determines identifiable elements that indicate close
resemblance between documents, and when more than a given number
of elements are found that are shared, the documents are estimated
to be close to identical.
Interesting points made toward the end of the patent document are
that the method can help to detect plagiarized copies of original
works, even if minor changes have been made to avoid detection,
and that frequency and degree of change can be tracked over periods
of time.
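The shared-elements test described for this patent can be sketched in a few lines. This is an illustration only, not the patented procedure: it assumes the "identifiable elements" are overlapping word sequences (often called shingles), and the shingle length `k` and the `min_shared` cutoff are arbitrary values picked for the example.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in a text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def shared_elements(doc_a, doc_b, k=3):
    """Count the elements (here, shingles) that both documents contain."""
    return len(shingles(doc_a, k) & shingles(doc_b, k))


def near_identical(doc_a, doc_b, min_shared=3, k=3):
    """Estimate near-identity when more than `min_shared` elements are shared.

    min_shared and k are illustrative assumptions.
    """
    return shared_elements(doc_a, doc_b, k) > min_shared


original = "the quick brown fox jumps over the lazy dog near the river bank"
edited = "the quick brown fox jumps over the lazy dog near the old mill"

print(shared_elements(original, edited))  # 9
print(near_identical(original, edited))   # True
```

Because only the last two words differ, most shingles survive the edit, which suggests why a lightly altered copy can still be flagged under this kind of test.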
Technique for deleting duplicate records referenced in an index of a database
Author: Burrows, Assignee: Alta Vista Company - June, 2001
Note: The Alta Vista Search Engine now belongs to Yahoo,
Inc.
Method for indexing duplicate database records using a full-record fingerprint
Method and apparatus for detecting and summarizing document similarity
within large document sets
Method and apparatus for identifying the existence of differences between
two files
Models and Algorithms for Duplicate Document Detection
Short paper authored by Daniel Lopresti at Lucent Technologies'
Bell Labs.