Detecting near-replicas on the Web by content and
hyperlink analysis


Ernesto Di Iorio
Dipartimento di Ingegneria dell'Informazione
Università di Siena, Italy
Phone: +39 0577 233606
email: [email protected]

Michelangelo Diligenti
Dipartimento di Ingegneria dell'Informazione
Università di Siena, Italy
Phone: +39 0577 233606
email: [email protected]

Marco Gori
Dipartimento di Ingegneria dell'Informazione
Università di Siena, Italy
Phone: +39 0577 233610
email: [email protected]

Marco Maggini
Dipartimento di Ingegneria dell'Informazione
Università di Siena, Italy
Phone: +39 0577 233696
email: [email protected]

Augusto Pucci
Dipartimento di Ingegneria dell'Informazione
Università di Siena, Italy
Phone: +39 0577 233606
email: [email protected]

ABSTRACT

The presence of replicas or near-replicas of documents is very common on the Web. Whilst replication can improve information accessibility for users, the presence of near-replicas can hinder the effectiveness of search engines. We propose a method to detect similar pages, in particular replicas and near-replicas, which is based on a pair of signatures. The first signature is obtained by a random projection of the bag-of-words representation of the page contents. The second signature, referred to as Hyperlink Map, is computed by a recursive equation which exploits the connectivity among the Web pages to encode the page context. The experimental results show that on the given dataset replicas and near-replicas can be detected with a precision and recall of 93%.

Keywords

Web page near-replicas, duplicate detection, Random Projections, Hyperlink analysis

1. INTRODUCTION

The success of a search engine mainly depends on its capability to provide satisfactory answers to the users' queries in the first positions of the result set. In order to reduce the effect of information flooding, many criteria have been used to define a ranking among the returned documents. However, another important issue is the redundancy of the information on the Web. Documents on the Web can be replicated many times on different hosts, or the same document can be accessed by different URLs because of aliases. Partial or complete mirrors of sites are quite common to ease the access to popular resources and to distribute the network and server load. Replicas and near-replicas waste storage in the search engine indexes, reduce the quality of the query results by replicating the same information, and can also hinder the effectiveness of ranking techniques based on link analysis.

In [1] different techniques to detect mirror sites are analyzed and compared. The proposed algorithms do not consider the page contents, but are based on features such as the IP addresses of the sites, the hostnames, the structure of the path in the URL and the similarities in the connectivity to external hosts for the candidate mirrors. A different approach based on hash signatures of the page contents is presented in [3]. In this paper we propose a technique for finding lists of similar documents, and in particular replicas and near-replicas, based on a pair of signatures which take into account both the document contents and the hyperlink structure.

2. The Document Signatures

Each document is mapped to a pair of signatures. The Random Projection (RP) signature maps the document contents to a low-dimensional real-valued vector, preserving the notion of similarity in the original space (i.e. near representations are mapped to near points in the projected space). The Hyperlink Map (HM) signature is then obtained by propagating the RP signatures through the hyperlinks between the pages, providing a method to distinguish pages also on the basis of the pages they link to. Appropriate index structures for multidimensional data (e.g. R-trees) can be used to reduce the cost of $n$-nearest-neighbor and range searches using the RP and HM signatures, as sketched below.
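As an illustration of this step (not part of the original system), the following Python sketch uses SciPy's KD-tree as a stand-in for the R-trees mentioned above; the number of pages, the signature dimension $K=5$ and the query radius are assumptions chosen only for the example.

\begin{verbatim}
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical setting: 1000 pages with K = 5 dimensional signatures in [0, 1].
signatures = np.random.rand(1000, 5)
tree = cKDTree(signatures)

# n-nearest-neighbor query: the 5 pages with signatures closest to page 0.
distances, neighbors = tree.query(signatures[0], k=5)

# Range query: all pages whose signature differs from that of page 0 by
# less than 0.01 in every component (an L-infinity ball of radius 0.01).
candidates = tree.query_ball_point(signatures[0], r=0.01, p=np.inf)
\end{verbatim}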

2.1 The Random Projection Signature

Recently, Random Projection has been proposed as a technique for dimensionality reduction [4]. A random projection from $N$ to $K$ dimensions is obtained by a $K \times N$ projection matrix $\Pi$ whose entries $\pi_{ij}$ may be initialized using a uniform random distribution in [-1,+1]. The Random Projection (RP) signature $R^p$ of the page $p$ in the repository is computed from the random projection $\hat{R}^p = \Pi D^p$ of the bag-of-words representation $D^p$ of $p$. Given the vector $\hat{R}^m$ such that $\hat{R}^m_j = \min_p \hat{R}^p_j$, $j=1, \ldots, K$, the RP signature is obtained by normalizing the vectors $\hat{R}^p-\hat{R}^m$ so that each component of the signature vectors lies in the interval [0,1].
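A minimal Python/NumPy sketch of this computation is given below; it is not the authors' code, and the projection dimension $K=5$, the random seed and the normalization by a single global maximum are assumptions (the paper does not specify how the normalization to [0,1] is carried out).

\begin{verbatim}
import numpy as np

def rp_signatures(D, K=5, seed=0):
    """Compute Random Projection (RP) signatures.

    D : (N, P) array whose columns are the bag-of-words vectors D^p
        of the P pages over a dictionary of N terms.
    K : dimension of the projected space (K << N).

    Returns a (K, P) array whose columns are the RP signatures R^p,
    with every component in [0, 1].
    """
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    # Projection matrix Pi with entries drawn uniformly in [-1, +1].
    Pi = rng.uniform(-1.0, 1.0, size=(K, N))
    R_hat = Pi @ D                             # \hat{R}^p = Pi D^p
    R_min = R_hat.min(axis=1, keepdims=True)   # \hat{R}^m, component-wise minimum
    shifted = R_hat - R_min
    # Normalization to [0, 1]: a single global maximum is used here;
    # the paper does not specify global versus per-component scaling.
    return shifted / shifted.max()
\end{verbatim}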

In the experimental evaluation we investigated the effect of the dimension $K$ showing that good accuracy can be obtained using very low values for $K$. For the dataset used in the experiments we found that no substantial improvements can be achieved by choosing $K>5$.

2.2 Hyperlink Map Signature

In order to encode the information related to the context of the page in the hyperlinked environment, we can define an iterative equation which combines the signatures of the neighboring pages, using a scheme similar to the computation of PageRank [2]. A simple implementation of this scheme is given by the following system of equations


\begin{displaymath}
x^p_j(t+1) = r^p_j + \gamma \cdot \sum_{q\in ch(p)} x^q_j(t)\cdot r^q_j,
\qquad j=1,\ldots,K \qquad\qquad (1)
\end{displaymath}

where $x^p_j(t)$ is the $j$-th component of the signature for the page $p$, $r^p_j$ is the $j$-th component of the RP signature for the page $p$, $\gamma$ is a damping factor which modulates the effect of the signature propagation from the pages linked by page $p$, and $t$ is the iteration step. At the first iteration $x^p_j(0)=r^p_j$. The vector $X^p(t)=[x^p_1(t),\ldots,x^p_K(t)]^\prime$ represents the Hyperlink Map (HM) signature of page $p$ at iteration $t$. Finally, in order to obtain values in the interval [0,1], the signature vectors $X^p(t)$ are multiplied by $\frac{1}{x_M}$, where $x_M = \max_{j=1,\ldots,K;\; p=1,\ldots,N} x^p_j(t)$.

At each iteration, equation (1) propagates the HM signatures from each node to its parents, combining both the page contents, encoded by the $r^q_j$ values, and the structure of the page neighborhood. In order to include the contribution of the descendants which are $s$ links away from a given page, the equation should be iterated at least $s$ times. Significant signatures can be computed using only a few iterations (3-5 iterations in the experiments).
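A direct Python/NumPy implementation of the iteration in equation (1) could look as follows; the adjacency representation (a list of linked-page indices $ch(p)$ per page), the value of the damping factor and the number of iterations are illustrative assumptions.

\begin{verbatim}
import numpy as np

def hm_signatures(R, children, gamma=0.85, iterations=5):
    """Iterate equation (1) to compute Hyperlink Map (HM) signatures.

    R        : (K, P) array of RP signatures, column p is r^p.
    children : list of length P; children[p] contains the indices of
               the pages linked by page p, i.e. ch(p).
    gamma    : damping factor modulating the propagated contribution
               (its value is not given in the paper).

    Returns a (K, P) array of HM signatures X^p scaled to [0, 1].
    """
    X = R.copy()                           # x^p_j(0) = r^p_j
    for _ in range(iterations):
        X_new = R.copy()                   # the r^p_j term of equation (1)
        for p, ch in enumerate(children):
            for q in ch:
                # Contribution of the linked page q, weighted by its
                # own RP components r^q_j.
                X_new[:, p] += gamma * X[:, q] * R[:, q]
        X = X_new
    return X / X.max()                     # multiply by 1 / x_M
\end{verbatim}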

3. Experimental Results

We collected about $300,000$ documents by crawling the Web. The bag-of-words representation of each page was based on a dictionary containing about 2 million words. In the experiments, two documents $d^1, d^2$, having RP signatures $R^1$ and $R^2$ and HM signatures $X^1$ and $X^2$, were considered to collide with $(C,D)$ tolerance iff $\vert\vert X^1-X^2\vert\vert _\infty < C$ and $\vert\vert R^1-R^2\vert\vert _\infty < D$. Thus two documents collide if their HM and RP signatures lie within hypercubes having side lengths smaller than $C$ and $D$, respectively.
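The collision test translates directly into a check on the infinity norm of the signature differences, as in the following sketch (illustrative, not the authors' code):

\begin{verbatim}
import numpy as np

def collide(X1, X2, R1, R2, C, D):
    """True iff two pages collide with (C, D) tolerance: the HM
    signatures differ by less than C and the RP signatures by less
    than D in every component (infinity norm)."""
    return (np.max(np.abs(X1 - X2)) < C) and (np.max(np.abs(R1 - R2)) < D)
\end{verbatim}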

Figure 1: Precision-Recall plots. The different curves correspond to different settings of the $C$ and $D$ thresholds. For each curve the value of one threshold is fixed while the other one varies along the curve.
[Figure omitted: precision-recall curves]

We measured the accuracy of the algorithm in terms of recall and precision. The recall is defined as the number of detected (near-)replicas divided by the total number of (near-)replicas present in the dataset. Since it is not known a priori which documents are replicated in the dataset, we manually sampled 100 pages having at least one replica or near-replica and used this set as a reference. Thus, the recall was computed as the fraction of the documents in this set that were returned by the algorithm.

The precision is defined as the number of returned documents that have a (near-)replica in the dataset divided by the total number of documents returned by the algorithm. The precision was evaluated by checking all the lists of colliding documents returned by the algorithm using the cosine correlation of the bag-of-words representations. This check is slow but highly accurate in evaluating document similarity.
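For illustration, the cosine correlation of two bag-of-words vectors can be computed as follows; the example vectors and the 0.95 acceptance threshold are assumptions, not values reported in the paper.

\begin{verbatim}
import numpy as np

def cosine_correlation(d1, d2):
    """Cosine correlation of two bag-of-words vectors (dense arrays)."""
    return float(d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

# Example: two nearly identical term-frequency vectors score close to 1.
d1 = np.array([3.0, 0.0, 1.0, 2.0])
d2 = np.array([3.0, 0.0, 1.0, 1.0])
is_true_replica = cosine_correlation(d1, d2) > 0.95   # threshold chosen for illustration
\end{verbatim}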

The curves in Figure 1 show how the precision and the recall vary with respect to the choice of the two thresholds $C$ and $D$. The three curves depending on $C$ show the improvement provided by the use of the HM signature for three different values of $D$. The fourth curve ($C=1$) reports the behavior of the algorithm when using only the RP signature. For all values of $D$, reducing the threshold $C$ increases the precision while reducing the recall. In all cases, for appropriate values of the threshold $C$, the precision is significantly improved with a negligible reduction in the recall. This motivates the combined use of the RP and HM signatures. The break-even point is 90% for the RP signature alone and 93% for the combination of the RP and HM signatures.

An interesting case is that of pages related to Web forums or repositories of newsgroup messages. These sites show a high redundancy because they allow different views of the same contents. The following is an example of two URLs which refer to the same message:

ask.slashdot.org/askslashdot/02/07/16/1727256.shtml?tid=127
developers.slashdot.org/article.pl?sid=02/07/16/1727256

Another example detected in the dataset consists of a list of messages ordered by date, by subject, by thread or by author. Even though the pages differ in their internal organization and in some small details, the four pages have essentially the same informative content:

mail.gnu.org/pipermail/libtool/1999-January/date.html
mail.gnu.org/pipermail/libtool/1999-January/subject.html
mail.gnu.org/pipermail/libtool/1999-January/thread.html
mail.gnu.org/pipermail/libtool/1999-January/author.html

4. Conclusions

We proposed the use of two low-dimensional signatures to effectively detect replicas and near-replicas among a set of Web pages. The two signatures encode the page contents and the page hypertextual context, respectively. In particular, by using the HM signature documents are compared not only on the basis of their contents but also on the basis of their hypertextual context. The experimental results show that the use of both signatures can yield high recall and precision in finding replicas and near-replicas. This approach can be extended to broader definitions of similarity between pages than replication or near-replication.

5. REFERENCES

  1. K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society for Information Science, 51(12):1114-1122, 2000.
  2. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th World Wide Web Conference (WWW7), 14-18 Apr. 1998.
  3. J. Cho, N. Shivakumar, and H. Garcia-Molina. Finding replicated Web collections. In Proceedings of the International Conference on Management of Data (SIGMOD), pages 355-366, 2000.
  4. S. Vempala and R. I. Arriaga. An algorithmic theory of learning: robust concepts and random projection. In 40th Annual Symposium on Foundations of Computer Science, pages 616-623, 1999.
