page rank for a web search engin


when a search is made on the internet us a search engin there is
first a tradit text process part, where the aim is to find all
the web page contain the word of the queri.


due to the massiv size of the web, the number of hit is like to be
much too larg to be handl by the user. 


therefor, some measur of
qualiti is need to filter awai page that ar like to be
less interest.


when on us a web search engin
 it is typic that the  search phrase is  under-specifi. 


a googl http://www.googl.com/ search conduct on
septemb  29, 2005, us the search phrase univers,  gave
as a result link to the follow   well-known univers: 
harvard,   stanford,  cambridg, yale, cornel,  oxford. 


the
total number of web page relev to the search phrase wa more than
2 billion. 


 obvious googl us an  algorithm for rank all the web
 page that agre rather well with a common-sens qualiti measur. 

 somewhat surprisingli, the rank procedur is not base on human
 judgement, but on the link structur of the web.
 

 loos speak, googl assign a high rank to a web page, if it ha
 inlink from other page that have a high rank. 


we will see that thi
 ``self-referenc'' statement can be formul mathemat as an
 eigenvalu equat for a certain  matrix. 


pagerank


it is of cours imposs to defin a gener valid measur of
relev that would be accept for a major of user of a search
engin. 


googl us the concept of pagerank as a qualiti measur
of web page. 


 it is base on the assumpt that the number of link
to and from a page give inform about the import of a page. 


we
will give a descript of pagerank base primarili on 
and . 


concern googl, see .


let all web page be order from 1 to , and let  be a
particular web page. 


then    will denot the  set of page that
 is link to, the outlink.
 

the number of outlink is denot 
the set of inlink,
denot , ar the page that have an outlink to . 


in gener, a page  can be consid as more import the more
inlink it ha. 


howev, a rank system base onli
on the number of inlink is easi to manipul


 for an exampl of attempt to fool a search engin, see


 when you design a web page  that (e.g. for commerci reason)
you would like to be seen by as mani as possibl, you could simpli
creat a larg number of (inform-less and unimport) page that
have outlink to . 


in order to discourag thi, on  defin the
rank of  in such a wai that if a highli rank page ,
ha an outlink to , thi add to the import
of  in the follow wai: 

the rank of page  is a weight sum
of the rank of the page that have outlink to . 


the weight is
such that the rank of a page  is divid evenli among it outlink. 


the  preliminari definit of pagerank is


thi definit  is recurs, so pagerank cannot be comput
directli. 


instead a fix point iter might be us: guess an
initi rank vector . 


 then iter


there ar a few problem with thi iter: if a page ha no
outlink, then in the iter process it onli accumul rank via
it inlink, but thi rank is never distribut further. 


therefor it is not clear if  the iter converg. 


 we will
come back to thi question later. 


more insight is gain if we reformul  eq:pr1 as an
eigenvalu problem for a matrix repres the graph of the
internet. 


let  be a squar matrix of dimens . 


then defin


thi definit mean that row  ha nonzero element in those
posit that correspond to inlink of . 


similarli, column 
ha nonzero element equal to  in those posit that correspond
to the outlink of , and, provid that the page ha outlink, the
sum of all the element in column  is equal to on. 


 in the
follow symbol pictur of the matrix , non-zero element ar
denot :


the follow link graph
  illustr a set of web page with outlink
and inlink.


the correspond matrix becom


sinc page 4 ha no outlink,  the correspond column
is equal to zero. 


  obvious,  the definit
eq:pr1 is equival to  the scalar product of row  and the
vector , which hold the rank of all page. 


 we can write the
equat in matrix form: 
i.e.  is an eigenvector of  with eigenvalu . 
  

it is now easili seen that the iter eq:pow1 is equival
to 
which is   the power method   for comput
the eigenvector. 


howev, at thi point it is not clear that  pagerank
is well-defin, as we do not know if  there exist an eigenvalu
equal to 1. 


it turn out that the theori of markov chain is us in
the analysi. 


random walk and markov chain


there is a random walk
 

 interpret of the pagerank concept. 


assum that a surfer visit
 a web page, choos the next page among the outlink with equal
 probabl. then the random walk induc a markov chain, see e.g.


   a
 markov chain is a random process, where the next state is determin
 complet from the present state; the process ha no memori


the  transit matrix of the markov chain is  . 
  

(note that we us a slightli differ notat than is
 common  in  the theori of stochast process).


the random surfer should never get stuck. 


in other word, our random
walk model should have no page without outlink (such a page
correspond to a zero column in ). 


therefor the model is modifi
so that zero column ar replac by a constant valu in all
posit. 


thi mean  that there is equal probabl to go to ani
other page in the net. 


defin the vector
for , and


 the modifi matrix is defin 


with thi modif the matrix  is a proper column-stochast
matrix:
  

 it ha non-neg element and the element of each column
sum up to 1. 


the preced statement can be reformul as follow.


a column-stochast matrix  satisfi 
where  is defin by eq:edef.


the matrix in the previou exampl is modifi to


now, in analog to eq:preig, we would like to defin the
pagerank vector as a uniqu eigenvector of  with eigenvalu 1, 


howev, the exist of such a uniqu eigenvalu is still not
guarante. 


the eigenvector of the transit matrix correspond to a stationari
probabl distribut for the markov chain: 


the element in posit
, , is the probabl that after a larg number of step, the
random walker is at web page . 


to ensur the exist
of a  uniqu distribut, the matrix must be irreduc,
cf. .  


a squar matrix  is call 
      reduc 
if there is a permut matrix  such that 
where  and  ar both squar. 


otherwis the matrix is call 
  irreduc., 
  

to illustr the concept of reduc,  we give an exampl of a
link graph that correspond to a reduc matrix. 


a random walker who ha enter the left part of the link graph will
never get out of it, and similarli  he will get stuck in  the right
part. 


the correspond matrix is 
which is of the form eq:reduc.


actual, thi matrix ha two eigenvalu equal to 1, and on equal to
-1, see exampl  below.


the  direct graph correspond to an irreduc matrix
is strongli connect: given
ani two node , in the graph, there  exist a path
lead from  to .


the uniqu of the largest eigenvalu of an irreduc matrix is
guarante by the 
perron-frobeniu theorem;
  

 we state it for the special case treat
here. 


the inequ   is understood as all the element of 
be strictli posit. 


let  be an irreduc column-stochast matrix. 


 the
largest eigenvalu in magnitud is equal to 1. 


there is a uniqu
correspond eigenvector  satisfi , and ;
thi is the onli eigenvector that is non-neg. 


if , then 
.


due to the fact that  is column-stochast we have ,
which mean that 1 is an eigenvalu of . 


the rest of the statement
can be prove us perron-frobeniu theori 
. 


given the size of the internet, we can be sure that the
link matrix  is reduc, which mean that the pagerank eigenvector  of
 is not well-defin. 


to ensur irreduc, i.e. to make it imposs for the
random walker to get trap in a subgraph, on add, artifici,  a
link from everi web page to all the other. 


in matrix term, thi can
be made by take a convex combin of   and a rank on matrix,
for some  satisfi . 


it is easi to see
that  the matrix  is   column-stochast:


 the random walk interpret of the addit rank on
term is that in each time step the surfer visit a page will jump to
a random page with probabl 
(sometim refer to as teleport).  


 we now see that  the pagerank vector for the matrix  is
well-defin. 


  the column-stochast matrix  defin in eq:a is
  irreduc (sinc ) and ha the largest in magnitud
  eigenvalu . 


the correspond eigenvector  satisfi
  . 


for the converg of the numer eigenvalu algorithm, it is
essenti to know, how the eigenvalu of  ar chang by the rank
on modif eq:a. 


[]

assum that the eigenvalu of the column-stochast matrix  ar
. 


then the eigenvalu of 
 ar 
.


defin  to be 
normal to euclidean length 1, and let 
be such that  is
orthogon. 


then, sinc  
where , and . 


sinc we have
made a similar transform, the matrix  ha the eigenvalu 
. 


we further have 
 

therefor,


the statement now follow immedi.


theorem  impli  that even if  ha a multipl
eigenvalu equal to 1, 
which is actual the case for the googl matrix, the second largest
eigenvalu in magnitud of  is alwai equal to . 


we comput the eigenvalu and eigenvector of the matrix 
, with  from
eq:preduc and  . 


the matlab code
give the follow result:


it is seen that all other eigenvector except the first on (which
correspond to the eigenvalu 1),
have both posit and neg compon, as state in theorem
. 


instead of the modif  eq:a we can defin 
where  is a  non-neg 
vector  with   that can be chosen in order to make the
search bias toward certain 
kind of web page. 


therefor, it is often refer to as a 
person vector . 


the vector  can also be us for avoid manipul by so call
link farm . 


the power method for pagerank comput


we want to solv the
eigenvalu problem
where  is normal . 


in thi section we denot
the sought eigenvector by 
. 


in deal with stochast
matric and vector that ar probabl distribut, it is natur
to us the 1-norm for vector (section 
). 


 due to the sparsiti and the dimens of 
(estim to be of the order billion), it
is out of the question to comput the eigenvector us ani of the
standard method for dens matric that ar base on appli
orthogon transform to the matrix describ in chapter
. 


the onli viabl method so 
far is the power method. 


assum that an initi approxim  is given. 


the power
method is given in the follow algorithm.


the purpos of normal the vector (make it have 1-norm
equal to 1) is to avoid that the vector becom either veri larg of
veri small, and thu unrepresent in the float point system. 


 we
will see later that normal is not necessari in the pagerank
comput. 


in thi context there is no need to comput an eigenvalu
approxim, as the eigenvalu sought for is known to be equal to
on. 


the converg of the power method depend on the distribut of
eigenvalu. 


in order to make the present simpler, we assum that
 is diagonaliz, i.e


. 


there exist a 
nonsingular matrix  of eigenvector, 
. 


the eigenvalu  ar order 


  expand the initi approxim  in term of
the eigenvector,
where  is assum thi assumpt can be expect
to be satisfi in float point arithmet, if not at the first
iter, so after the second, due to round off..


 then we have 


obvious, sinc for  we have 
, the second term tend to zero and the power method converg to
the eigenvector . 


the rate of converg is determin by the
ratio . 


if thi ratio is close to 1,
then the iter is veri slow. 


fortun thi is not the case for
the googl matrix, see theorem  and below. 


a stop criterion for the power iter can be formul in
term of the residu vector for the eigenvalu problem. 


let 
 be the comput approxim of the eigenvalu, and 
the correspond approxim eigenvector. 


then it can be shown
 that the optim error matrix , for which
exactli, satisfi
where . 


thi mean that if the residu  is small, then the
comput approxim 
eigenvector  is the exact eigenvector of a matrix  that
is close to . 


in the  case of pagerank
comput it is natur to us the 1-norm instead ,
which doe not make much differ, sinc the norm ar equival
eq:norm-eq. 


in view of the huge dimens of the googl matrix, it is
non-trivial to comput the matrix-vector product , where 
. 


first, we
 show that normal of the vector produc in the power 
iter is unnecessari. 


 assum that  the vector  satisfi , and
 that the matrix  is is column-stochast. 


then 


 put . 


then 
sinc    is column-stochast ().


then recal that  wa construct from the actual link matrix 
as 
where the row vector  ha an element 1 in all those posit that
correspond to web page with no outlink, see eq:p. 


thi
mean that to form , we insert a larg number of full vector in
, each of the same dimens as the total number of web
page. 


consequ, we cannot afford to store 
explicitli. 


 let us look at the multipl  in
some more detail:


where 


howev, we do not need to comput  from thi equat. 


instead
we can us eq:normi in combin with eq:yq:


thu, we have   an extra bonu is
that we do not us the vector  at all, i.e
 we  need  not know which
page lack outlink. 


the follow matlab code implement the matrix vector multipl.


here  or a person teleport vector, see
p. 


in order to save memori, we even
should avoid us the extra vector yhat and replac it by 
y. 


from theorem  we know that  the
second eigenvalu of the googl matrix satisfi . 


a typic valu of
 is 0.8


  approxim   iter ar need
to make the factor  equal to . 


thi is report
  to be
close the number of iter us by googl. 


as an exampl we us the matrix  obtain from the domain 
stanford


the number of
page is  281903, and the total number of link is 2312497. 


part  of the
matrix is displai in figur .


we comput the page rank vector us the
power method with  and iter 63 time until the 1-norm
of the residu wa smaller than . 


the residu and the final 
pagerank vector ar illustr in figur . 


if on comput the pagerank for a subset of the internet, on
particular domain, sai, then the matrix  mai be of a dimens for
which on can us other method than the power method. 


in such case
it mai be suffici to us the matlab function , which
comput a small number of eigenvalu and the correspond
eigenvector of a spars matrix us a variant of the arnoldi (or
lanczo) method, cf. 


section . 


  in view of the fact that on pagerank calcul can
take sever dai, sever enhanc of the iter procedur
have been propos. 


in  an adapt method is describ
that check the converg of the compon of the pagerank vector
and avoid perform the power iter for those compon. 


up to
30  speed up ha been report. 


the block structur of the web is
us in , and speedup of a factor 2 have been
report. 


an acceler method base on aitken extrapol
  is describ in .


aggreg method ar discuss in sever paper
by langvil and meyer and in , and   the
arnoldi method  in . 


a variant of pagerank is
propos in . 


further properti of the
pagerank matrix ar given in . 


hit


anoth method base on the link structur of the web wa introduc
at the same time as pagerank
. 


it is call hit (hypertext induc
topic search), and is base on the concept of author and
hub. 


an author is a web page  with sever inlink and a hub
ha sever outlink. 


the basic idea is: good hub point to good
author and good author ar point to by good hub. 


each web
page is assign both a hub score   and an author score . 


let  be the adjac matrix
   of the direct web graph. 


then two
equat ar given that mathemat  defin the relat between
the two score, base on the basic idea:


the algorithm for comput the score is the power method, which
converg to the left and right singular vector correspond to the
largest singular valu of . 


in the implement of hit it is not
the adjac matrix of the whole web that is us, but of all the
page relev to the queri. 


there is now an extens literatur on pagerank, hit and other rank
method. 


for overview, see . 


a combin of hit and pagerank ha been propos in
.


obvious the idea underli pagerank and hit ar not restrict to
web applic, but can be appli to other network analys. 


for
instanc, a variant of the hit method wa recent us in a studi of
suprem court preced . 


 a gener of hit is
given in , which also treat synonym extract.