
An article explaining similarity, written with great care


Article 2 of the series  Information Retrieval Tutorials

For n dimensions, we just need to keep adding product terms to Equations 1 and 2. It cannot get any easier than this.
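As a quick illustration, here is a minimal sketch in Python of the dot product generalized to n dimensions (the function name and sample numbers are illustrative only):

# Dot product of two n-dimensional vectors: the sum of the pairwise products.
def dot_product(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(x * y for x, y in zip(a, b))

# Two dimensions ...
print(dot_product([1, 2], [3, 4]))              # 1*3 + 2*4 = 11
# ... or any number of dimensions: just keep adding product terms.
print(dot_product([1, 2, 3, 4], [5, 6, 7, 8]))  # 70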

When C says: "How far are you from me, A and B?"

To define a straight line we need at least two points. So, if we draw a straight line from C to either A or B, we can define the distance, d, between the points. This is the so-called Euclidean Distance, which can be computed in four easy steps. For any two points defining a straight line:

1. take the difference between the coordinates of the points
2. square all differences
3. add all squared differences
4. take the square root of the final result

That's all.
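The four steps translate directly into code. A minimal sketch in Python (the function name and sample points are illustrative only):

import math

# Euclidean distance between two points given as coordinate sequences.
def euclidean_distance(p, q):
    differences = [pi - qi for pi, qi in zip(p, q)]    # 1. take the differences
    squared = [d ** 2 for d in differences]            # 2. square all differences
    total = sum(squared)                               # 3. add all squared differences
    return math.sqrt(total)                            # 4. square root the final result

print(euclidean_distance((1, 1), (4, 5)))  # 5.0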

Since we have defined x0 = 0 and y0 = 0, to find out how far A is from C we write the Euclidean Distance as

Equation 3: d_AC = ((x1 - x0)^2 + (y1 - y0)^2)^(1/2) = (x1^2 + y1^2)^(1/2)

Similarly, to find out how far B is from C we write

Equation 4: d_BC = ((x2 - x0)^2 + (y2 - y0)^2)^(1/2) = (x2^2 + y2^2)^(1/2). Figure 2 provides a good description of Euclidean Distances.

Figure 2. Straight lines representing Euclidean Distances between points A and B, with point C.

Meet my Uncle Vector

The straight lines shown in Figure 2 can be replaced by vectors, represented as arrows. A vector is a quantity with direction and magnitude. The head and angle of the arrow indicate the direction of the vector, while its magnitude is defined by the usual Euclidean Distance. Since in our example x0 = 0 and y0 = 0, we can simplify and express the magnitudes of the A and B vectors as d_AC = |A| and d_BC = |B|. The pipe symbol is used to indicate that we are dealing with absolute magnitudes. Thus, the length of the arrows represents the magnitudes of the vectors and the angle described by the vectors represents their orientation in the two-dimensional space. This is illustrated in Figure 3.

Figure 3. A and B Vectors.
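Since C sits at the origin, the magnitude of each vector is simply its Euclidean distance from (0, 0). A tiny Python sketch, with placeholder coordinates:

import math

# Placeholder coordinates for points A(x1, y1) and B(x2, y2).
x1, y1 = 1.0, 3.0
x2, y2 = 2.0, 5.0

magnitude_A = math.sqrt(x1 ** 2 + y1 ** 2)   # |A| = d_AC
magnitude_B = math.sqrt(x2 ** 2 + y2 ** 2)   # |B| = d_BC

print(magnitude_A, magnitude_B)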

Meet my COSIM: Document Length Normalization

To normalize the A·B DOT Product we divide it by the product of the magnitudes of A and B; i.e., A·B/(|A||B|). This ratio defines the cosine angle between the vectors, with values between 0 and 1.
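In code, the normalized DOT Product needs only three ingredients: the dot product and the two magnitudes. A minimal Python sketch (names and numbers are illustrative):

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))             # A·B
    mag_a = math.sqrt(sum(x ** 2 for x in a))          # |A|
    mag_b = math.sqrt(sum(y ** 2 for y in b))          # |B|
    return dot / (mag_a * mag_b)                       # A·B / (|A||B|)

print(cosine_similarity([1, 3], [2, 5]))  # a value between 0 and 1 for non-negative weights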

In information retrieval applications this ratio is used to normalize for document length, since long documents tend to have large term frequencies. A more advanced document normalization method, known as Pivoted Unique Normalization, is described in MySQL Internals Manual :: 4.7 Full-text Search. This method is based on Singhal, Buckley, and Mitra's earlier method, known as Pivoted Document Length Normalization (6 - 8).

Let's go back to the normalized DOT Product (cosine angle). This ratio is also used as a similarity measure between any two vectors representing documents, queries, snippets, or a combination of these. The expressions cosine similarity, Sim(A, B), or COSIM are commonly used.

It is now time to meet my COSIM:

Figure 4. The cosine angle between A and B.

As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases.

This is a convenient way of ranking documents; i.e., by measuring how close their vectors are to a query vector. For instance, let's say that point A(x1, y1) represents a query and points B(x2, y2), D(x3, y3), E(x4, y4), F(x5, y5), etc. represent documents. We should be able to compute the cosine angle between A (the query) and each document and sort these in decreasing order of cosine angles (cosine similarities). This treatment can be extended to an entire collection of documents.
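A Python sketch of that ranking idea, with made-up coordinates standing in for the query and document points:

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = (1, 2)                       # point A: the query
documents = {                        # points B, D, E, F: the documents
    "B": (2, 3),
    "D": (4, 1),
    "E": (0, 5),
    "F": (3, 3),
}

# Sort documents in decreasing order of cosine similarity to the query.
ranking = sorted(documents.items(),
                 key=lambda item: cosine_similarity(query, item[1]),
                 reverse=True)
for name, coords in ranking:
    print(name, round(cosine_similarity(query, coords), 4))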

To do this we need to construct a term space. The term space is defined by a list (index) of terms. These terms are extracted from the collection of documents to be queried. The coordinates of the points representing documents and queries are defined according to the weighting scheme used.

If weights are defined as mere term counts ( w = tf ) then point coordinates are given by term frequencies; however, we don't have to define term weights in this manner. As a matter of fact, and as previously mentioned, most commercial search engines do not define term weights in this way, not even in terms of keyword density values.
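To make the term space concrete, here is a small Python sketch that builds raw term-count vectors (w = tf) over a toy index; the terms and texts are invented for illustration only:

# Hypothetical index of terms (one axis per term) and toy texts.
index_terms = ["holiday", "season", "travel"]

query = "holiday season"
documents = ["holiday travel during the holiday season",
             "travel season"]

def tf_vector(text, terms):
    words = text.lower().split()
    # Coordinate along each axis = raw term count (w = tf).
    return [words.count(t) for t in terms]

print(tf_vector(query, index_terms))        # [1, 1, 0]
for d in documents:
    print(tf_vector(d, index_terms))        # [2, 1, 1] and [0, 1, 1]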

Moving on, it is now time to show you the true IR colors of my COSIM:

Figure 5. The cosine similarity (cosine angle) between query and documents.
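(The formula in Figure 5 does not survive in this text-only copy; it is the standard weighted cosine similarity, which in plain notation reads)

Sim(Q, D) = Σ (wQi * wDi) / [ (Σ wQi^2)^(1/2) * (Σ wDi^2)^(1/2) ]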

where the sigma symbol means "the sum of", Q is a query, D is a document relevant to Q, and the w's are term weights (see reference 4). How these weights are defined determines the significance and usefulness of the cosine similarity measure. By defining tfmax as the maximum term frequency in a document, N as the number of documents in a collection, and n as the number of documents containing a query term, we can redefine term weights as

w = tf/tfmax
w = IDF = log(N/n)
w = tf*IDF = tf*log(N/n)
w = tf*IDF = tf*log((N - n)/n)

or even in terms of variants of tf and IDF, each with its own customized definition and theoretical interpretation.
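A quick Python sketch of these weighting schemes; the numbers below are arbitrary, and base-10 logarithms are assumed since the tutorial does not fix a base:

import math

tf, tfmax = 3, 5     # term count and maximum term count in a document (arbitrary values)
N, n = 1000, 100     # collection size and number of documents containing the term

w1 = tf / tfmax                       # w = tf/tfmax
w2 = math.log10(N / n)                # w = IDF = log(N/n)
w3 = tf * math.log10(N / n)           # w = tf*IDF = tf*log(N/n)
w4 = tf * math.log10((N - n) / n)     # w = tf*IDF = tf*log((N - n)/n)

print(round(w1, 4), round(w2, 4), round(w3, 4), round(w4, 4))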

This tutorial illustrates that term vector theory is the result of applying an ancillary technique, Vector Analysis, to an information retrieval problem. Vector Analysis itself is independent of the nature of the underlying system.

Term Vector Fast Track Tutorials

Due to many requests, I have created the following fast tracks:

Term Vector Fast Track Tutorial - A Linear Algebra Approach  
The working example covers a vector space consisting of many dimensions.

Term Vector Fast Track Tutorial  
The working example is limited to a vector space consisting of 2 dimensions only.

These fast tracks walk you through five simple steps, showing you how to compute term frequencies, inverse document frequencies, term weights, dot products, vector magnitudes, and cosine similarity values. With this information, a particular term weighting definition, and given any two documents containing query terms and the query (of course), you should be able to determine which document is the best match for the query. Enjoy it.
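For readers who want all five steps in one place, here is a compact end-to-end Python sketch with a toy two-document collection and w = tf*IDF weights (the fast tracks above remain the full walk-throughs):

import math

# Toy collection: which document best matches the query? (invented data)
index_terms = ["holiday", "season"]
query = "holiday season"
docs = {"doc1": "holiday holiday season",
        "doc2": "season of giving"}

N = len(docs)  # number of documents in the toy collection

def tf(text, term):
    # term frequency: raw count of the term in the text
    return text.lower().split().count(term)

def idf(term):
    # inverse document frequency: log(N/n), with n = docs containing the term
    n = sum(1 for d in docs.values() if tf(d, term) > 0)
    return math.log10(N / n) if n else 0.0

def weight_vector(text):
    # term weights w = tf*IDF, one coordinate per index term
    return [tf(text, t) * idf(t) for t in index_terms]

def cosine(a, b):
    # dot product divided by the product of the vector magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    mags = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mags if mags else 0.0

q = weight_vector(query)
scores = {name: cosine(q, weight_vector(text)) for name, text in docs.items()}
print(max(scores, key=scores.get), scores)   # the best-matching document and all scores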

Tutorial Review

Here are some nice exercises.

1. Calculate the A·B DOT Product, Euclidean distances, and cosine angle for the points at A(2, 1) and B(3, 4) from the origin of a two-dimensional plane.
2. Assume that A represents a query and B represents a document, and that there is a third document X represented by a point at X(2, 3). Which document, X or B, is more relevant to A? Show calculations.
3. How "far" is X from B? Show calculations.
4. A database collection consists of 1 million documents, of which 200,000 contain the term holiday while 250,000 contain the term season. A document repeats holiday 7 times and season 5 times. It is known that holiday is repeated more than any other term in the document. Calculate the weight of both terms in this document using three different term weight methods. Try with

w = tf/tfmax  
w = (tf/tfmax)*IDF = (tf/tfmax)*log(N/n)  
w = (tf/tfmax)*IDF = (tf/tfmax)*log((N - n)/n)  

5. Repeat exercise 4, this time without including the tfmax term in the calculations.
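To check your arithmetic for exercises 1-3, a short Python sketch using the coordinates given above (it only verifies the numbers; the reasoning is still yours to show):

import math

A, B, X = (2, 1), (3, 4), (2, 3)   # query A, documents B and X (from the exercises)

def dot(p, q):
    return sum(x * y for x, y in zip(p, q))

def distance(p, q=(0, 0)):
    # Euclidean distance; with the default second point this is the magnitude.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def cosine(p, q):
    return dot(p, q) / (distance(p) * distance(q))

print("A·B =", dot(A, B))                        # dot product
print("|A|, |B| =", distance(A), distance(B))    # Euclidean distances from the origin
print("cos(A, B) =", cosine(A, B))               # exercise 1
print("cos(A, X) =", cosine(A, X))               # exercise 2: compare with cos(A, B)
print("d(B, X) =", distance(B, X))               # exercise 3: how "far" X is from B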

Feedback Questions

Here is a list of feedback questions. I'm trying to answer each one by quoting relevant passages from this tutorial. Additional material is also included.

Q:  Are the points A & B terms or phrases in the document and C is the query? 

A:  "For instance, let say that point A(x1, y1) represents a query and points B(x2, y2), D(x3, y3), E(x4, y4), F(x5, y5), etc represent documents." 

What this means is that in the figures A is a point representing a query and B is a point representing a document. C is just the origin of the graph starting at 0, 0. 

In the figures, a graph represents the term space and the axes are given in term weight units,  w . If weights are defined as  w = tf/tfmax , then the units are normalized term frequencies (normalized term occurrences), but as mentioned before, no major search engine defines term weights using only number of term occurrences in a document. If a system defines term weights using  w = (tf/tfmax)*log((N - n)/n)  or a derivative of this, then the units of the axes must be redefined in that way, too. 

One more thing: the term spaces (the graphs) are created according to the number of terms in the index. We add one axis (dimension) per term. So, if the index consists of 3 terms, we represent docs and queries as vectors in 3 dimensions. If the index consists of n terms, we then need n dimensions to treat docs and queries as vectors. In this case a graphical representation is not possible. 

Q:  What is the distance and angle representing? 

A:  "Thus, the length of the arrows represents the magnitudes of the vectors and the angle described by the vectors represents their orientation in the two-dimensional space".... "As the angle between the vectors shortens, the cosine angle approaches 1, meaning that the two vectors are getting closer, meaning that the similarity of whatever is represented by the vectors increases. This is a convenient way of ranking documents by measuring how close their vectors are to a query vector." 

What this means is that terms in long docs are more likely to have more occurrences (high tf values) than in shorter docs. In addition, long documents tend to have more distinct terms, increasing the number of matches between a query and a long document (and thus its chances of retrieval over shorter documents). 

The dot product is normalized by dividing by the magnitudes to compute a cosine angle. This is done to normalize for document lengths. Thus, we can think of the magnitudes, and of dividing by them, as accounting for document lengths and document normalization. This is also known as cosine normalization. 

However, in a strict sense, this approach (cosine normalization) is not entirely correct and has some drawbacks. Let me explain. 

In classic vector models cosine normalization is used to force the magnitude of the weighted document vector to unit length and to allow one to compare the angles between the weighted vectors. So far, this makes sense. Fine. 

However, this creates another problem: longer documents are given smaller term weights and shorter documents are favored over longer ones. Pivoted Unique Normalization tries to correct for discrepancies, based on document length, between the probability that a document is relevant and the probability that the document will be retrieved. This is why the pivot normalization method was introduced by Singhal et al. while at Cornell. I believe Singhal is now at Google. 

Regarding the angle described by the vectors, the cosine angle is a measure of similarity. It represents just that: similarity, how similar or  alike  a doc is to a query. It does not represent term importance, which is assessed through semantics, contextuality and topic analysis. The angle itself is an "estimate of closeness". 

If documents and queries are very alike, their corresponding doc-query angle should be very small, approaching zero (cosine angle approaching 1). On the other hand, if the angle is high, let's say 90 degrees, the vectors would be perpendicular (orthogonal) and the cosine angle would be 0. In such a case docs and queries are not related (in IR we actually say that "docs and queries are orthogonal"). 

Thus, 

cosine (90) = 0 (completely unrelated) 
cosine (0) = 1 (completely related) 

If the cosine angle is between 0 and 1, this means that documents and queries are related to some degree. 
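A two-line check of these two limiting cases in Python (the vectors are arbitrary examples):

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(cosine((1, 0), (0, 1)))   # perpendicular (90 degrees): 0.0, completely unrelated
print(cosine((2, 0), (5, 0)))   # same direction (0 degrees): 1.0, completely related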

Q:  Are terms/phrases in the document measured against each other or against the whole of the SE's indices? 

A:  "Those of us involved in IR research know that plain  tf*IDF -like models are used in computer science grad schools to introduce students to vector analysis and advanced term weighting methods." 

The answer here should be clear by now: terms are measured against the queried index collection via contributions from the IDF term, and against the individual document via the tf term. So, you need both contributions. 

At grad schools students are first exposed to the classic term count model that scores  tf  only. Then they learn about models that use  tf  combined with IDF  to grasp the basic ideas of term vector models. 

Once they learn why we need to include global information, more advanced methods are introduced (examples: pivoted normalization, Robertson's BM25 Model, the Extended Boolean Model, the Generalized Vector Model, etc.). In all cases one needs global weights via the IDF term, not just tf weights.

Next: EF-Ratios Tutorial
Prev: Document Indexing Tutorial

References

1. SEWF's thread: Term Vector Theory & Keyword Weights; E. Garcia (2004).
2. Method for Calculating the Importance of a Term in a Document; SEOCHAT.com (2005).
3. Patents on Duplicated Content and Re-Ranking Methods; E. Garcia, SES, San Jose; Advanced Track Issues: The Patent Files, August 8 - 11 (2005).
4. Term Vector Theory and Keyword Weights; E. Garcia (2005).
5. The Classic Vector Space Model; E. Garcia (2004).
6. MySQL Internals Manual :: 4.7 Full-text Search; MySQL AB Corporation (2005).
7. Implementation and Application of Term Weights in a MySQL Environment; E. Garcia (2005).
8. Pivoted Document Length Normalization; Amit Singhal, Chris Buckley, and Mandar Mitra, Cornell University, Ithaca, NY.
9. Term Vector Fast Track; E. Garcia (2005).
