Life Long Programmer's Community Log: Lucene/Solr Search Similarity

Lucene/Solr Search Similarity
tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d. Documents that have more occurrences of a given term receive a higher score.
Math.sqrt(freq)
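
As a rough illustration of how the square root dampens raw counts, here is a minimal sketch. The class name TfSketch is made up for this example and is not part of Lucene; only the formula comes from the line above.

```java
final class TfSketch {
    // Same shape as the formula above: square root of the raw term count.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    public static void main(String[] args) {
        // Quadrupling the number of occurrences only doubles the tf factor.
        System.out.println(tf(1f));  // 1.0
        System.out.println(tf(4f));  // 2.0
        System.out.println(tf(16f)); // 4.0
    }
}
```

The practical effect is that a document repeating a term 16 times scores only 4 times higher on this factor than a document mentioning it once.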

idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears), so rarer terms contribute more to the total score.
Math.log(numDocs/(double)(docFreq+1)) + 1.0
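
To see how this behaves for rare and common terms, a small sketch follows. IdfSketch and the document counts are invented for illustration; the formula is the one shown above, where the +1 in the denominator avoids a division by zero when docFreq is 0.

```java
final class IdfSketch {
    // Same formula as above: natural log of numDocs / (docFreq + 1), plus 1.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        long numDocs = 1000; // hypothetical index size
        System.out.println(idf(9, numDocs));   // rare term: ln(100) + 1, roughly 5.6
        System.out.println(idf(499, numDocs)); // common term: ln(2) + 1, roughly 1.7
        System.out.println(idf(999, numDocs)); // term in almost every document: 1.0
    }
}
```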

Query Coordination
coord(q,d) is a score factor, computed at search time, based on how many of the query terms are found in the specified document. Typically, a document that contains more of the query's terms will receive a higher score than another document with fewer query terms.
overlap / (float)maxOverlap

The coordination factor (coord) is used to reward documents that contain a higher percentage of the query terms. The more query terms that appear in the document, the greater the chances that the document is a good match for the query.
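
A minimal sketch of the coordination factor, using the overlap / maxOverlap formula above. CoordSketch is a hypothetical name, and the four-term query is an invented example.

```java
final class CoordSketch {
    // overlap: number of query terms found in the document.
    // maxOverlap: total number of terms in the query.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        int maxOverlap = 4; // a hypothetical four-term query
        System.out.println(coord(1, maxOverlap)); // 0.25
        System.out.println(coord(2, maxOverlap)); // 0.5
        System.out.println(coord(4, maxOverlap)); // 1.0
    }
}
```

The cast to float matters here: without it, integer division would return 0 for every document that misses at least one term.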

queryNorm(q) is a normalizing factor used to make scores between queries comparable. This factor does not affect document ranking (since all ranked documents are multiplied by the same factor), but rather just attempts to make scores from different queries (or even different indexes) comparable.
1.0 / Math.sqrt(sumOfSquaredWeights)
The sumOfSquaredWeights is calculated by squaring the IDF of each term in the query (multiplied by any search-time boost on that term) and adding the results together.
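
The following sketch works through that calculation under a simplifying assumption: every boost is 1, so each term's weight is just its IDF. QueryNormSketch and the IDF values are invented for illustration.

```java
final class QueryNormSketch {
    // Sum of squared per-term weights, assuming all boosts are 1.
    static float sumOfSquaredWeights(float[] idfs) {
        float sum = 0f;
        for (float idf : idfs) {
            sum += idf * idf;
        }
        return sum;
    }

    // Same formula as above.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        float[] idfs = {3f, 4f};               // hypothetical IDFs of a two-term query
        float sum = sumOfSquaredWeights(idfs); // 9 + 16 = 25
        System.out.println(queryNorm(sum));    // 1 / sqrt(25) = 0.2
    }
}
```

Because the same queryNorm multiplies every document's score for a given query, changing it rescales the scores without reordering the results.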

t.getBoost() is a search-time boost of term t in the query q, as specified in the query text.

Index-Time Field-Level Boosting
We strongly recommend against using field-level index-time boosts.

norm(t,d) encapsulates a few (indexing time) boost and length factors:
Field boost - set by calling field.setBoost() before adding the field to a document.
lengthNorm (field-length norm) - computed when the document is added to the index in accordance with the number of tokens of this field in the document, so that shorter fields contribute more to the score. lengthNorm is computed by the Similarity class in effect at indexing.
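
As a sketch of how the two factors combine, the example below assumes the inverse-square-root length norm, 1 / sqrt(numTerms), which is what the default TF-IDF similarity uses as far as I know. NormSketch is a hypothetical name, and the sketch ignores the lossy single-byte encoding Lucene applies when it stores norms in the index.

```java
final class NormSketch {
    // Assumed length norm: shorter fields get a larger value.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // norm(t,d) as the product of the index-time field boost and the length norm.
    static float norm(float fieldBoost, int numTerms) {
        return fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        System.out.println(norm(1f, 4));   // short field, no boost: 0.5
        System.out.println(norm(1f, 100)); // long field, no boost: 0.1
        System.out.println(norm(2f, 4));   // short field, boost 2: 1.0
    }
}
```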

DefaultSimilarity extends TFIDFSimilarity
Boolean Model
Vector Space Model
http://ift.tt/1cLXCHE
http://ift.tt/1JueFuB

Implementation of Similarity with the Vector Space Model. Expert: Scoring API. TFIDFSimilarity defines the components of Lucene scoring. Overriding computation of these components is a convenient way to alter Lucene scoring. Suggested reading: Introduction To Information Retrieval, Chapter 6.
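
To show how the components fit together, here is a sketch of the practical scoring function as the TFIDFSimilarity documentation describes it: coord(q,d) * queryNorm(q) * the sum over matching terms of tf * idf^2 * t.getBoost() * norm(t,d). ScoreSketch and TermStats are hypothetical names, the query-level boost is assumed to be 1, and real Lucene reads these statistics from the index instead of taking them as arguments.

```java
final class ScoreSketch {
    /** Hypothetical holder for the per-term inputs of one query term. */
    static final class TermStats {
        final float freq;   // occurrences of the term in the document (0 if absent)
        final long docFreq; // number of documents containing the term
        final float boost;  // search-time boost, t.getBoost()
        final float norm;   // index-time norm(t,d)

        TermStats(float freq, long docFreq, float boost, float norm) {
            this.freq = freq;
            this.docFreq = docFreq;
            this.boost = boost;
            this.norm = norm;
        }
    }

    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Scores one document against all terms of a query.
    static float score(TermStats[] queryTerms, long numDocs) {
        float sumOfSquaredWeights = 0f;
        float sum = 0f;
        int overlap = 0;
        for (TermStats t : queryTerms) {
            float termIdf = idf(t.docFreq, numDocs);
            float weight = termIdf * t.boost;
            sumOfSquaredWeights += weight * weight; // every query term counts here
            if (t.freq > 0) {                       // only matching terms add to the score
                overlap++;
                sum += tf(t.freq) * termIdf * termIdf * t.boost * t.norm;
            }
        }
        if (overlap == 0) {
            return 0f; // no query term matched
        }
        float coord = overlap / (float) queryTerms.length;
        float queryNorm = (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        long numDocs = 1000; // hypothetical index size
        TermStats hit = new TermStats(4f, 999, 1f, 0.5f);
        TermStats miss = new TermStats(0f, 999, 1f, 0.5f);
        // One-term query that matches: tf = 2, idf = 1, norm = 0.5.
        System.out.println(score(new TermStats[] {hit}, numDocs)); // 1.0
        // Adding a query term the document lacks lowers coord and queryNorm.
        System.out.println(score(new TermStats[] {hit, miss}, numDocs));
    }
}
```

Overriding any one of these components in a Similarity subclass changes the corresponding factor in this product, which is why it is a convenient way to alter scoring.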

From the public RSS feed of Jeffery Yuan, via LifeLong Community.
