public class TFIDF extends java.lang.Object implements RelevanceRanker
One well-studied technique is to normalize the tf weights of all terms occurring in a document by the maximum tf in that document. For each document d, let tfmax(d) be the maximum tf over all terms in d. Then, we compute a normalized term frequency for each term t in document d by
$$\mathrm{ntf}_{t,d} = a + (1 - a)\,\frac{\mathrm{tf}_{t,d}}{\mathrm{tf}_{\max}(d)}$$
where a is a value between 0 and 1, generally set to 0.4, although some early work used 0.5. The term a is a smoothing term whose role is to damp the contribution of the second term, which may be viewed as a scaling down of tf by the largest tf value in d. The main idea of maximum tf normalization is to mitigate the following anomaly: longer documents show higher term frequencies merely because they tend to repeat the same words over and over again. Maximum tf normalization does suffer from the following issues:
See Also: BM25

| Constructor and Description |
|---|
| TFIDF() - Constructor. |
| TFIDF(double smoothing) - Constructor. |

| Modifier and Type | Method and Description |
|---|---|
| double | rank(Corpus corpus, TextTerms doc, java.lang.String[] terms, int[] tf, int n) - Returns a relevance score between a set of terms and a document based on a corpus. |
| double | rank(Corpus corpus, TextTerms doc, java.lang.String term, int tf, int n) - Returns a relevance score between a term and a document based on a corpus. |
| double | rank(int tf, int maxtf, long N, long n) - Returns a relevance score between a term and a document based on a corpus. |

public TFIDF()

public TFIDF(double smoothing)
Parameters:
smoothing - the smoothing parameter in maximum tf normalization.

public double rank(int tf, int maxtf, long N, long n)
Parameters:
tf - the frequency of the search term in the document to rank.
maxtf - the maximum frequency over all terms in the document.
N - the number of documents in the corpus.
n - the number of documents containing the given term in the corpus.

public double rank(Corpus corpus, TextTerms doc, java.lang.String term, int tf, int n)
Specified by: rank in interface RelevanceRanker
Parameters:
corpus - the corpus.
doc - the document to rank.
term - the search term.
tf - the term frequency in the document.
n - the number of documents containing the given term in the corpus.

public double rank(Corpus corpus, TextTerms doc, java.lang.String[] terms, int[] tf, int n)
Specified by: rank in interface RelevanceRanker
Parameters:
corpus - the corpus.
doc - the document to rank.
terms - the search terms.
tf - the term frequencies in the document.
n - the number of documents containing the given term in the corpus.
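
The maximum tf normalization described in the class summary, combined with an inverse document frequency factor, can be sketched as follows. This is a minimal, self-contained illustration and not necessarily how this class computes its score: the class name TfIdfSketch, its helper methods, the idf definition log(N / n), and the zero score for an absent term are all assumptions made for the example. The parameter names mirror rank(int tf, int maxtf, long N, long n).

```java
public class TfIdfSketch {
    /**
     * Maximum tf normalization: a + (1 - a) * tf / maxtf,
     * where a is the smoothing parameter (commonly 0.4).
     */
    public static double normalizedTf(int tf, int maxtf, double a) {
        if (maxtf <= 0) {
            throw new IllegalArgumentException("maxtf must be positive");
        }
        return a + (1.0 - a) * tf / maxtf;
    }

    /**
     * Inverse document frequency. ASSUMPTION: the plain log(N / n) form,
     * with N documents in the corpus and n documents containing the term.
     */
    public static double idf(long N, long n) {
        return Math.log((double) N / n);
    }

    /**
     * Hypothetical tf-idf score for one term in one document.
     * ASSUMPTION: a term that does not occur in the document scores 0.
     */
    public static double score(int tf, int maxtf, long N, long n, double a) {
        if (tf == 0) {
            return 0.0;
        }
        return normalizedTf(tf, maxtf, a) * idf(N, n);
    }

    public static void main(String[] args) {
        // Term occurs 3 times; the most frequent term in the document occurs 10 times.
        System.out.println(normalizedTf(3, 10, 0.4));
        // Corpus of 1000 documents, 10 of which contain the term.
        System.out.println(score(3, 10, 1000, 10, 0.4));
    }
}
```

With a = 0.4, the most frequent term in a document gets a normalized tf of 1, and any term that occurs at least once gets more than 0.4, so the spread between frequent and rare terms inside one document is compressed regardless of document length.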