In IR, the IDF (idf(t,c):=-log P(t|c)) is an important factor. (t is a term, c is the collection). If a probabilistic reasoning system wants a notion of "high probability for rare term", how can this be achieved?

A normalisation such as -log P(t|c) / -log P(t_min|c) seems an option, since it assigns 1.0 to rare terms, and less than 1.0 to more frequent terms, and 0.0 to terms with P(t|c)=1.

The pragmatic person might say: good, let's stop here, that's a value between 1 and 0, and let us call it a probability. The theoretically inclined person will ask: What does this probability mean?

Therefore, I investigated ways to assign a semantics to idf(t,c)/maxidf(c). This led to the following equation:

P(t occurs | c) = (1 - P(t informs | c))^maxidf(c)

where

P(t informs | c) = idf(t,c) / maxidf(c) = (-log P(t occurs | c)) /
maxidf(c)

The equation can be proven using the Euler convergence:

e^{-x} = lim_{N->infty} (1 - x/N)^N

For x:=idf(t,c), it follows:

e^{-idf(t,c)} approx (1 - idf(t,c) / maxidf(c)) ^ {maxidf(c)}

Having found a probabilistic semantics makes the theoretician
happy.

This work was actually motivated by a "feature" of the
distinct projection in our PRA.
Gabriella
Kazai used in 2001 a PRA expression to compute an idf-based
probability. She computed 1/N (N is the number of documents) for each
tuple in a non-probabilistic document index. Then, n/N was to be
computed, which corresponds in PRA to

Project DISJOINT[$1](index_1_N)

By accident, the following expression was executed:

Project INDEPENDENT[$1](index_1_N)

Then, the maximal probability for a frequent
term was 1-e^{-1}=0.634, i.e. the probability that the term occurs in
at least one document.

1-e^(-1) = 1 - (1 - n/N)^n

The further investigation led to the probability of being
informative, and later to the PRA-based formulation of probability
estimations, the relational Bayes.

\sum_t P(q|t) P(t|d) = P(q) * sum_t P(t|q) P(t|d) * 1/P(t)

The operations are:

Select[condition](prae)

Project assumption [attributes](prae)

Join assumption [condition](prae1, prae2)

Unite assumption (prae1, prae2)

Subtract assumption (prae1, prae2)

Bayes assumption [evidenceKeyAttributes](prae)

PRAE stands for probabilistic relational algebra expression.

w_bir(t,r,\bar{r}) = idf(t,r) - idf(t,\bar{r})

For positive and negative term events:

w_bir(t,r,\bar{r}) = idf(t,r) - idf(t,\bar{r}) + idf(\bar{t},\bar{r}) - idf(\bar{t},r)

Here, idf is defined as the logarithm of the document-based term probability P_D(t|x):

idf(t,x) := -log P_D(t|x)

Sets of different granularity are used for estimating probabilities (few relevant documents, many non-relevant documents for the case that statistics over non-relevant are approximated by statistics over the whole collection).

Combining two probabilities can be very different from the probability obtained when combining on frequency level. For example, assume two probabilities: 0.5=10/20, 0.8=8,000,000/10,000,000. Combining the evidence on frequency level yields (8,000,000+10)/(10,000,000+20), value dominated by the probability based on large frequencies. Therefore, the idea to replace n/N-based probabilities by the BM25 TF normalisation n/(n+K) or by Poisson-based (exponential probability functions), since then probabilities and idf-values might be more comparable than they are for sets of different granularity.

- As binary vectors where 1 means term occurs, and 0 means term
does not occur in document. Representation used in the BIR model.

- As sequences of terms. Used in Language Modelling.
- As frequency vectors. Frequency values are in 0 to infinite.

Model |
BIR |
Poisson |
Language Modelling |

Event space |
{0,1} |
0,1,2,... |
{t_1,t_2,...} |

Background model |
sets of documents |
averages observed in collection |
sequence of terms in collection |

Probabilities |
P(t|r) = n_D(t,r)/N_D(r) P(t|\bar{r}) = n_D(t,\bar{r})/N_D(\bar{r}) |
P(k_t|c) = lambda^k_t / k_t! *
e^{-lambda} |
P(t|c) = n_L(t,c) / N_L(c) P(t|d) = n_L(t,d) / N_L(d) |

The work made a relationship between location-based and document-based term probabilities explicit. This was referred to as the

Poisson bridge (because of lambda = n_L(t,c) / N_D(c)):

P_L(t|c) * avgdl(c) = P_D(t|c) * avgtf(t,c) = lambda(t,c)

The proof is by inserting the definitions:

n_L(t,c)/N_L(c) * N_L(c)/N_D(c) = n_D(t,c)/N_D(c) * n_L(t,c)/n_D(t,c) = n_L(t,c) / N_D(c)

The Poisson bridge helps to transfer location-based into document-based probabilities, or the other way round. This is useful when relating retrieval models.

Our research on modelling TF-IDF and Language Modelling led to a formulation of TF-IDF based on P(d|q)/P(d). This is relatively exciting, since it shows that TF-IDF and Language Modelling are actually the same before decomposition of query (document) probabilities.

P(q|d) / P(q) = P(d,q) / (P(d) P(q)) = P(d|q) / P(d)

For LM, a form of smoothing helps to estimate P(q|d). For example, linear smoothing:

P(q|d) = lambda P(t|d) + (1-lambda) P(t|c)

For explaining TF-IDF via P(d|q), an "extreme" mixture helps:

P(d|q) = P(t|q) if t in q, P(t|c) otherwise

P(d|q)/P(d) = prod_{t in d \cap q} (P(t|q) / P(t|c))^{n_L(t,d)}

Then, there is a bit more work to to to reach TF-IDF. Essentially, the location-based probability P(t|c) needs to be transferred into the document-based probability. The notation P_L(t|c) and P_D(t|c) makes the event space of the probabilities explicit.

The assumption

P_L(t|q) P_L(t|c) = P_D(t|c)

transfers the location-based expression to the definition of "naive" TF-IDF:

RSV_{naive TF-IDF}(d,q,c) := sum_{t in d \cap q} n_L(t,d) -log P_D(t|c)

We refer to this as "naive" because of the total term frequency n_L(t,d). We were looking for a probabilistic semantics to reach a TF normalisation that works.

P(x) = sum_t P(x|t) P(t).

Having achieved the explanation of TF-IDF for independent terms, the question was: How does the derivation work for disjoint terms.

P(q|d) = \sum_t P(q|t) P(t|d) = P(q) * sum_t P(t|q) P(t|d) * 1/P(t)

P(q|d)/P(q) = \sum_t P(t|q) P(t|d) * 1/P(t)

P(d|q) = \sum_t P(d|t) P(t|q) = P(d) * sum_t P(t|q) P(t|d) * 1/P(t)

P(d|q)/P(d) = \sum_t P(t|q) P(t|d) * 1/P(t)

Here, the integral

\int 1/x = \log x

brought the connection.

n / (n+K) = n * 1/(n+K)

K incorporates the pivoted document length (pivdl = dl/avgdl). The TF quantification rises fast and then saturates. The pivdl is small for short documents, thus the TF quantification is larger for short than long documents, and the rise is faster for short than long documents.

The quantification n/(n+K) was motivated in Robertson/Walker:1994 as an approximation of the 2-Poisson model. The derivation is slightly lengthy, and we asked ourselves what the approximation means for IR, and for probability theory in general.

Thus, the notion of semi-subsumed events was developed.

We show in our poster that n/(n+1) can be viewed as a probabilistic assumption that lies precisely in the middle between subsumption and independence.

This assumption explains what the BM25 TF quantification expresses, namely a dependence between the occurrences of the same term. Moreover, this assumption allows for transferring the BM25 TF into probability theory in general.

The price for expressiveness is efficiency.

It is challenging to achieve efficient logical retrieval on large-scale data.

Pushing techniques of distributed IR into a reasoning engine is a potential way forward. The low-level implementation is time-consuming and bears many risks. Therefore, we looked into ways to use probabilistic logical modelling to model distributed IR. We could achieve a scalable technology by utilising the PRA layer/engine itself to model strategies of distributed IR.