Summaries

For quick look-ups, I am maintaining here, in loose and informal form, summaries of previous and current work.

1. PRA (1994 - 1997)

In 1993, Norbert Fuhr told me: "The motivation for an intensional PRA is that 'DB \ IR + IR \ DB' does not work." It took me a year to understand what he meant, and around 10 years to understand the deeper truth underlying this sentence.

2. POOL (1996 - 1999)

Since PRA and PD were relational, and since I found Fagin/Halpern (Reasoning about Knowledge), Kifer/Lausen (F-Logic), Meghini/Sebastiani/Straccia (MIRTL) and other knowledge representations exciting, I designed POOL (Probabilistic Object-oriented Logic), aiming at a seamless integration of term-based IR, probabilistic reasoning, and object-oriented modelling.

3. Probability of Being "Informative" (2001 - 2003)

In a probabilistic reasoning system, usually occurrence-based probabilities are aggregated. For example, given an event t that occurs n_t times in N trials, n_t/N is the probability that the event occurs. This estimate is also referred to as the maximum-likelihood estimate, assuming a Binomial distribution. The aggregation of probabilities is governed by set theory; for example, P(t_1 AND t_2) = P(t_1) * P(t_2) for two independent events.

In IR, the IDF, idf(t,c) := -log P(t|c), is an important factor (t is a term, c is the collection). If a probabilistic reasoning system wants a notion of "high probability for rare term", how can this be achieved?

A normalisation such as -log P(t|c) / -log P(t_min|c), where t_min is the rarest term, seems an option: it assigns 1.0 to the rarest term, less than 1.0 to more frequent terms, and 0.0 to terms with P(t|c)=1.

The pragmatic person might say: good, let's stop here, that is a value between 0 and 1, so let us call it a probability. The theoretically inclined person will ask: what does this probability mean?

Therefore, I investigated ways to assign a semantics to idf(t,c)/maxidf(c). This led to the following equation:

P(t occurs | c) = (1 - P(t informs | c))^maxidf(c)

where
P(t informs | c) = idf(t,c) / maxidf(c) = (-log P(t occurs | c)) / maxidf(c)

The equation can be proven using the Euler convergence:

e^{-x} = lim_{N->infty} (1 - x/N)^N

For x := idf(t,c) and N := maxidf(c), it follows:

e^{-idf(t,c)} = P(t occurs | c) approx (1 - idf(t,c) / maxidf(c)) ^ {maxidf(c)}

Having found a probabilistic semantics makes the theoretician happy.
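A minimal numerical sketch (Python, toy values of my own choosing) of the Euler convergence behind this semantics; the approximation tightens as maxidf(c) grows:

  # Euler convergence: e^{-x} = lim_{N->infty} (1 - x/N)^N, here with x = idf(t,c)
  import math

  idf = 4.6                                 # toy idf(t,c) value
  exact = math.exp(-idf)                    # P(t occurs|c) = e^{-idf(t,c)}

  for maxidf in [7.0, 14.0, 28.0, 112.0]:   # toy values of maxidf(c)
      p_informs = idf / maxidf              # P(t informs|c) = idf(t,c) / maxidf(c)
      approx = (1.0 - p_informs) ** maxidf  # (1 - P(t informs|c))^maxidf(c)
      print(maxidf, round(exact, 5), round(approx, 5))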

This work was actually motivated by a "feature" of the distinct projection in our PRA. In 2001, Gabriella Kazai used a PRA expression to compute an idf-based probability. She assigned the probability 1/N (N is the number of documents) to each tuple of a non-probabilistic document index. Then, n/N was to be computed, which corresponds in PRA to

Project DISJOINT[$1](index_1_N)

By accident, the following expression was executed:

Project INDEPENDENT[$1](index_1_N)

Then, the maximal probability for a frequent term was 1 - e^{-1} approx 0.632, i.e. the probability that the term occurs in at least one document. For a term with document frequency n, the independent projection yields

1 - (1 - 1/N)^n approx 1 - e^{-n/N}

which for n = N (a term occurring in every document) gives 1 - e^{-1}.
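A minimal sketch (Python, toy numbers; the function names are mine, not PRA syntax) of the two projection assumptions applied to n index tuples of probability 1/N each:

  # Disjoint vs. independent aggregation of n tuples, each with probability 1/N
  def project_disjoint(n, N):
      return n / N                        # sum of the tuple probabilities

  def project_independent(n, N):
      return 1.0 - (1.0 - 1.0 / N) ** n   # prob. that at least one tuple holds

  N = 1000
  for n in [1, 10, 500, 1000]:
      print(n, round(project_disjoint(n, N), 4), round(project_independent(n, N), 4))
  # for n = N, the independent projection approaches 1 - e^{-1} approx 0.632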

The further investigation led to the probability of being informative, and later to the PRA-based formulation of probability estimations, the relational Bayes.

4. A General Matrix Framework (1999 - 2006)

This work was originally motivated by a relationship between the total probability and the GVSM (generalised vector-space model):

sum_t P(q|t) P(t|d) = P(q) * sum_t P(t|q) P(t|d) * 1/P(t)
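A minimal sketch (Python, a toy joint distribution of my own choosing) verifying that the rewriting is simply Bayes' theorem applied inside the sum:

  # Verify: sum_t P(q|t) P(t|d) = P(q) * sum_t P(t|q) P(t|d) / P(t)
  terms = ['a', 'b', 'c']
  P_tq = {('a', True): 0.10, ('a', False): 0.40,   # toy joint P(t, q)
          ('b', True): 0.05, ('b', False): 0.25,
          ('c', True): 0.02, ('c', False): 0.18}
  P_t_given_d = {'a': 0.2, 'b': 0.5, 'c': 0.3}     # toy P(t|d)

  P_t = {t: P_tq[(t, True)] + P_tq[(t, False)] for t in terms}
  P_q = sum(P_tq[(t, True)] for t in terms)
  P_q_given_t = {t: P_tq[(t, True)] / P_t[t] for t in terms}
  P_t_given_q = {t: P_tq[(t, True)] / P_q for t in terms}

  lhs = sum(P_q_given_t[t] * P_t_given_d[t] for t in terms)
  rhs = P_q * sum(P_t_given_q[t] * P_t_given_d[t] / P_t[t] for t in terms)
  print(round(lhs, 6), round(rhs, 6))   # both sides are equal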

5. The Relational Bayes (2002 - 2006)

To cut a long paper (and story) short: the conventional PRA rests on the famous five operators of relational algebra: Select, Project, Join, Unite, Subtract. The famous five AGGREGATE probabilities. The new operator, the relational Bayes, ESTIMATES probabilities.
The operations are:
  Select[condition](prae)
  Project assumption [attributes](prae)
  Join assumption [condition](prae1, prae2)
  Unite assumption (prae1, prae2)
  Subtract assumption (prae1, prae2)
  Bayes assumption [evidenceKeyAttributes](prae)
PRAE stands for probabilistic relational algebra expression.
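A minimal sketch (Python, toy data of my own choosing, not the actual PRA/HySpirit implementation) of the difference: the famous five combine the probabilities of given tuples, whereas Bayes turns tuple weights into probabilities by normalising over an evidence key (here under a disjointness/sum assumption):

  # A probabilistic relation: list of (tuple, probability) pairs.
  # Non-probabilistic term-document index: every tuple has probability 1.0.
  index = [(('sailing', 'd1'), 1.0), (('boats', 'd1'), 1.0),
           (('sailing', 'd2'), 1.0), (('sailing', 'd3'), 1.0)]

  def project_disjoint(rel, attrs):
      # Project AGGREGATES: sum the probabilities of tuples that coincide on attrs.
      out = {}
      for tup, p in rel:
          key = tuple(tup[i] for i in attrs)
          out[key] = out.get(key, 0.0) + p
      return list(out.items())

  def bayes_disjoint(rel, evidence_key):
      # Bayes ESTIMATES: divide each probability by the total over its evidence key.
      totals = {}
      for tup, p in rel:
          key = tuple(tup[i] for i in evidence_key)
          totals[key] = totals.get(key, 0.0) + p
      return [(tup, p / totals[tuple(tup[i] for i in evidence_key)]) for tup, p in rel]

  print(project_disjoint(index, attrs=[0]))       # aggregation: term "frequencies"
  print(bayes_disjoint(index, evidence_key=[1]))  # estimation: P(term|document)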

6. Relevance Feedback: A Loss of Entropy but a Gain for IDF? (2004)

The BIR term weight can be formulated in an idf-based way. For positive term events:
w_bir(t,r,\bar{r}) = idf(t,\bar{r}) - idf(t,r)
For positive and negative term events:
w_bir(t,r,\bar{r}) = idf(t,\bar{r}) - idf(t,r) + idf(\bar{t},r) - idf(\bar{t},\bar{r})
Here, idf is defined as the negative logarithm of the document-based term probability P_D(t|x):
idf(t,x) := -log P_D(t|x)
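A minimal check (Python, toy probabilities of my own choosing) that the idf-based form coincides with the usual log-odds BIR term weight log [P(t|r)(1-P(t|\bar{r}))] / [P(t|\bar{r})(1-P(t|r))]:

  import math

  def idf(p):        # idf(t,x) = -log P_D(t|x)
      return -math.log(p)

  p = 0.6   # toy P_D(t|r)
  q = 0.1   # toy P_D(t|\bar{r})

  w_logodds = math.log((p * (1 - q)) / (q * (1 - p)))
  # idf(q)-idf(p) = idf(t,rbar)-idf(t,r);  idf(1-p)-idf(1-q) = idf(tbar,r)-idf(tbar,rbar)
  w_idf = (idf(q) - idf(p)) + (idf(1 - p) - idf(1 - q))
  print(round(w_logodds, 6), round(w_idf, 6))   # identical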
Probabilities are estimated from sets of very different granularity: there are few relevant documents, but many non-relevant documents (in particular when the statistics over the non-relevant documents are approximated by the statistics over the whole collection).
Combining two probabilities can be very different from combining the evidence on the frequency level. For example, assume two probabilities: 0.5 = 10/20 and 0.8 = 8,000,000/10,000,000. Combining the evidence on the frequency level yields (8,000,000+10)/(10,000,000+20) approx 0.8, a value dominated by the probability based on the large frequencies. Hence the idea to replace n/N-based probabilities by the BM25 TF normalisation n/(n+K), or by Poisson-based (exponential) probability functions, since then the probabilities and idf-values might be more comparable than they are for sets of different granularity.
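A minimal sketch (Python, the toy frequencies from above; the unweighted average is just one of many ways to combine on the probability level) of the difference between the two combination levels:

  # Combining evidence: probability level vs. frequency level
  n1, N1 = 10, 20                    # small sample:  P1 = 0.5
  n2, N2 = 8_000_000, 10_000_000     # large sample:  P2 = 0.8

  p_prob_level = ((n1 / N1) + (n2 / N2)) / 2   # unweighted average of the probabilities
  p_freq_level = (n1 + n2) / (N1 + N2)         # pooled frequencies
  print(round(p_prob_level, 3), round(p_freq_level, 3))
  # the pooled estimate is dominated by the large sample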

7. A Parallel Derivation of Probabilistic Retrieval Models (2004 - 2006)

Documents are represented in different forms:
  1. As binary vectors, where 1 means the term occurs and 0 means the term does not occur in the document. This representation is used in the BIR model.
  2. As sequences of terms. This representation is used in Language Modelling.
  3. As frequency vectors. Frequency values range from 0 to infinity.
We observed that in some approaches the estimates appeared in mixed form. Therefore, we became interested in a clean derivation. This was also important for the clear definition of PRA-based probability estimation via the relational Bayes, where location-based (sequences of terms) and document-based probabilities can be specified.

BIR:
  Event space: {0,1}
  Background model: sets of documents
  Probabilities: P(t|r) = n_D(t,r)/N_D(r),  P(t|\bar{r}) = n_D(t,\bar{r})/N_D(\bar{r})

Poisson:
  Event space: {0,1,2,...}
  Background model: averages observed in collection
  Probabilities: P(k_t|c) = lambda^{k_t} / k_t! * e^{-lambda}

Language Modelling:
  Event space: {t_1,t_2,...}
  Background model: sequence of terms in collection
  Probabilities: P(t|c) = n_L(t,c)/N_L(c),  P(t|d) = n_L(t,d)/N_L(d)

The work made the relationship between location-based and document-based term probabilities explicit. This was referred to as the Poisson bridge (because lambda(t,c) = n_L(t,c) / N_D(c)):

P_L(t|c) * avgdl(c) = P_D(t|c) * avgtf(t,c) = lambda(t,c)

The proof is by inserting the definitions:

n_L(t,c)/N_L(c)  *  N_L(c)/N_D(c) = n_D(t,c)/N_D(c)  *  n_L(t,c)/n_D(t,c) = n_L(t,c) / N_D(c)

The Poisson bridge helps to transfer location-based into document-based probabilities, or the other way round. This is useful when relating retrieval models.
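A minimal numerical sketch (Python, toy counts of my own choosing) of the bridge:

  # Poisson bridge: P_L(t|c) * avgdl(c) = P_D(t|c) * avgtf(t,c) = lambda(t,c)
  n_L_t_c = 300      # locations (occurrences) of term t in collection c
  n_D_t_c = 200      # documents of c that contain t
  N_L_c = 100_000    # total number of locations in c
  N_D_c = 1_000      # total number of documents in c

  P_L = n_L_t_c / N_L_c         # location-based term probability P_L(t|c)
  P_D = n_D_t_c / N_D_c         # document-based term probability P_D(t|c)
  avgdl = N_L_c / N_D_c         # average document length
  avgtf = n_L_t_c / n_D_t_c     # average within-document term frequency
  lam = n_L_t_c / N_D_c         # average number of occurrences of t per document

  print(P_L * avgdl, P_D * avgtf, lam)   # all three are equal (0.3)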

8. TF-IDF and Language Modelling: A Study of Probabilities and Theories (2006 - 2008)

8.1 Independent term events

Language Modelling is based on P(q|d). Hiemstra:2000 showed that, because P(q) is constant over documents, P(q|d)/P(q) can be computed instead (to be more precise, P(q|d) / prod_t ((1-lambda) P(t|c))).

Our research on modelling TF-IDF and Language Modelling led to a formulation of TF-IDF based on P(d|q)/P(d). This is relatively exciting, since it shows that TF-IDF and Language Modelling are actually the same before decomposition of query (document) probabilities.

P(q|d) / P(q) = P(d,q) / (P(d) P(q)) = P(d|q) / P(d)

For LM, a form of smoothing helps to estimate P(q|d). For example, linear smoothing:
P(q|d) = prod_{t in q} (lambda P(t|d) + (1-lambda) P(t|c))
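A minimal sketch (Python, toy counts of my own choosing) of the linearly smoothed LM score in log-space:

  import math

  def lm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
      # log P(q|d) = sum_{t in q} log(lam * P(t|d) + (1-lam) * P(t|c))
      score = 0.0
      for t in query:
          p_t_d = doc_tf.get(t, 0) / doc_len
          p_t_c = coll_tf.get(t, 0) / coll_len
          score += math.log(lam * p_t_d + (1 - lam) * p_t_c)
      return score

  print(lm_score(['sailing', 'boats'],
                 doc_tf={'sailing': 3, 'boats': 1}, doc_len=100,
                 coll_tf={'sailing': 300, 'boats': 100}, coll_len=100_000))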

For explaining TF-IDF via P(d|q), an "extreme" mixture helps: a term location of d is generated by P(t|q) if t occurs in q, and by P(t|c) otherwise. With P(d) decomposed over P(t|c), the non-query terms cancel in the ratio:

P(d|q)/P(d) = prod_{t in d \cap q} (P(t|q) / P(t|c))^{n_L(t,d)}

Then, there is a bit more work to do to reach TF-IDF. Essentially, the location-based probability P(t|c) needs to be transferred into a document-based probability. The notation P_L(t|c) and P_D(t|c) makes the event space of the probabilities explicit.

The assumption

P_L(t|q) * P_D(t|c) = P_L(t|c), i.e. P_L(t|q) / P_L(t|c) = 1 / P_D(t|c),

transfers the location-based expression into the definition of "naive" TF-IDF:

RSV_{naive TF-IDF}(d,q,c) := sum_{t in d \cap q} n_L(t,d) * (-log P_D(t|c))

We refer to this as "naive" TF-IDF because of the un-normalised total term frequency n_L(t,d). We were looking for a probabilistic semantics that leads to a TF normalisation that works.
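A minimal sketch (Python, toy counts of my own choosing) computing the naive TF-IDF score in both of its forms, the log of the P(d|q)/P(d) ratio under the assumption above and the familiar sum:

  import math

  # toy collection statistics
  N_D_c = 1_000                              # number of documents
  n_D = {'sailing': 200, 'boats': 50}        # document frequencies n_D(t,c)
  P_D_c = {t: n / N_D_c for t, n in n_D.items()}   # P_D(t|c)

  # toy document and query
  n_L_d = {'sailing': 3, 'boats': 1}         # within-document frequencies n_L(t,d)
  query = ['sailing', 'boats']

  # sum form: sum_t n_L(t,d) * -log P_D(t|c)
  rsv_sum = sum(n_L_d[t] * -math.log(P_D_c[t]) for t in query if t in n_L_d)

  # ratio form: log prod_t (P(t|q)/P(t|c))^{n_L(t,d)} with P(t|q)/P(t|c) = 1/P_D(t|c)
  rsv_ratio = math.log(math.prod((1.0 / P_D_c[t]) ** n_L_d[t] for t in query if t in n_L_d))

  print(round(rsv_sum, 4), round(rsv_ratio, 4))   # identical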

8.2 Disjoint term events

As an alternative to P(x) = prod_t P(t|x), there is always the possibility to estimate P(x) based on a space of disjoint and exhaustive events:
P(x) = sum_t P(x|t) P(t).

Having achieved the explanation of TF-IDF for independent terms, the question was: how does the derivation work for disjoint terms?

P(q|d) = sum_t P(q|t) P(t|d) = P(q) * sum_t P(t|q) P(t|d) * 1/P(t)

P(q|d)/P(q) = sum_t P(t|q) P(t|d) * 1/P(t)

P(d|q) = sum_t P(d|t) P(t|q) = P(d) * sum_t P(t|q) P(t|d) * 1/P(t)

P(d|q)/P(d) = sum_t P(t|q) P(t|d) * 1/P(t)

Here, the integral

\int 1/x dx = log x

brought the connection (the 1/P(t) factor is what introduces the logarithm, and with it IDF, into the disjoint-event derivation).

9. Semi-subsumed Events: A Semantics of the BM25 Term Frequency (2008/2009)

The BM25 TF quantification n/(n+K) tells us that for IR, the occurrences of a term are not independent. Amati/Rijsbergen:2002 discuss a risk-based explanation of the famous and effective BM25 TF quantification:

n / (n+K) = n * 1/(n+K)

K incorporates the pivoted document length (pivdl = dl/avgdl). The TF quantification rises fast and then saturates. The pivdl is small for short documents; thus the TF quantification is larger and rises faster for short documents than for long ones.
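A minimal sketch (Python; the parametrisation K = k1 * ((1-b) + b * pivdl) with toy parameter values is a common choice, assumed here) of the saturation and the document-length effect:

  def bm25_tf(n, dl, avgdl, k1=1.2, b=0.75):
      # BM25 TF quantification n/(n+K) with K = k1 * ((1-b) + b * dl/avgdl)
      pivdl = dl / avgdl
      K = k1 * ((1 - b) + b * pivdl)
      return n / (n + K)

  for n in [1, 2, 5, 10, 50]:
      short = bm25_tf(n, dl=50, avgdl=100)    # short document: pivdl = 0.5
      long_ = bm25_tf(n, dl=200, avgdl=100)   # long document:  pivdl = 2.0
      print(n, round(short, 3), round(long_, 3))
  # rises fast, saturates towards 1, and is larger for the short document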

The quantification n/(n+K) was motivated in Robertson/Walker:1994 as an approximation of the 2-Poisson model. The derivation is slightly lengthy, and we asked ourselves what the approximation means for IR, and for probability theory in general.

Thus, the notion of semi-subsumed events was developed.

We show in our poster that n/(n+1) can be viewed as a probabilistic assumption that lies precisely in the middle between subsumption and independence.
This assumption explains what the BM25 TF quantification expresses, namely a dependence between the occurrences of the same term. Moreover, this assumption allows for transferring the BM25 TF into probability theory in general.

10. The Case of Logical Retrieval for Patent Search (2009/2010)

Probabilistic logical retrieval is interesting for patent search because patent search is a complex search task, and the expressiveness of logic helps to model the search process and to capture the evidence that led to a decision.
The price for expressiveness is efficiency.
It is challenging to achieve efficient logical retrieval on large-scale data.
Pushing techniques of distributed IR into a reasoning engine is a potential way forward. However, the low-level implementation is time-consuming and bears many risks. Therefore, we looked into ways to use probabilistic logical modelling to model distributed IR. We were able to achieve a scalable technology by utilising the PRA layer/engine itself to model strategies of distributed IR.

11. Descriptive Modelling of Classification and Learning (2010)

To appear.