# Summaries

For quick look-ups, I maintain here, in loose and informal form, summaries of previous and current work.

## 1. PRA (1994 - 1997)

In 1993, Norbert Fuhr told me: "The motivation for an intensional PRA is that 'DB \ IR + IR \ DB' does not work." It took me a year to understand what he meant, and around ten years to understand the deeper truth underlying this sentence.

## 2. POOL (1996 - 1999)

Since PRA and PD were relational, and since I found Fagin/Halpern (Reasoning about Knowledge), Kifer/Lausen (F-Logic), Meghini/Sebastiani/Straccia (MIRTL) and other knowledge representations exciting, I designed POOL (Probabilistic Object-oriented Logic), aiming at a seamless integration of term-based IR, probabilistic reasoning, and object-oriented modelling.

## 3. Probability of Being "Informative" (2001 - 2003)

In a probabilistic reasoning system, usually occurrence-based probabilities are aggregated. For example, given an event t that occurs n_t times in N trials, n_t/N is the probability that the event occurs. This estimate is also referred to as the maximum-likelihood estimate, assuming a Binomial distribution. The aggregation of probabilities is governed by set theory; for example, P(t_1 AND t_2) = P(t_1) * P(t_2) for two independent events.
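As a minimal illustration (with hypothetical counts), the maximum-likelihood estimate and the aggregation of independent events can be sketched as:

```python
# Occurrence-based (maximum-likelihood) estimate: an event t observed
# n_t times in N trials has probability n_t / N.
def ml_estimate(n_t, N):
    return n_t / N

# Hypothetical counts: t1 occurs 30 times, t2 occurs 50 times, in N=100 trials.
p_t1 = ml_estimate(30, 100)
p_t2 = ml_estimate(50, 100)

# For two independent events, the conjunction multiplies:
p_and = p_t1 * p_t2
print(p_t1, p_t2, p_and)
```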

In IR, the IDF (idf(t,c):=-log P(t|c)) is an important factor. (t is a term, c is the collection). If a probabilistic reasoning system wants a notion of "high probability for rare term", how can this be achieved?

A normalisation such as (-log P(t|c)) / (-log P(t_min|c)) seems an option, since it assigns 1.0 to the rarest term t_min, values less than 1.0 to more frequent terms, and 0.0 to terms with P(t|c)=1.

The pragmatic person might say: good, let's stop here, that's a value between 1 and 0, and let us call it a probability. The theoretically inclined person will ask: What does this probability mean?

Therefore, I investigated ways to assign a semantics to idf(t,c)/maxidf(c). This led to the following equation:

P(t occurs | c) = (1 - P(t informs | c))^maxidf(c)

where
P(t informs | c) = idf(t,c) / maxidf(c) = (-log P(t occurs | c)) / maxidf(c)

The equation can be proven using the Euler limit:

e^{-x} = lim_{N->infty} (1 - x/N)^N

For x := idf(t,c) and N := maxidf(c), it follows:

e^{-idf(t,c)} approx (1 - idf(t,c) / maxidf(c)) ^ {maxidf(c)}
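The approximation can be checked numerically; a minimal sketch, assuming hypothetical collection statistics (N_D documents, of which df_t contain term t). Note that e^{-idf(t,c)} = P(t occurs | c) by definition, and the Euler limit guarantees that (1 - idf/N)^N approaches it as the exponent N grows:

```python
import math

# Hypothetical collection statistics: N_D documents; term t occurs in df_t of them.
N_D = 1_000_000
df_t = 2_000

p_occurs = df_t / N_D                  # P(t occurs | c)
idf = -math.log(p_occurs)              # idf(t,c)
maxidf = math.log(N_D)                 # idf of a term occurring in only 1 document
p_informs = idf / maxidf               # P(t informs | c) = idf(t,c) / maxidf(c)

# Semantics: P(t occurs | c) ~= (1 - P(t informs | c)) ** maxidf(c)
approx = (1 - p_informs) ** maxidf

# Convergence of the Euler limit: larger exponents approach e^{-idf} = p_occurs.
for N in (maxidf, 100, 10_000):
    print(N, (1 - idf / N) ** N)
```

The approximation with N = maxidf(c) is coarse when idf(t,c)/maxidf(c) is not small, which is the price of the probabilistic semantics.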

Having found a probabilistic semantics makes the theoretician happy.

This work was actually motivated by a "feature" of the distinct projection in our PRA. In 2001, Gabriella Kazai used a PRA expression to compute an idf-based probability. She assigned the probability 1/N (N is the number of documents) to each tuple in a non-probabilistic document index. Then, n/N was to be computed, which corresponds in PRA to

Project DISJOINT[$1](index_1_N)

By accident, the following expression was executed:

Project INDEPENDENT[$1](index_1_N)

Then, the maximal probability for a frequent term was 1-e^{-1} ≈ 0.632, i.e. the probability that the term occurs in at least one document:

1 - (1 - 1/N)^n ≈ 1 - e^{-n/N}, and for n = N: 1 - (1 - 1/N)^N ≈ 1 - e^{-1}
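The accident can be replayed numerically; a minimal sketch, assuming an index of N documents where each tuple carries probability 1/N, and a maximally frequent term occurring in all n = N documents:

```python
import math

# Hypothetical index: term occurs in n of N documents, each tuple has prob 1/N.
N = 1_000_000
n = N  # a maximally frequent term: occurs in every document

# Project DISJOINT: tuple probabilities add up.
p_disjoint = n * (1 / N)                 # = n/N, here 1.0

# Project INDEPENDENT: probability that at least one tuple is true.
p_independent = 1 - (1 - 1 / N) ** n     # -> 1 - e^{-n/N}

print(p_disjoint, p_independent, 1 - math.exp(-1))
```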

The further investigation led to the probability of being informative, and later to the PRA-based formulation of probability estimations, the relational Bayes.

## 4. A General Matrix Framework (1999 - 2006)

This work was originally motivated by a relationship between the total probability and the GVSM (generalised vector-space model):

\sum_t P(q|t) P(t|d) = P(q) * \sum_t P(t|q) P(t|d) * 1/P(t)
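The identity rests only on Bayes' theorem, P(q|t) = P(t|q) P(q) / P(t); a minimal numerical check with hypothetical distributions over a three-term space:

```python
# Hypothetical distributions over a small term space.
terms = ["t1", "t2", "t3"]
P_t = {"t1": 0.5, "t2": 0.3, "t3": 0.2}            # P(t)
P_t_given_q = {"t1": 0.6, "t2": 0.3, "t3": 0.1}    # P(t|q)
P_t_given_d = {"t1": 0.2, "t2": 0.5, "t3": 0.3}    # P(t|d)
P_q = 0.01                                         # P(q), hypothetical

# Left side: sum_t P(q|t) P(t|d), with P(q|t) = P(t|q) P(q) / P(t)  (Bayes)
lhs = sum(P_t_given_q[t] * P_q / P_t[t] * P_t_given_d[t] for t in terms)

# Right side: P(q) * sum_t P(t|q) P(t|d) / P(t)
rhs = P_q * sum(P_t_given_q[t] * P_t_given_d[t] / P_t[t] for t in terms)

print(lhs, rhs)
```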

## 5. The Relational Bayes (2002 - 2006)

Long paper (and story) cut short, the conventional PRA rests on the famous five operators of relational algebra: Select, Project, Join, Unite, Subtract. The famous five AGGREGATE probabilities. The new operator, the relational Bayes, ESTIMATES probabilities.
The operations are:

    Select[condition](prae)
    Project assumption [attributes](prae)
    Join assumption [condition](prae1, prae2)
    Unite assumption (prae1, prae2)
    Subtract assumption (prae1, prae2)
    Bayes assumption [evidenceKeyAttributes](prae)

PRAE stands for probabilistic relational algebra expression.
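As a rough illustration (not the actual PRA engine; the relation encoding, function name, and syntax are simplifications of mine), the estimation character of the Bayes operator can be sketched as: given a probabilistic relation, Bayes with the DISJOINT assumption normalises each tuple's probability by the total probability of its evidence-key group.

```python
from collections import defaultdict

# A probabilistic relation: a list of (probability, tuple) pairs.
# Hypothetical term index of (term, docId) tuples, all with probability 1.0.
index = [
    (1.0, ("sailing", "d1")),
    (1.0, ("boats", "d1")),
    (1.0, ("sailing", "d2")),
]

def bayes_disjoint(prae, evidence_key):
    """Sketch of Bayes DISJOINT [evidenceKey](prae): normalise each tuple's
    probability by the total probability of its evidence-key group.
    evidence_key lists attribute positions; [] normalises by the grand total."""
    totals = defaultdict(float)
    for p, tup in prae:
        totals[tuple(tup[i] for i in evidence_key)] += p
    return [(p / totals[tuple(tup[i] for i in evidence_key)], tup)
            for p, tup in prae]

# Bayes [](index): each tuple is ESTIMATED to have probability 1/3.
print(bayes_disjoint(index, []))
```

The famous five aggregate the probabilities already present in their arguments; the sketch shows how Bayes instead turns frequencies (here: tuple counts) into probabilities.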

## 6. Relevance Feedback: A Loss of Entropy but a Gain for IDF? (2004)

The BIR term weight can be formulated in an idf-based way. For positive term events:
w_bir(t,r,\bar{r}) = idf(t,r) - idf(t,\bar{r})
For positive and negative term events:
w_bir(t,r,\bar{r}) = idf(t,r) - idf(t,\bar{r}) + idf(\bar{t},\bar{r}) - idf(\bar{t},r)
Here, idf is defined as the logarithm of the document-based term probability P_D(t|x):
idf(t,x) := -log P_D(t|x)
Sets of different granularity are used for estimating probabilities: few relevant documents, but many non-relevant documents (for the case that statistics over non-relevant documents are approximated by statistics over the whole collection).
Combining two probabilities can be very different from combining on the frequency level. For example, assume two probabilities: 0.5 = 10/20 and 0.8 = 8,000,000/10,000,000. Combining the evidence on the frequency level yields (8,000,000+10)/(10,000,000+20) ≈ 0.8, a value dominated by the probability based on the large frequencies. Hence the idea to replace n/N-based probabilities by the BM25 TF normalisation n/(n+K), or by Poisson-based (exponential) probability functions, since then the probabilities and idf-values might be more comparable than they are for sets of different granularity.
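The dominance effect can be reproduced with the numbers from the example above (the choice K = 1 in the BM25-style normalisation is hypothetical):

```python
# The two estimates from the example: 0.5 from small sets, 0.8 from large sets.
n1, N1 = 10, 20
n2, N2 = 8_000_000, 10_000_000

# Combining on the frequency level: dominated by the large-sample evidence.
p_freq = (n1 + n2) / (N1 + N2)
print(p_freq)  # close to 0.8, the small-sample evidence is swamped

# A BM25-style normalisation n/(n+K) saturates, dampening raw magnitudes
# (K = 1 is a hypothetical choice for illustration).
K = 1
print(n1 / (n1 + K), n2 / (n2 + K))
```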

## 7. A Parallel Derivation of Probabilistic Retrieval Models (2004 - 2006)

Documents are represented in different forms:
1. As binary vectors, where 1 means the term occurs and 0 means it does not occur in the document. This representation is used in the BIR model.
2. As sequences of terms. Used in Language Modelling.
3. As frequency vectors. Frequency values range from 0 to infinity.
We observed that in some approaches the estimates appeared in mixed form. Therefore, we became interested in a clean derivation. This was also important for the clear definition of PRA-based probabilistic estimates via the relational Bayes, where location-based (sequences of terms) and document-based probabilities can be specified.

| Model | BIR | Poisson | Language Modelling |
|---|---|---|---|
| Event space | {0, 1} | {0, 1, 2, ...} | {t_1, t_2, ...} |
| Background model | sets of documents | averages observed in collection | sequence of terms in collection |
| Probabilities | P(t\|r) = n_D(t,r)/N_D(r), P(t\|\bar{r}) = n_D(t,\bar{r})/N_D(\bar{r}) | P(k_t\|c) = lambda^{k_t}/k_t! * e^{-lambda} | P(t\|c) = n_L(t,c)/N_L(c), P(t\|d) = n_L(t,d)/N_L(d) |

The work made the relationship between location-based and document-based term probabilities explicit. This was referred to as the Poisson bridge (because of lambda(t,c) = n_L(t,c) / N_D(c)):

P_L(t|c) * avgdl(c) = P_D(t|c) * avgtf(t,c) = lambda(t,c)

The proof is by inserting the definitions:

n_L(t,c)/N_L(c)  *  N_L(c)/N_D(c) = n_D(t,c)/N_D(c)  *  n_L(t,c)/n_D(t,c) = n_L(t,c) / N_D(c)

The Poisson bridge helps to transfer location-based into document-based probabilities, or the other way round. This is useful when relating retrieval models.
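The bridge can be verified numerically; a minimal sketch with hypothetical collection counts:

```python
# Hypothetical collection counts.
n_L_t_c = 5_000      # locations (occurrences) of term t in collection c
n_D_t_c = 2_000      # documents of c containing t
N_L_c = 1_000_000    # total number of locations in c
N_D_c = 10_000       # total number of documents in c

P_L = n_L_t_c / N_L_c        # location-based P_L(t|c)
P_D = n_D_t_c / N_D_c        # document-based P_D(t|c)
avgdl = N_L_c / N_D_c        # average document length avgdl(c)
avgtf = n_L_t_c / n_D_t_c    # average tf of t in its documents, avgtf(t,c)
lam = n_L_t_c / N_D_c        # lambda(t,c)

# Poisson bridge: P_L(t|c) * avgdl(c) = P_D(t|c) * avgtf(t,c) = lambda(t,c)
print(P_L * avgdl, P_D * avgtf, lam)
```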

## 8. TF-IDF and Language Modelling: A Study of Probabilities and Theories (2006 - 2008)

### 8.1 Independent term events

Language Modelling is based on P(q|d). Hiemstra:2000 showed that, because P(q) is constant across documents, P(q|d)/P(q) can be computed (to be more precise, P(q|d) / (prod_t (1-lambda) P(t|c))).

Our research on modelling TF-IDF and Language Modelling led to a formulation of TF-IDF based on P(d|q)/P(d). This is relatively exciting, since it shows that TF-IDF and Language Modelling are actually the same before decomposition of query (document) probabilities.

P(q|d) / P(q) = P(d,q) / (P(d) P(q)) = P(d|q) / P(d)

For LM, a form of smoothing helps to estimate P(q|d). For example, linear smoothing:
P(q|d) = prod_{t in q} (lambda P(t|d) + (1-lambda) P(t|c))

For explaining TF-IDF via P(d|q), an "extreme" mixture helps:
P(d|q) = prod_{t in d} P_mix(t|q,c), with P_mix(t|q,c) = P(t|q) if t in q, and P(t|c) otherwise

P(d|q)/P(d) = prod_{t in d \cap q} (P(t|q) / P(t|c))^{n_L(t,d)}

Then, there is a bit more work to do to reach TF-IDF. Essentially, the location-based probability P(t|c) needs to be transferred into the document-based probability. The notation P_L(t|c) and P_D(t|c) makes the event space of the probabilities explicit.

The assumption

P_L(t|q) := P_L(t|c) / P_D(t|c)

transfers the location-based expression to the definition of "naive" TF-IDF:

RSV_{naive TF-IDF}(d,q,c) := sum_{t in d \cap q} n_L(t,d) -log P_D(t|c)

We refer to this as "naive" because of the total term frequency n_L(t,d). We were looking for a probabilistic semantics to reach a TF normalisation that works.
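The naive TF-IDF RSV can be computed directly; a minimal sketch with hypothetical statistics for the terms shared by document d and query q:

```python
import math

# Hypothetical statistics for the terms in d ∩ q.
# Each entry: term -> (n_L(t,d), P_D(t|c)).
shared = {
    "sailing": (3, 0.01),
    "boats":   (1, 0.10),
}

# RSV_{naive TF-IDF}(d,q,c) = sum_{t in d ∩ q} n_L(t,d) * (-log P_D(t|c))
rsv = sum(tf * -math.log(p_d) for tf, p_d in shared.values())
print(rsv)
```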

### 8.2 Disjoint term events

As an alternative to P(x) = prod_t P(t|x), there is always the possibility to estimate P(x) based on a space of disjoint and exhaustive events:
P(x) = sum_t P(x|t) P(t).

Having achieved the explanation of TF-IDF for independent terms, the question was: How does the derivation work for disjoint terms?

P(q|d) = \sum_t P(q|t) P(t|d) = P(q) * sum_t P(t|q) P(t|d) * 1/P(t)

P(q|d)/P(q) = \sum_t P(t|q) P(t|d) * 1/P(t)

P(d|q) = \sum_t P(d|t) P(t|q) = P(d) * sum_t P(t|q) P(t|d) * 1/P(t)

P(d|q)/P(d) = \sum_t P(t|q) P(t|d) * 1/P(t)

Here, the integral

\int (1/x) dx = \log x

brought the connection.

## 9. Semi-subsumed Events: A Semantics of the BM25 Term Frequency (2008/2009)

The BM25 TF quantification n/(n+K) tells us that for IR, the occurrences of a term are not independent. Amati/Rijsbergen:2002 discuss a risk-based explanation of the famous and effective BM25 TF quantification:

n / (n+K) = n * 1/(n+K)

K incorporates the pivoted document length (pivdl = dl/avgdl). The TF quantification rises fast and then saturates. The pivdl is small for short documents; thus the TF quantification is larger, and rises faster, for short documents than for long ones.
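The saturation behaviour can be sketched as follows (a simplification: here K = k1 * pivdl, whereas the full BM25 K also involves the parameter b; the choice k1 = 1.2 is merely a common default, not from the text):

```python
# BM25 TF quantification n/(n+K), with K depending on the pivoted
# document length pivdl = dl/avgdl (simplified: K = k1 * pivdl).
def bm25_tf(n, dl, avgdl, k1=1.2):
    K = k1 * (dl / avgdl)
    return n / (n + K)

# Short (dl=50) vs long (dl=200) document, avgdl=100: the short document's
# TF quantification is larger and rises faster, and both saturate below 1.
avgdl = 100
for n in (1, 2, 5, 10, 50):
    print(n, bm25_tf(n, dl=50, avgdl=avgdl), bm25_tf(n, dl=200, avgdl=avgdl))
```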

The quantification n/(n+K) was motivated in Robertson/Walker:1994 as an approximation of the 2-Poisson model. The derivation is slightly lengthy, and we asked ourselves what the approximation means for IR, and for probability theory in general.

Thus, the notion of semi-subsumed events was developed.

We show in our poster that n/(n+1) can be viewed as a probabilistic assumption that lies precisely in the middle between subsumption and independence.
This assumption explains what the BM25 TF quantification expresses, namely a dependence between the occurrences of the same term. Moreover, this assumption allows for transferring the BM25 TF into probability theory in general.

## 10. The Case of Logical Retrieval for Patent Search (2009/2010)

Probabilistic logical retrieval is interesting for patent search because patent search is a complex search task, and the expressiveness of logic helps to model the search process and to capture the evidence that led to making a decision.
The price for expressiveness is efficiency.
It is challenging to achieve efficient logical retrieval on large-scale data.
Pushing techniques of distributed IR into a reasoning engine is a potential way forward. The low-level implementation is time-consuming and bears many risks. Therefore, we looked into ways to use probabilistic logical modelling to model distributed IR. We could achieve a scalable technology by utilising the PRA layer/engine itself to model strategies of distributed IR.

To appear.