This page contains a list of corrections (and discussion points) for the book "IR Models: Foundations & Relationships" (Morgan & Claypool).

page 14: eq 2.23, idf_total: use -log(df_total(t,c)+1)

By Aldo Lipani, January 2016
idf_total(t,c) := -log df_total(t,c): does it make sense?
Response: Yes, although the definition is included primarily to make the range of idf variants complete (mirroring the range of df variants). idf_total is negative for high df values; from this point of view, it does what it is supposed to do. Aldo pointed out that the negative value leads to problems.


Discussion: When discussing this with Aldo, the first thing I noticed was that it should be -log(df(t,c)+1), similar to the discussion for log(tf_d+1).
A parameter-free TF-IDF model based only on tf_d and df(t,c) would then look like this:
w_TF-IDF(t,d,c) := log(tf_d+1) * -log(df(t,c)+1)
Viewed in isolation, the idf does what is expected (higher values for rarer terms). The problem is that, because this idf is negative, high values of tf_d lower the score, which is not intuitive (in other words, it violates axioms required of a term weight).
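The effect can be checked numerically. The following is a minimal sketch, not taken from the book: it assumes that df(t,c) is a raw document count and that log is the natural logarithm, and the function name w_tf_idf is chosen here for illustration only.

```python
import math

def w_tf_idf(tf_d: int, df: int) -> float:
    """Parameter-free weight log(tf_d+1) * -log(df+1).

    Assumption: df is a raw document count (df >= 0), so the idf factor
    -log(df+1) is never positive.
    """
    return math.log(tf_d + 1) * -math.log(df + 1)

# Same term (assumed to occur in 100 documents), growing within-document frequency.
for tf_d in (1, 2, 5, 10):
    print(tf_d, w_tf_idf(tf_d, df=100))
```

Under these assumptions, the printed weights become more negative as tf_d grows, while for a fixed tf_d a rarer term still receives the higher (less negative) weight.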

page 33: definition 2.19: add log to the right side of w_RSJ

By Ingo Frommholz, September 2014
Should w_RSJ be the log over the fraction of document frequencies?
Response: Yes, it is a typo. On page 47, equation 2.99 (definition of w_BM25), w_RSJ is assumed to contain the log.


Discussion: The book attempts a unifying definition of term weights. In doing so, it leaped from one extreme (all weights with log) to the other (all weights without log), as it was difficult to align the TF-IDF and LM sides of IR.
The motivation is to simplify and unify the RSV's on page 78, Figure 2.10, Retrieval Models Overview.

page 54: LM-based RSV's

By Ingo Frommholz, September 2014
JM-LM retrieval status value, page 54, definition 2.42, equation 2.115. Should it be d \cap q?
Response: Yes, one could use "t in d \cap q", but "t in q" captures the more general case.


Discussion: For the LM case, it is always tricky to see whether the product (or sum of logs) is over "t in q" or over "t in d \cap q".
Equation 2.113 makes explicit that, because of P(t|d)=0 for non-document terms, the product can be over "t in d \cap q".
Therefore, 2.115 is correct, and so is 2.116.
2.116, however, should be an equation, not a definition, since 2.116 just shows the decomposed form of the definition in 2.115.
The same applies to 2.119: 2.118 is the definition, and 2.119 shows the decomposed form.
Note that for the Dirich-LM case, the product MUST be over "t in q", before we isolate the document-specific part, as shown in 2.121.
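The difference between the two cases can be illustrated with a small sketch. It uses the common textbook forms of Jelinek-Mercer and Dirichlet smoothing, each normalised by P(t|c); these are assumptions and not necessarily identical to equations 2.115 and 2.121. The names lam and mu, the toy counts, and all function names are illustrative. For JM, the document-independent constant log(1-lam) per query term is dropped, which does not affect the ranking.

```python
import math
from collections import Counter

def ratio(t, doc_tf, coll_tf):
    """P(t|d) / P(t|c) with maximum-likelihood estimates."""
    p_t_d = doc_tf[t] / sum(doc_tf.values())    # 0 for non-document terms
    p_t_c = coll_tf[t] / sum(coll_tf.values())  # assumed > 0 for query terms
    return p_t_d / p_t_c

def jm_rsv(terms, doc_tf, coll_tf, lam=0.5):
    """Sum of log(1 + lam/(1-lam) * P(t|d)/P(t|c)) over the given terms."""
    return sum(math.log(1 + lam / (1 - lam) * ratio(t, doc_tf, coll_tf))
               for t in terms)

def dirichlet_rsv(terms, doc_tf, coll_tf, mu=2000.0):
    """Sum of log(mu/(|d|+mu) + |d|/(|d|+mu) * P(t|d)/P(t|c)) over the given terms."""
    doc_len = sum(doc_tf.values())
    return sum(math.log(mu / (doc_len + mu)
                        + doc_len / (doc_len + mu) * ratio(t, doc_tf, coll_tf))
               for t in terms)

# Toy data (hypothetical term counts).
doc_tf = Counter({"sailing": 2, "boats": 1})
coll_tf = Counter({"sailing": 10, "boats": 5, "east": 20, "coast": 15})
query = ["sailing", "east"]
shared = [t for t in query if doc_tf[t] > 0]  # the terms in d \cap q

print("JM:       ", jm_rsv(query, doc_tf, coll_tf), jm_rsv(shared, doc_tf, coll_tf))
print("Dirichlet:", dirichlet_rsv(query, doc_tf, coll_tf), dirichlet_rsv(shared, doc_tf, coll_tf))
```

In this sketch, a non-document term contributes log(1) = 0 to the JM sum, so summing over "t in q" and over "t in d \cap q" gives the same value. In the Dirichlet sum, the same term contributes log(mu/(|d|+mu)), which depends on the document length and therefore cannot be dropped before the document-specific part is isolated.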

page 108: KL-Divergence: replace H(Y) by H(X)

By Marco Bonzanini, June 2014
Page 108, the section overview: KL divergence ... -H(Y). Should it be -H(X)?
Response: Yes, it is a typo. On page 111, equation 3.96, it is correct.


Discussion: The typo shows how easy it is to confuse conditional entropy and KL-Divergence; see page 111.
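A quick numerical check makes clear which entropy is subtracted. The sketch below assumes the standard identity D_KL(P_X || P_Y) = H_cross(X;Y) - H(X), with natural logarithms; the two toy distributions and the function names are illustrative and not taken from the book.

```python
import math

def entropy(p):
    """H(X) = -sum_x P_X(x) * log P_X(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H_cross(X;Y) = -sum_x P_X(x) * log P_Y(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D_KL(P_X || P_Y) = sum_x P_X(x) * log(P_X(x) / P_Y(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical distributions over three events.
p_x = [0.5, 0.25, 0.25]
p_y = [0.7, 0.2, 0.1]

print(kl_divergence(p_x, p_y))
print(cross_entropy(p_x, p_y) - entropy(p_x))  # subtracting H(X)
print(cross_entropy(p_x, p_y) - entropy(p_y))  # subtracting H(Y)
```

Under this assumption, only the variant that subtracts H(X) reproduces the KL divergence; subtracting H(Y) gives a different number for these distributions.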

Many thanks for the suggestions. Thomas Roelleke, January 2016