Re: Word-Level Inverted File Structure

From: Tim Mills <tjm_at_orl.co.uk>
Date: Thu, 12 Oct 2000 09:02:07 +0100
Message-ID: <8s3r8l$6is$1_at_pea.uk.research.att.com>


"Pete Nayler" <nayler_at_dingoblue.net.au> wrote in message news:39e53208$0$27119$7f31c96c_at_news01.syd.optusnet.com.au...
>
> "Jan Hidders" <hidders_at_REMOVE.THIS.win.tue.nl> wrote in message
> news:8s1gs2$mcu$1_at_news.tue.nl...
> > Pete Nayler wrote:
> > >
> > > "Jan Hidders" <hidders_at_REMOVE.THIS.win.tue.nl> wrote in message
> > > news:8s19pp$knf$1_at_news.tue.nl...
> > > > Pete Nayler wrote:
> > > > > The structure I'm referring to is explained in Witten et al "Managing
> > > > > Gigabytes", where each word in an inverted file is referenced using:
> > > > >
> > > > > <2;(1;6,9),(4;8)>
> > > > >
> > > > > where the (bracketed) terms can be expressed as
> > > > >
> > > > > (x ; y1, y2, y3, ...)
> > > > >
> > > > > where x represents the document in which the word exists, and y
> > > > > represents the word position in the document.
> > > > >
> > > > > The question is, what does the first term in the full structure
> > > > > represent?
> > > >
> > > > I'm totally guessing here, but could it be the word for which the
> > > > positions are indicated?
> > >
> > > Thanks for the reply, but in the book, it gives an example of indexing
> > > using a series of documents, giving the word listing as follows:
> > >
> > > cold - <2;(1;6),(4;8)>
> > > hot - <2;(3;2),(6;2)>
> > > warm - <2;(1;3),(4;4)>
> > > etc...
> > >
> > > As you can see, the first term is always "2", which precedes the
> > > document and then the position. Puzzling...

> > Ok. Let's try another guess: the number of documents that the word
> > occurs in?
>
> Hmmmm - you may have it there. But one question.... what purpose would that
> serve? It wouldn't really help in relevance ranking or sorting. Seems to me
> a bizarre thing to have in a string that contains more detailed information
> about the word.
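
It does serve a purpose. First, to pin down the notation, here is a minimal
sketch (my own Python, not anything from the book) of one of those postings as
data; the leading number is simply the count of per-document groups:

  # MG-style posting for "cold": <2; (1; 6, 9), (4; 8)>
  # (2 documents; word positions 6 and 9 in document 1, position 8 in document 4)
  cold = (2, [(1, [6, 9]), (4, [8])])

  doc_count, groups = cold
  assert doc_count == len(groups)        # the first term counts the documents

  for doc_id, positions in groups:
      print("doc %d: %d occurrence(s) at %s" % (doc_id, len(positions), positions))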

Most forms of ranking are based on the classic tf.idf weighting. Roughly speaking:

  tf(t, d) = term frequency
           = number of occurrences of term t in document d
           = N in (x; y1, y2, ..., yN)
  idf(t)   = inverse document frequency
           = total number of documents / number of documents containing term t
           = total number of documents / the leading term of the posting (the 2 in <2; ...>)

tf(t, d) says that if a term t appears frequently in document d, the document has
a good chance of being about t.
idf(t) says that if a term t appears frequently throughout a collection, it might
not be a particularly good indicator of relevance.

See:

http://www.ftp.cl.cam.ac.uk/ftp/papers/reports/TR356-ksj-approaches-to-text-retrieval.html

for more information on text ranking methods. Received on Thu Oct 12 2000 - 10:02:07 CEST
