Re: Word-Level Inverted File Structure
Date: Thu, 12 Oct 2000 09:02:07 +0100
Message-ID: <8s3r8l$6is$1_at_pea.uk.research.att.com>
"Pete Nayler" <nayler_at_dingoblue.net.au> wrote in message
news:39e53208$0$27119$7f31c96c_at_news01.syd.optusnet.com.au...
>
> "Jan Hidders" <hidders_at_REMOVE.THIS.win.tue.nl> wrote in message
> news:8s1gs2$mcu$1_at_news.tue.nl...
> > Pete Nayler wrote:
> > >
> > > "Jan Hidders" <hidders_at_REMOVE.THIS.win.tue.nl> wrote in message
> > > news:8s19pp$knf$1_at_news.tue.nl...
> > > > Pete Nayler wrote:
> > > > > > The structure I'm referring to is explained in Witten et al.,
> > > > > > "Managing Gigabytes", where each word in an inverted file is
> > > > > > referenced using:
> > > > >
> > > > > <2;(1;6,9),(4;8)>
> > > > >
> > > > > where the (bracketed) terms can be expressed as
> > > > >
> > > > > (x ; y1, y2, y3, ...)
> > > > >
> > > > > where x represents the document in which the word exists, and y
> > > > > represents the word position in the document.
> > > > >
> > > > > The question is, what does the first term in the full structure
> > > > > represent?
> > > >
> > > > I'm totally guessing here, but could it be the word for which the
> > > > positions are indicated?
> > >
> > > Thanks for the reply, but in the book, it gives an example of indexing
> > > using a series of documents, giving the word listing as follows:
> > >
> > > cold - <2;(1;6),(4;8)>
> > > hot - <2;(3;2),(6;2)>
> > > warm - <2;(1;3),(4;4)>
> > > etc...
> > >
> > > As you can see, the first term is always "2", which precedes the
> > > document and then the position. Puzzling...
> > Ok. Let's try another guess: the number of documents that the word
> > occurs in?
>
> Hmmmm - you may have it there. But one question.... what purpose would
> that serve? It wouldn't really help in relevance ranking or sorting.
> Seems to me a bizarre thing to have in a string that contains more
> detailed information about the word.
Most forms of ranking are based on the classic tf.idf ranking. Roughly
speaking:

tf(t, d) = term frequency
         = number of occurrences of term t in document d
         = N in t's entry (x; y1, y2, ..., yN) for document x = d

idf(t) = inverse document frequency
       = total number of documents / number of documents containing term t
       = total number of documents / the leading number in t's <n; ...> list
tf(t, d) says that if a term t appears frequently in document d, then d
has a good chance of being about t.

idf(t) says that if a term t appears frequently throughout a collection,
it might not be a particularly good indicator of relevance.
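
To make the two quantities concrete, here is a rough Python sketch that reads one of the entries quoted earlier in the thread and computes tf and idf from it. It assumes the leading number really is the count of documents containing the word, as guessed above. The function names and the collection size of 6 are made up for illustration and do not come from the book. The idf here is the plain ratio described above, with no logarithm or other scaling.

```python
import re

def parse_postings(entry):
    """Parse an entry such as '<2;(1;6,9),(4;8)>'.

    Returns (doc_count, {document: [word positions]}), assuming the
    leading number is the number of documents containing the word.
    """
    body = entry.strip().lstrip("<").rstrip(">")
    doc_count, _, rest = body.partition(";")
    postings = {}
    for doc, positions in re.findall(r"\((\d+);([\d,]+)\)", rest):
        postings[int(doc)] = [int(p) for p in positions.split(",")]
    return int(doc_count), postings

def tf(postings, doc):
    """Term frequency: how many positions are listed for this document."""
    return len(postings.get(doc, []))

def idf(total_docs, doc_count):
    """Inverse document frequency as a plain ratio (rough form only)."""
    return total_docs / doc_count

TOTAL_DOCS = 6  # hypothetical collection size, not taken from the book

doc_count, postings = parse_postings("<2;(1;6,9),(4;8)>")
score = tf(postings, 1) * idf(TOTAL_DOCS, doc_count)
```

With this entry the word occurs twice in document 1 and in 2 of the 6 assumed documents, so the leading count is exactly what the idf part of the score needs, without having to walk the whole list.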
See:
http://www.ftp.cl.cam.ac.uk/ftp/papers/reports/TR356-ksj-approaches-to-text-retrieval.html
for more information on text ranking methods.

Received on Thu Oct 12 2000 - 10:02:07 CEST