Re: Contextual or Latent Semantic Analysis

From: Marshall Lucas <mlucas_at_liapartners.com>
Date: Thu, 27 Jan 2005 11:46:57 -0600
Message-ID: <mO9Kd.1470$7J.69_at_okepread04>


Neo wrote:

> From the blog, I was interested by the following statement "... are
> best served by a MultiValue database schema.... A small example would
> be the following nodes (doc1, writer, compiler, programmer; doc2,
> writer, editor, publisher)". Could you explain further? What are you
> trying to do once that data is in the db. Following script models the
> above data with a small experimental db:
>
> // Create items in directory to classify things
> (CREATE *doc .item ~in = dir)
> (CREATE *writer .item ~in = dir)
> (CREATE *compiler .item ~in = dir)
> (CREATE *programmer .item ~in = dir)
> (CREATE *editor .item ~in = dir)
> (CREATE *publisher .item ~in = dir)
> (CREATE *person .item ~in = dir)
>
> // Create persons
> (CREATE *john .cls = person)
> (CREATE *mary .cls = person)
> (CREATE *bob .cls = person)
> (CREATE *joe .cls = person)
> (CREATE *jim .cls = person)
> (CREATE *jack .cls = person)
>
> // Create doc1
> (CREATE *doc1 .cls = doc)
> (CREATE doc1 .writer = john)
> (CREATE doc1 .compiler = mary)
> (CREATE doc1 .programmer = bob)
>
> // Create doc2
> (CREATE *doc2 .cls = doc)
> (CREATE doc2 .writer = joe)
> (CREATE doc2 .editor = jim)
> (CREATE doc2 .publisher = jack)
>
> // Find doc whose publisher is jack.
> // Finds doc2.
> (SELECT %.cls=doc & %.publisher=jack)
>

In a multivalued record, my example would actually be more like:

Record ID=c*1
Field1 (name)   = Demo Corpus
Field2 (docs)   = 2
Field3 (terms)  = 5

Record ID=d*1*doc1
Field1 (terms)  = writer]compiler]programmer
Field2 (counts) = 1]2]1

Record ID=t*1*writer
Field1 (docs) = doc1]doc2
Field2 (counts) = 1]2

Record ID=t*1*compiler
Field1 (docs) = doc1
Field2 (counts) = 2

Record ID=t*1*programmer
Field1 (docs) = doc1
Field2 (counts) = 1

Record ID=d*1*doc2
Field1 (terms)  = writer]editor]publisher
Field2 (counts) = 2]1]3

Record ID=t*1*editor
Field1 (docs) = doc2
Field2 (counts) = 1

Record ID=t*1*publisher
Field1 (docs) = doc2
Field2 (counts) = 3

The ] is what MultiValue databases call a value mark; it allows multiple values to be stored in a single field. The real kicker here is that if I need to add a new field, or a new value within a field, I just add it. There is no need to restructure the whole database; only the records that have the field added use it. So, when I discovered that I needed to weight the counts, I just added Field3 (weights), calculated the weighted value for each, and stored them. Also, as I propagate energy through the nodes, I can store the resultant energy signature as Field4 (energy).

The first character in each ID is a type identifier: c=corpus, t=term, d=document. The number between the *s is the corpus ID. This lets me store multiple corpora in a single table, not to mention multiple record types, like the c*1 record, which has a different format from the others. It also means I can copy a single "file" (table) to move my corpora from my test machine to my live machine, and I can quickly "join" the data without any convoluted SQL join statements.
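To make the record layout concrete, here is a minimal sketch in Python that stands in for the MultiValue file. The ] value mark and the c*/d*/t* ID scheme follow the records above; the dict store and the names `multivalues` and `record_kind` are my own illustration, not the actual system.

```python
VM = "]"  # value mark, as written in the example records above

# A few of the example records, keyed by record ID; each field holds a
# value-marked string exactly as it would sit in the MultiValue field.
records = {
    "c*1":        {"name": "Demo Corpus", "docs": "2", "terms": "5"},
    "d*1*doc1":   {"terms": "writer]compiler]programmer", "counts": "1]2]1"},
    "t*1*writer": {"docs": "doc1]doc2", "counts": "1]2"},
}

def multivalues(rec_id, field):
    """Split a value-marked field into its list of values."""
    return records[rec_id][field].split(VM)

def record_kind(rec_id):
    """Decode an ID: type char (c/d/t), corpus number, and optional key."""
    parts = rec_id.split("*")
    key = parts[2] if len(parts) > 2 else None
    return parts[0], parts[1], key
```

The point of the layout survives the sketch: adding Field3 (weights) is just another key in a record's dict, and records that lack it are simply records that don't use it.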

This represents a graph of documents and terms as nodes connected by energy (the counts) between them. The graph can be traversed by injecting a set amount of energy at a node and applying entropic rules; the energy propagates through the graph and yields the set of documents that most closely match the node where the energy was injected. Look for the Spreading Activation work described by Scott Preece and the more recent work on Contextual Network Graphs by Maciej Ceglowski. My contribution is merely in the realm of storage and retrieval of the graph nodes and edges.
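A minimal spreading-activation sketch over the doc/term graph from the records above, loosely in the spirit of Ceglowski's contextual network graphs. The decay constant, the count-proportional split of outgoing energy, and the cutoff threshold are my assumptions; the post does not specify the entropic rules.

```python
from collections import defaultdict

# Adjacency with edge energies (the counts) taken from the example records.
edges = {
    "doc1": {"writer": 1, "compiler": 2, "programmer": 1},
    "doc2": {"writer": 2, "editor": 1, "publisher": 3},
    "writer": {"doc1": 1, "doc2": 2},
    "compiler": {"doc1": 2},
    "programmer": {"doc1": 1},
    "editor": {"doc2": 1},
    "publisher": {"doc2": 3},
}

def spread(start, energy=1.0, decay=0.5, threshold=0.01):
    """Inject energy at one node and let it propagate until it dies out.

    Each hop, a node passes decay * energy to its neighbours, split in
    proportion to edge counts; contributions below threshold are dropped,
    so the frontier eventually empties and the loop terminates.
    """
    activation = defaultdict(float)
    frontier = {start: energy}
    while frontier:
        nxt = defaultdict(float)
        for node, e in frontier.items():
            activation[node] += e
            total = sum(edges[node].values())
            for nb, w in edges[node].items():
                out = e * decay * w / total
                if out > threshold:
                    nxt[nb] += out
        frontier = nxt
    return activation
```

Injecting at "writer" activates doc2 more strongly than doc1, since writer carries count 2 into doc2 but only 1 into doc1, matching the ranking behaviour described above.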

This style of data storage also lets me keep a much smaller memory footprint while processing the graph. I can read one node at a time into memory, process it, then move on to the next node. Most other schemas would require the whole graph to be loaded into memory as a linked list or large array and processed each time a query was submitted. That requires large amounts of memory, whereas my schema uses around 4K max.
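The node-at-a-time idea can be sketched like this: each hop of the traversal reads only the one record it touches, via a keyed lookup standing in for a single MultiValue file read. The helper names are illustrative, and the in-memory dict is just a stand-in for the file.

```python
def read_record(rec_id):
    """One keyed read; in the real system this hits the MV file directly."""
    store = {
        "t*1*writer": ("doc1]doc2", "1]2"),
        "d*1*doc1": ("writer]compiler]programmer", "1]2]1"),
        "d*1*doc2": ("writer]editor]publisher", "2]1]3"),
    }
    return store[rec_id]

def neighbours(rec_id):
    """Yield (neighbour_id, count) pairs; only this one record is in memory.

    Term records (t*) point at documents, document records (d*) at terms,
    so the neighbour type is simply the opposite of the current record's.
    """
    names, counts = read_record(rec_id)
    other = "d" if rec_id.split("*")[0] == "t" else "t"
    for name, c in zip(names.split("]"), counts.split("]")):
        yield f"{other}*1*{name}", int(c)
```

Because `neighbours` is a generator over a single record, the working set per step is one record's fields, which is how the footprint can stay in the low kilobytes regardless of corpus size.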

Marshall
