What do you think?

From: John Jacques <jjacwqu1_at_berkshire.rr.com>
Date: 2000/07/29
Message-ID: <3982F308.8E5C8FFF_at_berkshire.rr.com>#1/1


Level_1: Get the data and store it in the database. The data comes from web pages, online books, documents, email, and newsgroups. The main body of each record is stored in a blob field. Five other fields hold where the record came from, the subject of the record, who the author is, an auto ID, and one extra field.
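
A minimal sketch of what such a Level_1 table could look like, using Python's built-in sqlite3 module. The post does not say which database is used, and the table name, column names, and the helpers open_store and store_record are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one blob column plus the five descriptive fields
# listed in Level_1. All names here are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS records (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,  -- auto ID field
    source  TEXT,   -- where the record came from
    subject TEXT,   -- subject of the record
    author  TEXT,   -- who the author is
    extra   TEXT,   -- the extra field
    body    BLOB    -- main section of data
)
"""

def open_store(path=":memory:"):
    """Open (or create) the store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def store_record(conn, source, subject, author, body, extra=None):
    """Insert one record and return the auto-generated ID."""
    cur = conn.execute(
        "INSERT INTO records (source, subject, author, extra, body) "
        "VALUES (?, ?, ?, ?, ?)",
        (source, subject, author, extra, body),
    )
    conn.commit()
    return cur.lastrowid
```

At this level every descriptive field is still text; only the ID is numeric.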

Level_2: The records are cleaned up: duplicates are removed, unwanted characters are stripped (e.g. converting %0D%0A to \n and removing extra spaces at the end of the record), and the records are organized by sending each one to its appropriate database (e.g. newsgroups, html, email, doc) and then to its appropriate table (e.g. howto, automobiles, beverages).
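
The cleanup part of Level_2 could be sketched roughly as below. This is an assumption about how the step might work, not the actual program: clean_record and drop_duplicates are hypothetical names, and urllib.parse.unquote decodes every percent-escape in the text, not only %0D%0A. Routing records to databases and tables is left out.

```python
from urllib.parse import unquote

def clean_record(text):
    """Decode percent-escapes, normalise line endings, trim the record end."""
    decoded = unquote(text)                      # "%0D%0A" becomes "\r\n"
    decoded = decoded.replace("\r\n", "\n").replace("\r", "\n")
    return decoded.rstrip()                      # drop trailing spaces/newlines

def drop_duplicates(records):
    """Clean each record and keep only the first copy of each, in order."""
    seen = set()
    unique = []
    for rec in records:
        cleaned = clean_record(rec)
        if cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    return unique
```

Duplicates are compared after cleaning, so two copies that differ only in line endings or trailing spaces count as the same record.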

Level_3: All fields except the blob field are converted to numbers, so the table goes from all text + 1 blob to all int + 1 blob. For example, every web page that came from www.john.com has #55654 in place of www.john.com. The string "www.john.com" is stored in the "from_web" table, which is just a list of domain names with an auto ID field. In this case, when "www.john.com" was added to the from_web table, the ID generated was #55654.
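
The from_web lookup described above might look something like this sketch, again assuming sqlite3. The table name from_web is from the post; the column names and the helpers open_lookup and domain_id are hypothetical.

```python
import sqlite3

def open_lookup(path=":memory:"):
    """Create the from_web table: a list of domain names with an auto ID."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS from_web ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "domain TEXT UNIQUE NOT NULL)"
    )
    return conn

def domain_id(conn, domain):
    """Return the ID for a domain, adding it to from_web on first sight."""
    # INSERT OR IGNORE does nothing if the domain is already listed.
    conn.execute("INSERT OR IGNORE INTO from_web (domain) VALUES (?)", (domain,))
    row = conn.execute(
        "SELECT id FROM from_web WHERE domain = ?", (domain,)
    ).fetchone()
    return row[0]
```

Each record then stores the integer returned by domain_id in place of the domain string, and the same pattern would apply to the subject and author fields.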



Level_4: This is where I am now, and it takes a LONG TIME (many, many days) to do this. Before I run the parsing program, I was wondering whether anyone has heard of this, tried it, or knows a better way to do it.

I want to create a dictionary database to hold words and phrases. I want to take the blob fields and change every word and/or phrase into a number, store each word and its number in table(s), and remove all extra characters (e.g. spaces between words). So I will take each blob field, convert all words/phrases to numbers, and save those numbers into a comma-delimited blob field for later use. Also note that the data moves into a new database going from Level_3 to Level_4. The result is 5 int fields + 1 blob field, where the blob is a comma-delimited list of numbers.
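
A rough in-memory sketch of the word-to-number idea, under several assumptions: the class WordDictionary and the tokenizing regex are hypothetical, phrases are not handled (only single words and punctuation marks), and the dictionary lives in a Python dict rather than in database tables.

```python
import re

# Assumption: a token is a run of word characters or a single punctuation
# mark. Whitespace between tokens is dropped, as described in the post.
TOKEN = re.compile(r"\w+|[^\w\s]")

class WordDictionary:
    def __init__(self):
        self.ids = {}     # word -> number
        self.words = []   # number -> word

    def encode(self, text):
        """Turn text into a comma-delimited string of word numbers."""
        out = []
        for tok in TOKEN.findall(text):
            if tok not in self.ids:
                self.ids[tok] = len(self.words)   # next free number
                self.words.append(tok)
            out.append(str(self.ids[tok]))
        return ",".join(out)

    def decode(self, blob):
        """Rebuild text from the numbers, joining tokens with one space."""
        if not blob:
            return ""
        return " ".join(self.words[int(n)] for n in blob.split(","))
```

With a fresh dictionary, encoding "the cat saw the dog" gives "0,1,2,0,3". Decoding is only approximate: the original spacing is gone, so tokens are rejoined with single spaces, punctuation included.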

WHY am I doing this?

  1. I've been doing this sort of thing for years in my spare time, just playing around and having fun.
  2. My goal is to take an infinite amount of data (from anywhere) and store it in a finite storage area. For example, with every LEVEL_N I shrink the storage area required. I think of it as compressing the data. I also remove useless characters very carefully.
  3. SPPPEEEEEDDDDDDD!!!!!! It just keeps getting faster with every Level I move up to, especially for transmitting the data. Having the dictionary at both ends is absolutely amazing! When I started planning this out, I figured the dictionary tables would be enormous, but they are tiny compared to what I imagined. Take web pages, for example: the html tags are almost all redundant across the vast amount of data. User names and email addresses are repeated over and over again in email and newsgroup messages.
  4. Compression programs are very slow. It must be all the calculations they do on the data to put it in and to take it out.
  5. My personal favorite! When I want to look up information, it is there and as fast as can be.

Any thoughts or ideas?

Thanks
John Jacques
jjacwqu1_at_berkshire.rr.com

Received on Sat Jul 29 2000 - 00:00:00 CEST
