Re: algorithms for selecting on non-key/non-indexed field

From: dfdf <adamsch1NOadSPAM_at_yahoo.com.invalid>
Date: 2000/07/01
Message-ID: <00822ad8.72ac498c_at_usw-ex0103-086.remarq.com>#1/1


Marc Tardif <intmktg_at_Gloria.CAM.ORG> wrote:
>> If your access method is of type heap, then in order to avoid
>> linear scans, you should create seperate, appropriate indexes
 on
>> the fields you will commonly search on. In my experiment, I
>> created a heap file of about 1 million rows. Doing a linear
>> scan (where each row has about 64 bytes of *my* data) to find
 a
>> particular row based on a non-indexed field takes approx 18
>> seconds.
>
>There are various ways of representing a heap, such as a tree
 or an array
>which lend to very different implementations. As for a linear
 scan, be it
>for a straight text file or a database, there are many
 algorithms for this
>seemingly simple task: brute force, Knuth-Morris-Pratt, Aho-
 Corasick,
>Shift-Or, etc. Furthermore, your first paragraph mentions
Thanks. I'm not well read in theory outside the courses I had to take in college. I've put these aside and will do some more research.

Berkeley DB
>which only supports key/data pairs whereas your second
 paragraph mentions
>"fields" which are only supported by more advanced relational
databases.
Sorry: To be precise, my "table" consisted of rows that internally are represented by a C-structure that contains a couple of integers. One of the ints in the structure was what BDB was using as an index. My selection criteria was off the other int (which I referred to as a field), resulting in my brute force method for searching.

>As you can understand, I am very confused at this point. Be
 more precise.
>
>> In practice, mysql is probably doing something clever that I'm
>> not. I would like any comments on the clever bits in
>> particular. Looking at the source for both mysql and msql, it
>> appears that they create in-memory indexes behind the scenes
 to
>> help you out, or possibly gather statistical information on
>> field contents to get an approximation when doing the select.
>
>Congrats on being able to understand the code of both mysql and
 msql. You
>must've noticed that msql abuses memory usage quite a bit,
 making it
>invariably quite fast but only until you start running out of
 memory. If
>you want to use Berkeley DB, I'd advise against trying to
 simulate
>relational databases (if mysql and msql qualify), but rather to
>concentrate on the task at hand and using the proper access
 methods for
>the proper job.

Interesting commentary. I would lean toward a similar argument, however to avoid starting a flame war with someone who is really religious about mysql/msql, I'll just focus on your comment about BDB not being proper. I was looking at the source for mysql, and found that it uses a set of file primitives that closely match BDB. I also took a look at my old school text (Fundamentals of Database Systems, Elmarsi/Navathe) and they mention a similar set of file primitives as a basis.

My first question: Why would BDB not be appropriate? Is this an architectural issue - say sophisticated buffering or the API is too generic ? It would seem that for a relational database, one would need a set of primitives for files consisting of Fixed-  Records, Variable-length Records, and B-Tree's in order to provide a solution to your queries.

>
>> This is somewhat of a badly formed question, so I understand
>> that I may not get any responses!
>
>Since you're so keen on databases, consider using a linked-list
 of
>questions instead of your previous randomness. Start from the
 root node
>with a simple question which you are more likely to have
 answered. Then,
>progress down your list of questions which could change as you
 receive new
>answers... that could qualify as a self-adjusting data
 structure, grounds
>for a long thread of discussion.

There ya go. I'll start out with one question. Technically though wouldn't it be a tree instead of a list? =)

>
>
>


Got questions? Get answers over the phone at Keen.com. Up to 100 minutes free!
http://www.keen.com Received on Sat Jul 01 2000 - 00:00:00 CEST

Original text of this message