Re: algorithms for selecting on non-key/non-indexed field
Date: 2000/07/01
Message-ID: <Pine.LNX.4.10.10007012123180.25051-100000_at_Gloria.CAM.ORG>#1/1
> If your access method is of type heap, then in order to avoid
> linear scans, you should create seperate, appropriate indexes on
> the fields you will commonly search on. In my experiment, I
> created a heap file of about 1 million rows. Doing a linear
> scan (where each row has about 64 bytes of *my* data) to find a
> particular row based on a non-indexed field takes approx 18
> seconds.
There are various ways of representing a heap, such as a tree or an array which lend to very different implementations. As for a linear scan, be it for a straight text file or a database, there are many algorithms for this seemingly simple task: brute force, Knuth-Morris-Pratt, Aho-Corasick, Shift-Or, etc. Furthermore, your first paragraph mentions Berkeley DB which only supports key/data pairs whereas your second paragraph mentions "fields" which are only supported by more advanced relational databases. As you can understand, I am very confused at this point. Be more precise.
> In practice, mysql is probably doing something clever that I'm
> not. I would like any comments on the clever bits in
> particular. Looking at the source for both mysql and msql, it
> appears that they create in-memory indexes behind the scenes to
> help you out, or possibly gather statistical information on
> field contents to get an approximation when doing the select.
Congrats on being able to understand the code of both mysql and msql. You must've noticed that msql abuses memory usage quite a bit, making it invariably quite fast but only until you start running out of memory. If you want to use Berkeley DB, I'd advise against trying to simulate relational databases (if mysql and msql qualify), but rather to concentrate on the task at hand and using the proper access methods for the proper job.
> This is somewhat of a badly formed question, so I understand
> that I may not get any responses!
Since you're so keen on databases, consider using a linked-list of questions instead of your previous randomness. Start from the root node with a simple question which you are more likely to have answered. Then, progress down your list of questions which could change as you receive new answers... that could qualify as a self-adjusting data structure, grounds for a long thread of discussion. Received on Sat Jul 01 2000 - 00:00:00 CEST