Large Scale DB Design

From: Seth Northrop <seth_at_northrops.com>
Date: 26 Jul 2001 13:57:56 -0700
Message-ID: <541e251e.0107261257.3f8fd1b2_at_posting.google.com>


I'm looking for research materials (books, journals, whitepapers, even classes, etc.) that deal with table and data design for applications ranging from data mining to trend analysis, decision support, and large-scale data comparisons, along with the query logic behind mining the data.

The key here is that it has to work within the constructs of a standard SQL environment - we don't have Oracle and we won't be getting Oracle anytime soon, so doing it in, say, PL/SQL isn't applicable to me. Doing it in standard SQL92 with basic table structures would be (we're presently using MySQL - http://www.mysql.com).

I would consider myself advanced to extremely proficient in standard relational theory; however, I've found that all of the texts I've tracked down either stop at the basics of relational design or are too theoretical to ever be applicable to real-world applications.

The main question that I've failed to see answered in the literature I've found to date is: given large-scale applications such as the ones I'll briefly describe below, how should you structure your data within tables, rows, and columns so as to maximize the effectiveness and efficiency of your queries? I'm thinking mainly of reverse-indexing techniques (i.e., I know how to represent raw data within the confines of tables; getting it back out effectively is the more troubling part).
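
To give a flavor of the raw data I mean, here is a rough, hypothetical sketch of the kind of tables I'm working with (all table and column names are invented purely for illustration); the examples under each application below assume something like this:

    -- Hypothetical schema, just to show the shape of the raw flow/test data.
    CREATE TABLE flow_event (
        widget_id   INTEGER     NOT NULL,  -- which widget saw the event
        step_seq    INTEGER     NOT NULL,  -- position of the event in the flow
        step_code   VARCHAR(32) NOT NULL,  -- what happened at that step
        occurred_at TIMESTAMP   NOT NULL,
        PRIMARY KEY (widget_id, step_seq)
    );

    CREATE TABLE test_result (
        widget_id   INTEGER       NOT NULL,
        test_code   VARCHAR(32)   NOT NULL,  -- which measurement was taken
        test_value  DECIMAL(12,4),           -- the measured value
        PRIMARY KEY (widget_id, test_code)
    );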

Some applications that I'd like to find some literature on include:

Large Scale Data Comparisons: Taking thousands of rows of related data, comparing them with thousands of other rows of related data, and producing a meaningful score of the similarities. An example of this is a manufacturing flow of procedures that widgets see from gestation to release: the idea is to compare the hundreds or thousands of little events that occurred to widget A with those that occurred to widget B, and to form a score of their similarity versus, say, widget C.
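
To make that concrete, the naive approach I'd take today is a self-join over the raw event table (using the hypothetical flow_event sketch above), which I suspect falls over as the row counts grow - what I'd like to find is literature on structuring the data so this kind of comparison stays cheap:

    -- Rough similarity score: count the (position, step) pairs two widgets share.
    SELECT a.widget_id AS widget_a,
           b.widget_id AS widget_b,
           COUNT(*)    AS shared_steps
      FROM flow_event a
      JOIN flow_event b
        ON a.step_seq  = b.step_seq
       AND a.step_code = b.step_code
       AND a.widget_id < b.widget_id   -- each pair once, never a widget against itself
     GROUP BY a.widget_id, b.widget_id
     ORDER BY shared_steps DESC;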

Trend analysis: That is, taking normalized data and developing algorithms and, more importantly, queryable table structures to extract trends from the data. Again, an example of this is looking at test data for widgets A, B, and C and detecting trends in the test data versus the flow data so as to draw logical conclusions about causation.
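
What I have in mind is roughly the query below (again against the hypothetical tables sketched above) - a plain GROUP BY that cross-tabulates test results by the flow steps each widget saw; what I don't know is how to structure things once this has to run over millions of rows:

    -- Hypothetical trend query: does a given flow step shift a given test result?
    SELECT f.step_code,
           t.test_code,
           COUNT(*)          AS n_widgets,
           AVG(t.test_value) AS avg_value,
           MIN(t.test_value) AS min_value,
           MAX(t.test_value) AS max_value
      FROM flow_event f
      JOIN test_result t ON t.widget_id = f.widget_id
     GROUP BY f.step_code, t.test_code;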

Decision Support: Based on causation learned from the results, being able to recommend modifications to flows so as to deliberately produce changes in test results.
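
A crude sketch of what I mean, still against the made-up tables above (the 'THICKNESS' test and the 9.5 spec limit are invented): rank flow steps by how often widgets that saw them ended up out of spec, and use that ranking to suggest which step to change:

    -- Hypothetical decision-support query: which flow steps are associated
    -- with the highest out-of-spec rate for one particular test?
    SELECT f.step_code,
           COUNT(*) AS n_widgets,
           AVG(CASE WHEN t.test_value < 9.5 THEN 1.0 ELSE 0.0 END) AS out_of_spec_rate
      FROM flow_event f
      JOIN test_result t ON t.widget_id = f.widget_id
     WHERE t.test_code = 'THICKNESS'
     GROUP BY f.step_code
     ORDER BY out_of_spec_rate DESC;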

Effectively Handling Large Amounts of Data: Beyond the simple B-tree, indexing large amounts of data, whether textual (keyword indexing) or statistical - how to segregate, normalize, and maximize the queryability of anything from large numbers of X,Y relationships to more complex matrices.
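
For the keyword case, the best I've come up with on my own is a hand-rolled inverted index in plain tables, something like the sketch below (names invented); what I'm after is literature on when this kind of structure works and when it doesn't:

    -- Hypothetical inverted-index tables for keyword lookups over free text.
    CREATE TABLE keyword (
        keyword_id INTEGER     NOT NULL PRIMARY KEY,
        word       VARCHAR(64) NOT NULL
    );

    CREATE TABLE document_keyword (
        keyword_id  INTEGER NOT NULL,  -- references keyword.keyword_id
        document_id INTEGER NOT NULL,  -- references the row holding the full text
        occurrences INTEGER NOT NULL,  -- simple term frequency for crude ranking
        PRIMARY KEY (keyword_id, document_id)
    );

    -- Documents mentioning a given word (example term), most occurrences first.
    SELECT dk.document_id, dk.occurrences
      FROM keyword k
      JOIN document_keyword dk ON dk.keyword_id = k.keyword_id
     WHERE k.word = 'corrosion'
     ORDER BY dk.occurrences DESC;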

Most of these things can be done at the application level, but I've found that to be particularly troublesome and unscalable - mainly because it requires full, and sometimes multiple, table scans to extract all of the data needed to build the model within the application. There has to be a smarter way to model the data at the database level before the application has to see it.
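
The only database-level trick I've found so far is to maintain summary tables myself, roughly as below (again hypothetical), refreshed periodically so the application never has to scan the detail tables - but that feels ad hoc, which is why I'm hoping there's literature on doing this systematically:

    -- Hypothetical summary table, rebuilt periodically (e.g. nightly), so the
    -- application reads pre-aggregated trends instead of scanning detail tables.
    CREATE TABLE step_test_summary (
        step_code VARCHAR(32)   NOT NULL,
        test_code VARCHAR(32)   NOT NULL,
        n_widgets INTEGER       NOT NULL,
        avg_value DECIMAL(12,4),
        PRIMARY KEY (step_code, test_code)
    );

    DELETE FROM step_test_summary;
    INSERT INTO step_test_summary (step_code, test_code, n_widgets, avg_value)
    SELECT f.step_code, t.test_code, COUNT(*), AVG(t.test_value)
      FROM flow_event f
      JOIN test_result t ON t.widget_id = f.widget_id
     GROUP BY f.step_code, t.test_code;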

Obviously, there are no simple answers to these questions. But I would presume there exists some theoretical backdrop that provides insight into the not-so-abstract question of: given this problem, this is how you construct your tables, the data therein, your keys, and ultimately the query logic to extract it back out at the least cost to the database.

Any help would be much appreciated!
Seth
