Large Scale DB Design
Date: 26 Jul 2001 13:57:56 -0700
Message-ID: <541e251e.0107261257.3f8fd1b2_at_posting.google.com>
I'm looking for research materials (books, journals, whitepapers, even
classes, etc.) that deal with table and data design for applications
ranging from data mining to trend analysis, decision support, and
large-scale data comparison, along with the query logic behind mining the
data.
The key here is that it has to work within the constructs of a standard
SQL environment - we don't have Oracle and we won't be getting Oracle
anytime soon, so doing it within, say, PL/SQL isn't applicable to me.
Doing it within standard SQL92 with basic table structures would be
(we're presently using MySQL - http://www.mysql.com).
I would consider myself advanced to extremely proficient in standard
relational theory. However, I've found that all of the texts I've tracked
down either cover only the basics of relational design or are too
theoretical to ever be applicable to real-world applications.
The main question I've failed to see answered in the literature I've found
to date is: given large-scale applications such as the ones I'll briefly
describe below, how should you structure your data within tables, rows,
and columns so as to maximize the effectiveness and efficiency of your
queries? I'm mainly after reverse-indexing techniques (i.e., I know how to
represent raw data within the confines of tables; getting it back out
effectively is the more troubling part).
Some applications that I'd like to find some literature on include:
Large Scale Data Comparisons: taking thousands of rows of related data,
comparing them with thousands of other rows of related data, and
formulating a meaningful score of their similarity. An example would be
the manufacturing flow of procedures that widgets see from gestation to
release: compare the hundreds or thousands of little events that occurred
to widget A with those that occurred to widget B, and form a score of how
similar the two are, say versus widget C.
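To make that concrete, here's a rough sketch of the kind of schema and
query I have in mind - table and column names are invented purely for
illustration, and the score is deliberately crude:

    -- One row per processing event a widget sees (hypothetical schema)
    CREATE TABLE widget_event (
        widget_id   INTEGER      NOT NULL,
        step_code   VARCHAR(32)  NOT NULL,   -- e.g. 'ETCH', 'BAKE', 'TEST-7'
        step_seq    INTEGER      NOT NULL,   -- position in the flow
        PRIMARY KEY (widget_id, step_seq)
    );
    CREATE INDEX widget_event_step ON widget_event (step_code, widget_id);

    -- Count the distinct steps widgets 1001 and 1002 have in common, as a
    -- crude similarity score; a self-join like this is the obvious
    -- approach, but it gets expensive when each widget has thousands of
    -- events.
    SELECT COUNT(DISTINCT a.step_code) AS shared_steps
      FROM widget_event a, widget_event b
     WHERE a.widget_id = 1001
       AND b.widget_id = 1002
       AND a.step_code = b.step_code;

The question is whether there's a better way to lay the data out so that
scoring one widget against many others doesn't degenerate into repeated
self-joins of this kind.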
Trend analysis: taking normalized data and developing algorithms, and more
importantly queryable table structures, to extract trends in the data.
Again, an example would be looking at the test data for widgets A, B, and
C and detecting trends in the test data versus the flow data, so as to
draw logical conclusions about causation.
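Again, just to show the shape of the problem, a sketch (names invented) of
the sort of aggregate query I end up writing today:

    -- Hypothetical test-result table
    CREATE TABLE test_result (
        widget_id   INTEGER      NOT NULL,
        test_name   VARCHAR(32)  NOT NULL,
        tested_on   DATE         NOT NULL,
        score       DECIMAL(8,3) NOT NULL
    );
    CREATE INDEX test_result_name_date ON test_result (test_name, tested_on);

    -- Daily averages for one test, to eyeball a trend; relating this back
    -- to the flow data (did widgets that saw step X trend differently?) is
    -- where the table design gets harder.
    SELECT tested_on, AVG(score) AS avg_score, COUNT(*) AS n
      FROM test_result
     WHERE test_name = 'LEAK-TEST'
     GROUP BY tested_on
     ORDER BY tested_on;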
Decision Support: based on causation learned from the results, being able
to recommend modifications to the flows in order to deliberately produce
changes in the test results.
Effectively Handling Large Amounts of Data: beyond the simple B-tree,
indexing massive amounts of data, whether textual (keyword indexing) or
statistical - how to segregate, normalize, and maximize the queryability
of anything from large numbers of X,Y relationships to more complex
matrices.
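For the keyword case, the only database-level structure I've come up with
so far is a hand-rolled inverted index along these lines (again, the names
are just for illustration):

    -- Inverted index: one row per (keyword, document) pair, populated by
    -- the application when a document is loaded.
    CREATE TABLE doc_keyword (
        keyword   VARCHAR(64) NOT NULL,
        doc_id    INTEGER     NOT NULL,
        freq      INTEGER     NOT NULL,    -- occurrences of keyword in doc
        PRIMARY KEY (keyword, doc_id)
    );

    -- Documents containing both terms, ranked by combined frequency.
    SELECT a.doc_id, a.freq + b.freq AS combined_freq
      FROM doc_keyword a, doc_keyword b
     WHERE a.keyword = 'widget'
       AND b.keyword = 'failure'
       AND a.doc_id = b.doc_id
     ORDER BY combined_freq DESC;

That works, but I don't know whether it's anywhere near the
state of the art for structuring this kind of data in plain SQL.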
Most of these things can be done at the application level, but I've found
that to be particularly troublesome and unscalable - mainly because it
requires full, and sometimes multiple, table scans to pull out all of the
data before it can be modeled in the application. There has to be a
smarter way to model the data at the database level before the application
ever has to see it.
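The closest I've come is maintaining summary tables by hand, something
like the sketch below (built against the hypothetical test_result table
from earlier), rebuilt periodically with INSERT ... SELECT rather than
computed in the application:

    -- Pre-aggregated summary kept alongside the detail data, so queries
    -- that used to scan all of test_result can hit a few hundred rows.
    CREATE TABLE test_summary (
        test_name   VARCHAR(32)  NOT NULL,
        tested_on   DATE         NOT NULL,
        n_results   INTEGER      NOT NULL,
        avg_score   DECIMAL(8,3) NOT NULL,
        PRIMARY KEY (test_name, tested_on)
    );

    -- Periodic rebuild (no triggers here, so the application or a cron
    -- job has to keep it current):
    DELETE FROM test_summary;
    INSERT INTO test_summary (test_name, tested_on, n_results, avg_score)
    SELECT test_name, tested_on, COUNT(*), AVG(score)
      FROM test_result
     GROUP BY test_name, tested_on;

That helps, but it feels ad hoc, and I'd like to read about how this is
supposed to be done properly.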
Obviously, there are no simple answers to these questions. But I would
presume there exists some theoretical backdrop that offers insight into
the not-so-abstract question: given this problem, this is how you
construct your tables, the data within them, your keys, and ultimately the
query logic to extract it all back out at the least cost to the database.
Any help would be much appreciated!