Re: Large Scale DB Design
Date: Fri, 27 Jul 2001 12:18:29 GMT
Message-ID: <p_c87.445$xA3.389149_at_typhoon.jacksonville.mediaone.net>
sounds like you need a better understanding of a field called "algorithms", aka "analysis of algorithms". searching and comparing are well studied and well understood areas of computer science. one can quantify the performance of any algorithm with a high degree of accuracy, provided one has a proper mathematical model of the algorithm expressed in terms of computational complexity.
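to make that concrete, here is a minimal sketch (python, purely illustrative; the data and function names are mine, not from any product) that counts the work a linear scan and a binary search do over the same sorted data; the counts track the predicted O(n) and O(log n) growth:

    import random

    def linear_search(data, target):
        """scan every element; comparisons grow linearly with len(data)."""
        comparisons = 0
        for i, value in enumerate(data):
            comparisons += 1
            if value == target:
                return i, comparisons
        return -1, comparisons

    def binary_search(data, target):
        """halve the range each step; comparisons grow with log2(len(data))."""
        comparisons = 0
        lo, hi = 0, len(data) - 1
        while lo <= hi:
            comparisons += 1
            mid = (lo + hi) // 2
            if data[mid] == target:
                return mid, comparisons
            if data[mid] < target:
                lo = mid + 1
            else:
                hi = mid - 1
        return -1, comparisons

    data = sorted(random.sample(range(10_000_000), 1_000_000))
    target = data[-1]   # worst case for the linear scan
    print("linear:", linear_search(data, target)[1], "comparisons")
    print("binary:", binary_search(data, target)[1], "comparisons")

the absolute numbers vary run to run, but the shape of the growth is what the complexity analysis predicts, and that shape survives any change of dbms or storage format.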
the two factors are time (processing) and space (storage). while there are some proprietary algorithms used in providing working solutions to such problems, there are well understood and well publicized solutions as well. furthermore, most of these problems share the same solution set, since many of them are either isomorphic or homomorphic to one another.
in all cases it comes down to a tradeoff between storage and speed: to reduce processing time, you increase storage requirements; to reduce storage requirements, you increase processing time. at the implementation level, relational vs hierarchical vs network is of little consequence, since the mathematical model dictates the order of magnitude of the work required to solve the problem.
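as a toy illustration of that tradeoff (again a python sketch; the widget/event rows are invented), you can either rescan the rows on every lookup, spending time but no extra space, or spend memory once on an index so that every later lookup is constant time:

    from collections import defaultdict

    # invented sample data: (widget_id, step, result) event rows
    rows = [
        ("A", "etch", "pass"),
        ("B", "etch", "fail"),
        ("A", "coat", "pass"),
        ("C", "etch", "pass"),
    ]

    def lookup_by_scan(rows, widget_id):
        """no extra storage, but O(n) work on every single call."""
        return [r for r in rows if r[0] == widget_id]

    # trade storage for speed: O(n) extra space to build the index once ...
    index = defaultdict(list)
    for r in rows:
        index[r[0]].append(r)

    def lookup_by_index(widget_id):
        """... and then each lookup is O(1) on average."""
        return index.get(widget_id, [])

    print(lookup_by_scan(rows, "A"))
    print(lookup_by_index("A"))

every real system, whether a b-tree in mysql or a hand-rolled hash file, is a more industrial version of that same bargain.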
i expect if you focus your search on "analysis of algorithms", refined to
searching and sorting, you will find more solutions. you might also want to
look at some COTS products designed specifically to solve problems in your
specific problem space. keep in mind that relational is not the optimal
solution in many cases; sometimes flat file structures can outperform
relational structures by orders of magnitude. it just depends on the nature
of the specific problem, and you have enumerated several classes of problems,
each of which has its own optimal approach.
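to give a feel for why a flat file can win, here is a sketch under made-up assumptions (fixed-width records, a single integer key, data loaded once and queried many times; not a drop-in replacement for your schema): keep the records sorted on disk and binary-search the file directly, so a lookup costs O(log n) seeks with none of the per-row overhead a general-purpose sql engine pays.

    import os
    import struct
    import tempfile

    # fixed-width records: 8-byte big-endian integer key + 32-byte payload,
    # kept sorted by key so a lookup is a plain binary search over seeks.
    RECORD = struct.Struct(">q32s")

    def write_flat_file(path, records):
        """records must already be sorted by key for lookup() to work."""
        with open(path, "wb") as f:
            for key, payload in records:
                f.write(RECORD.pack(key, payload.ljust(32)[:32].encode()))

    def lookup(path, target_key):
        """binary search directly on the file: O(log n) seeks, no server, no parser."""
        lo, hi = 0, os.path.getsize(path) // RECORD.size - 1
        with open(path, "rb") as f:
            while lo <= hi:
                mid = (lo + hi) // 2
                f.seek(mid * RECORD.size)
                key, payload = RECORD.unpack(f.read(RECORD.size))
                if key == target_key:
                    return payload.rstrip(b" ").decode()
                if key < target_key:
                    lo = mid + 1
                else:
                    hi = mid - 1
        return None

    path = os.path.join(tempfile.gettempdir(), "widgets.dat")
    write_flat_file(path, [(i, "widget-%d" % i) for i in range(100_000)])
    print(lookup(path, 4242))   # -> "widget-4242"

whether that actually beats mysql for your workload depends on update rates and record shape; the point is only that the access pattern, not the data model, dictates the cost.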
"Seth Northrop" <seth_at_northrops.com> wrote in message
news:541e251e.0107261257.3f8fd1b2_at_posting.google.com...
> I'm looking for research materials (books, journals, whitepapers, even
> classes, etc.) which deal with the topic of table and data construction for
> various applications ranging from data mining to trend analysis to
> decision support to large-scale data comparisons, along with the query
> logic behind mining the data.
>
> The key here is that it has to be taken from the constructs of a standard
> SQL environment - we don't have Oracle and we won't be getting Oracle
> anytime soon, so, doing it say within PL/SQL isn't applicable to
> me. Doing it within say standard SQL92 with basic table structure would
> be (we're presently using MySQL - http://www.mysql.com).
>
> I would consider myself advanced to extremely proficient in standard
> relational theory; however, I've found that all of the texts I've
> tracked down either cater to the above basics of relational design
> or are too theoretically based to ever be applicable within real-world
> applications.
>
> The main question that I've failed to see answered within the literature
> I've found to date is: given large-scale applications such as the ones
> I'll briefly explain below, how should you structure your data within
> tables, rows, and columns so as to maximize your queries' effectiveness
> and efficiency? Mainly reverse-indexing techniques (i.e., I know how to
> represent raw data within the confines of tables, but getting it out
> effectively becomes more troubling).
>
> Some applications that I'd like to find some literature on include:
>
> Large Scale Data Comparisons: Taking thousands of rows of related data,
> effectively comparing those rows with thousands of other rows of related
> data, and formulating a meaningful score of the similarities. An example
> of this is a manufacturing flow of procedures that widgets see from
> gestation to release, the idea being to compare the hundreds or thousands
> of little events that occurred to widget A with those that occurred to
> widget B and forming a score of similarity, say, versus widget C.
>
> Trend analysis: I.e., taking normalized data and developing algorithms and,
> more importantly, queryable table structures to extract trends in the
> data. Again, an example of this is looking at test data for widgets A,
> B, and C and detecting trends in the test data versus the flow data so as
> to extract logical conclusions about causation.
>
> Decision Support: Based on learned causation amongst results of data
> being able to recommend modifications to flows to artificially create
> alterations to test results.
>
> Effectively Handling Large Amounts of Data: Beyond the simple B-tree,
> indexing massive amounts of data, whether textual (keyword indexing) or
> statistical - how to segregate, normalize, and maximize the queryability of
> anything from large amounts of X,Y relationships to more complex matrices.
>
> Most of these things can occur at the application level, but I've found
> that to be particularly troublesome and unscalable - mainly because they
> require full and sometimes multiple table scans to extract all of the data
> so as to facilitate modeling within the application. There has to be a
> smarter way to model the data at the database level before the application
> has to see it.
>
> Obviously, there are no simple answers to these questions. But I would
> presume that there exists some theoretical backdrop which provides some
> insight into the not-so-abstract question of: given this problem, this is
> how you construct your tables, the data therein, and your key and ultimate
> query logic to extract it back out with the least amount of cost to the
> database.
>
> Any help would be much appreciated!
> Seth