Re: Hashing for DISTINCT or GROUP BY in SQL

From: Cimode <cimode_at_hotmail.com>
Date: Fri, 15 Oct 2010 16:38:33 -0700 (PDT)
Message-ID: <fa9f6c6a-faf9-4cd1-8d3c-5eb685db4ed1_at_30g2000yqm.googlegroups.com>


On 15 oct, 23:36, paul c <anonym..._at_not-for-mail.invalid> wrote:
> On 15/10/2010 1:26 PM, Cimode wrote:
>
> > Your comments reminded me of what initially triggered my interest in
> > databases in the first place: the proto-index structure logic behind
> > the early VSAM systems.  At that time, the initial direct image data
> > files were called clusters.  Not much have changed since, regarding
> > database implementations.
>
> It does seem that physical notions still impede people some or all of
> the time.  When the only access methods I knew were Qsam, Bdam and Isam,
> a boss sent me to a course on the then-new Vsam.  When I returned to the
> office a week later he asked me about it and I told him it would never last.
>
> Part of the motivation for IBM's vsam had to do with the mess that their
> disk architecture had created.  What started as their big OS/360
> (lovingly referred to be some users as 'Obstacle System 360') turned
> into MFT or somesuch and then MVS.  It had a 'system catalog' which had
> nothing to do with any database principles whatsover, being just a bunch
> of adhoc conventions.  Vsam became the vehicle for a replacement
> catalog.  I believe the clusters were really intended as a way to
> aggregate several disks to give a larger 'virtual' disk.  But the
> problem remained that people still thought in terms of that disk
> 'device', no matter that is was 'virtual'.
>
> The 'disk architecture' that advented with the introduction of System
> 360 was called 'count-key-data' aka CKD.  It was interesting in that one
> could program at the peripheral level, outboard of the mainframe cpu,
> but that focus on individual devices was an insidious impediment to real
> system thinking (which I think is what is needed for real database
> thinking) in the sense that a coherent theory is needed.
>
> The craziest extreme of CKD might have been the 2321 data cell drive.
> The point to the 'K' or 'key' part of the acronym was that 'keys' were a
> distinct portion of a disk's surface and you could write what was called
> a 'channel program' which would apply only to one disk and not tie up
> the channel through which other disks were connected.  I don't remember
> tape drives having 'key' support but even though the 2321 used tape, it
> supported CKD and looked to the programmer as a disk drive.
>
> Wang had a superior kind of 'vsam', forget the name, which had
> transactions.  I learned it in about two hours, it was that well documented.
>
> In the 1970's IBM brought out a system called the S/38.  It was really
> radical, having a linear addressing scheme that merged memory and
> whatever devices were attached, so the fixation on individual devices
> was ignored.  Apparently internal politics at IBM limited the size of
> the S/38 so that it wouldn't compete with the big mainframes.  Customer
> operations staff used to need to go to IBM courses to learn what
> operating system options needed to be toggled to enable the addition of
> peripherals.  Some other companies like Burroughs had machines even
> before then that required no software changes to attach a new disk
> drive, in some cases not even down time so Burroughs didn't make any
> money charging for courses to learn how to do that.  Wasn't Burroughs
> stupid!
>
> In recent years there was quite a lot of 'buzz' in the Linux world about
> something called the Reiser FS.  It had aspects that confused logical
> with physical, which worried me.  Since the principle author has gone to
> jail it seems that development has stalled and maybe I don't need to
> worry anymore about that.
Hi paul

Thanks for elaborating about the historic context. I believe you have misunderstood the point I am trying to make. This is not about physical vs logical confusion or a database theory vs database implementation confusion. It is about defining the math the physical must satisfy to effectively support the logical.

I can conceive that VSAM is not significant for somebody observing database science from a purely logical perspective of relation operation and structural definition. But logical database science is not all database science. The history of theory behind physical implementations following the development of database science is just as significant. I suspect that it is the inability of logical theorists to conceive mathematical models that could allow implementations of higher abstraction logical principles on binary current mechanized addressing schemes that explains a part the failure of the database science as a whole in contemporary times.

For instance, logical theorists I sometime find naive when they assume no math could exists behind the algorithms of database implementations attempts. And I find them even more naive, when they believe that the lack of such math can not have an impact on the bias of how the logical database science is conceivable. As for me, I can not dissociate one from the other: though the logical(RM) dictates the intent, the physical dictates the method to respect the intent.

I consider VSAM significant in physical database implementation perspective and the lower level theory behind it, as compared to previous systems. In previous systems, seek algorithms were mostly relying on run time sorting and linear dichotomic searches. VSAM introduced a sophisticated seeking scheme based the concept of dichotomic leaf search aka *register zig zags* which is widely implemented on direct image systems. The algorithms probably inspired by fractal theory allowed an order of magnitude reduction of number of logical IO's necessary to reach a specific pre-determined value. In a sense, the clusters were not *just* a dumb stack of files, but were also the ancestors of today's ordered indexes (known as clustered indexes) frequently used on SQL systems.

The logic behind the data structure also relies on the concept of pre order presenting similarities with latest transrelational model. I believe somehow that this constitutes an evolution that can't be neglected since it allowed to conceive as possible the implementation of number of operators such as NOT EXISTS that were previously considered resource prohibitive under ISAM systems. Received on Sat Oct 16 2010 - 01:38:33 CEST

Original text of this message