Re: Rants. Difficulty to learn ETL tools?
Date: 16 May 2007 10:00:21 -0700
Message-ID: <1179334821.527214.218250_at_p77g2000hsh.googlegroups.com>
[Quoted] Hi Araron,
"What can it do that you can't do in PL/SQL? Well, some nice things
are
we can parallelise the processing of large numbers of fact records
and
we can put the dimension tables in memory mapped IO and access them
in
a shared fashion using binary search......this is 10x faster than
doing the same in PL/SQL at runtime.... "
The most cpu intensive task in building a dimensional model is the tranlsation of real keys into integer keys...the attribution process.
Doing this work in PL/SQL or inside the database as ELT type tools do consumes much more (10x or more) processing power than is required if you do it more efficiently.....and since this is the most expensive bit, it's worth speeding up in large accounts...
Yes, of course, you can run multiple pl/sql statements at the same time to do this work if you would like.
However, one thing we have been doing for about 7 years how is putting a unique key on the front of fact table records. And when processing a single set of transactions that have been split into different files to enable parallelism you have to somehow allow for the allocation of keys for these unqiue ids on the front of the fact records...this means a semaphor at the file level, locks etc, so the pl/sql to handle that becomes more difficult...especially when you consider you cannot just allocate one key at a time and do the update because it slows things down too much. You must be able to allocate blocks of keys.......the ETL tools have these features in place because they are so necessary.
Many people say to me, well, just let the database allocate a sequence number....but in most databases these are slow....and they also give you trouble if you want to delete and reload data because it is often not so easy to reset the sequence number so that the delete/reloaded records get the same sequence number they used to have......
We also do things like...when we need to update records it is often faster to delete them and then to use the loader to load the record rather than do an update. In these cases the sequence numbers must be respected so using the database to allocate them does not work if you want to do a delete/load rather than an update.
And yes, PL/SQL can do a good part of all this if you want to write it...but it is far from trivial......and it will run 10x or more slower than doing it properly....and that is well worth the effort in most sizable accounts.....plenty of DW projects fail because the ETL subsystem is too slow, too expensive, and too difficult to change.
Best Regards
Peter
On May 4, 10:13 pm, "Aaron Kempf" <ake..._at_dol.wa.gov> wrote:
> what you can't run multiple PL/SQL statements at the same time in Oracle?
>
> ROFL
>
> "Peter Nolan" <p..._at_peternolan.com> wrote in message
>
> news:1177319493.688345.26010_at_o5g2000hsb.googlegroups.com...
>
>
>
> > Hi DBA...
> > your append is exactly what I have been talking about since the mid
> > 90s as well....making ETL easier......
>
> > We have invented the future and the future of ETL is 'generated ETL
> > from the data mapping workbook'. (www.instantbi.com)
>
> > You have to do your data mapping somehow, and excel is how most people
> > do it, the laggards are still using word......
>
> > Since you already have to do your data mapping, and if you are
> > sensible you do it in excel, it makes the most sense to generate the
> > etl subsystem directly from the workbook as well as publish the
> > workbook via the web so that authorised people can see any and all
> > details of the ETL subsystem.
>
> > No ETL subsystem will ever be any easier to develop and deploy than
> > what we have invented because no ETL subsystem will ever be easier to
> > build than a direct generate from the mapping workbook.....this is the
> > 'end game' for development of ETL subsystems.
>
> > Why use such a tool rather than PL/SQL.....well, because it is
> > generated directly from the workbook we have 'done away with' the ETL
> > programmer.....and that is a good thing. I have done far too much ETL
> > programming over the years and I want to get rid of that complete
> > waste of time....
>
> > What can it do that you can't do in PL/SQL? Well, some nice things are
> > we can parallelise the processing of large numbers of fact records and
> > we can put the dimension tables in memory mapped IO and access them in
> > a shared fashion using binary search......this is 10x faster than
> > doing the same in PL/SQL at runtime....
>
> > Also, we have intelligence built into it that means you can do things
> > like add new summaries without any code changes, you can add new keys
> > to fact tables without any code changes, you can make lookups into
> > dimension tables to get new keys FAR more complex than possible than
> > via normal sql statements.
>
> > In short, we have eliminated all the 'coding' effort that is required
> > when writing you ETL subsystem no matter what the tool.....and we have
> > done it in such a way that it is as scalable as the operating system
> > underneath....
>
> > Another BIG feature is that the ETL subsystem is portable across
> > databases and operating systems....something that PL/SQL and SSIS are
> > obvioulsy not.....this means that if some better/faster database comes
> > out we can move to it......not something that MSFT would like to her
> > and this is their newsgroup.....but it has always been a belief of
> > mine that the ETL subsystem should be fully portable across OS and
> > database.....and surprisingly, this is NOT the case with any of the
> > ETL tools that I have seen....they all require quite some effort to
> > move them.....thereby creating a cost to adopt a faster/cheaper/better
> > database.
>
> > If you are keenly interested, feel free to check my personal site
> >www.peternolan.comwhere I have published vast amounts of code and
> > documentation on ETL subsystems.
>
> > Best Regards
>
> > Peter
> >www.peternolan.com- Hide quoted text -
>
> - Show quoted text -
Received on Wed May 16 2007 - 19:00:21 CEST