
Re: Loading 300 million records

From: AlexK <alexk84_at_hotmail.com>
Date: 13 Jun 2005 10:30:58 -0700
Message-ID: <1118683858.756353.223960@g43g2000cwa.googlegroups.com>


Thanks a lot for all the info so far. First off, let me start by saying that I am a developer, not a DBA, so some of the things being mentioned are outside my area of knowledge. I did not provide additional information about the configuration of the instance because I was not aware it was needed. I might be able to get some additional info, but unfortunately I do not have access to Unix or Enterprise Manager and will not be able to get that access. If there is something I can query from the data dictionary, or somewhere else in the database that doesn't require DBA privileges, I will try to provide that info. Please just let me know what would be useful.
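For example, I'm guessing I could try a couple of simple queries like the ones below, assuming my account has been granted access to these views (if they need extra privileges, I won't be able to run them):

  select * from v$version;

  select name, value
    from v$parameter
   where name in ('cpu_count', 'db_block_size', 'sga_max_size',
                  'pga_aggregate_target', 'parallel_max_servers');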

As some people in the thread have mentioned, they would like to know more about the process, so let me try to provide a bit more info and see if it helps. All the procedure does is perform an insert into .. select. The select portion of the query picks out about 25 columns, which is a subset of all the columns. It also performs a couple of decodes, a substr, an nvl, a trunc, and some simple arithmetic. The where portion of the select joins 5 tables; two of these joins are outer joins. One of the two outer-joined tables contains about 15 records and the other about 9,000 records. The main driving table has about 300 million records in it. Aside from the two outer joins mentioned above, there are some filter conditions and two additional regular joins, to tables containing 5.5 million and 15,000 records respectively. The 300-million-record table is partitioned. The target table is not partitioned and has no indexes. When this process runs, I do not believe any other processes are using these tables. All tables are in the same database; nothing is being accessed across a dblink.
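To give a rough idea of the shape of the statement, it looks something like this. The table and column names below are made up and the column list is abbreviated; the real statement selects about 25 columns:

  insert into target_tab
    (col1, col2, col3, col4, col5, col6)
  select decode(d.status_cd, 'A', 1, 0),
         substr(d.descr, 1, 30),
         nvl(o1.rate, 0),
         trunc(d.txn_dt),
         d.qty * nvl(o2.factor, 1),
         s.category
    from driving_tab d                                    -- ~300 million rows, partitioned
    join med_lookup   m  on m.id   = d.med_id             -- ~5.5 million rows
    join small_lookup s  on s.code = d.small_cd           -- ~15,000 rows
    left outer join tiny_lookup  o1 on o1.code = d.a_cd   -- ~15 rows
    left outer join other_lookup o2 on o2.code = d.b_cd   -- ~9,000 rows
   where d.status_cd in ('A', 'B')
     and d.txn_dt >= to_date('01-JAN-2005', 'DD-MON-YYYY');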

Daniel had mentioned an article on Oracle's website. I downloaded and read it, found it quite informative, and it taught me some new stuff. Thanks, Daniel, for recommending it. It seems to make the same point as all the other articles and postings I've read: that programmatic logic using bulk collects will be slower than a single SQL statement. Does that apply in all scenarios? In other words, does that statement hold true regardless of whether you're loading 50,000 records or 500 million? I did try to recreate some of the logic discussed in the article; the code is essentially the same code Daniel posted in a follow-up post. I have only run it on a subset of the data, approximately 3 million records, and it seemed to run in about the same time as the old code on the same sample set.
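For reference, the test code follows the usual bulk collect / forall pattern from the article. A simplified sketch of it, with placeholder table and column names rather than the real ones, looks roughly like this:

  declare
    cursor c is
      select col1, col2, col3        -- in the real test this is the full select
        from source_tab;             -- described above, not a single table
    type t_tab is table of c%rowtype index by pls_integer;
    l_rows t_tab;
  begin
    open c;
    loop
      fetch c bulk collect into l_rows limit 10000;   -- fetch in batches of 10,000
      exit when l_rows.count = 0;
      forall i in 1 .. l_rows.count
        insert into target_tab values l_rows(i);      -- target columns match the cursor
      commit;                                         -- commit after each batch
    end loop;
    close c;
  end;
  /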

A view was also mentioned in the posts. This data is being used by a web application that performs some analytics on it; I'm not sure of the details of the app. The data needs to be in a table because the source tables do change at times. If the table were a view, the results generated by the web app could change, which is not what the users want. Also, I think the query runs too slowly to be used as a view.

I must admit that this is my first experience working with a data warehouse this large. As such, I'm not sure if it is normal for it to take 10 or more hours to load this data. Has anyone worked with data sets this large and seen similar timings? Am I expecting too much? I understand that many factors depend on the configuration of the database and the hardware it runs on. Unfortunately, my access to that hardware and its configuration is non-existent, and I really do not know much about it. I believe the physical machine has 12 CPUs, with 3 logically allocated to this database. I'm not sure how much RAM is on it, and I believe it is running AIX. That is all I know.

Finally, to address some comments Billy made: I know the select part of the query runs slowly, but I do not know if the insert is running slowly as well, and I'm not sure how to test that. If you have any advice as to how I can give you more info, I would be glad to do it. Also, you made some references to PQ, CTAS, and VLT. Correct me if I'm wrong, but I'm assuming PQ is Parallel Query and VLT is Virtual Logical Table. What is CTAS? I'm not quite familiar with that acronym.
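Going back to the question of whether the insert itself is slow: would something like this be a fair way to time just the select on its own, so that any insert overhead can be separated out? This is just a guess at an approach; I realize the optimizer might not do exactly the same work once the insert is removed.

  -- in SQL*Plus:
  set timing on

  select count(*)
    from (
          -- paste the full select from the insert .. select here
          select 1 as dummy from dual
         );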

I would just like to reiterate that I appreciate everyone's help on the matter so far.

Thanks,

Alex

Received on Mon Jun 13 2005 - 12:30:58 CDT
