Home » SQL & PL/SQL » SQL & PL/SQL » Large volume data compare - NEED HELP (Oracle 11)
Large volume data compare - NEED HELP [message #665643] Wed, 13 September 2017 22:06 Go to next message
Member2014
Messages: 33
Registered: July 2014
Location: USA
Member
Hello,

I need help with a solution we are trying to build. We have a process which drops and recreates 2 large tables (each contains about 200MM records and about 800 columns) daily.
I need to extract records from each of the tables daily after they are built if a subset of column values for the primary key has changed. The two issues I have is:
1. Since its a drop and recreate, we currently do not have daily backups of the table to compare
2. its a very large volume and so not sure the best way to compare data

Below is a sample of one of the tables.
table name: 
create table name_address
(id varchar2(100),
fname varchar2(100),
lname varchar2(200),
address1 varchar2(500),
address2 varchar2(500),
city varchar2(50),
state varchar2(50),
zip varchar2(50));

Extract layout is as below. We need to pull records in the file for any ID's when there is a change in city or state or zip.
The column action indicates I - if its a new record U - if any of the values indicated above have been updated
id, fname, lname, city, state, zip, action

I will need to do the following -
1. After the extract for a given day is completed, create a table daily with the subset of columns needed for comparison say prev_name_address
2. Next day once the table has been built compare the two tables prev_name_address and name_address against specific columns (city/ state/ zip) to determine the new/ updated records 

My problem areas -
1. What is the best way to back up a subset of data for a table which is so large (~200MM records with ~700 columns)
2. What is the best way to compare the backup vs production table to identify the inserts/updates based on the city/ state/ zip value changes

Any help is greatly appreciated. Thank you in advance.

Large volume data compare - NEED HELP [message #665644 is a reply to message #665643] Wed, 13 September 2017 22:06 Go to previous messageGo to next message
Member2014
Messages: 33
Registered: July 2014
Location: USA
Member
Hello,

I need help with a solution we are trying to build. We have a process which drops and recreates 2 large tables (each contains about 200MM records and about 800 columns) daily.
I need to extract records from each of the tables daily after they are built if a subset of column values for the primary key has changed. The two issues I have is:
1. Since its a drop and recreate, we currently do not have daily backups of the table to compare
2. its a very large volume and so not sure the best way to compare data

Below is a sample of one of the tables.
table name: 
create table name_address
(id varchar2(100),
fname varchar2(100),
lname varchar2(200),
address1 varchar2(500),
address2 varchar2(500),
city varchar2(50),
state varchar2(50),
zip varchar2(50));

Extract layout is as below. We need to pull records in the file for any ID's when there is a change in city or state or zip.
The column action indicates I - if its a new record U - if any of the values indicated above have been updated
id, fname, lname, city, state, zip, action

I will need to do the following -
1. After the extract for a given day is completed, create a table daily with the subset of columns needed for comparison say prev_name_address
2. Next day once the table has been built compare the two tables prev_name_address and name_address against specific columns (city/ state/ zip) to determine the new/ updated records 

My problem areas -
1. What is the best way to back up a subset of data for a table which is so large (~200MM records with ~700 columns)
2. What is the best way to compare the backup vs production table to identify the inserts/updates based on the city/ state/ zip value changes

Any help is greatly appreciated. Thank you in advance.

Re: Large volume data compare - NEED HELP [message #665646 is a reply to message #665644] Wed, 13 September 2017 22:28 Go to previous messageGo to next message
BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal
Senior Member
I may just be old & senile but I could benefit from clarifications.
BTW - tables don't contain any records; only rows.
1) in integer value what does 200MM equal?
2) on "drop & recreate" what is the data source for the data values. (Why not TRUNCATE instead?)
3) You say that 2 large tables are involved, but I am unclear what the difference is between the 2 tables. (Why 2 tables?)
4) AFAIK SQL does not have any "extract" statement; so what exactly needs to be done.
5) I am unclear what is being compared to what?

A wise man once gave me the following advice.
First make it work, then make it fancy or fast.

If you post all the necessary SQL that successfully completes the task, I am sure you will get ideas on how to make it faster!
Re: Large volume data compare - NEED HELP [message #665647 is a reply to message #665646] Wed, 13 September 2017 22:56 Go to previous messageGo to next message
Member2014
Messages: 33
Registered: July 2014
Location: USA
Member
Apologize if its a little unclear. I'm still trying to figure out how to implement but hope this helps -

1) in integer value what does 200MM equal?
200,000,000 i.e. 200 million
2) on "drop & recreate" what is the data source for the data values. (Why not TRUNCATE instead?)
We have multiple data sources used to build this table and a series of logic involved. The process i inherited had been built to drop an create. I'm not sure what the though process was when this was built. Definately an optimization that can be looked into
3) You say that 2 large tables are involved, but I am unclear what the difference is between the 2 tables. (Why 2 tables?)
One table contains name/address data and the other email. For the purpose of this example I have given a sample of one of the tables. Once I have the logic for one of the tables it should be a simple replication for the other
4) AFAIK SQL does not have any "extract" statement; so what exactly needs to be done.
5) I am unclear what is being compared to what?
I have put in code for a simple way to generate the extract. This will generate the result but probably not the most optimized way.

1. Back up table with necessary columns to compare (day 1)
CREATE TABLE PREV_NAME_ADDRESS PARALLEL 4
AS SELECT id, fname, lname, city, state, zip FROM NAME_ADDRESS;

2. Once the NAME_ADDRESS table is built (day 2) compare the table with PREV_NAME_ADDRESS on a subset of columns i.e. city/ state/ zip
CREATE TABLE NAME_ADDRESS_EXTRACT PARALLEL 4 
AS SELECT id, fname, lname, city, state, zip,
	CASE WHEN n.id IS NOT NULL AND o.id IS NULL THEN 'I' ELSE 'U' END IF AS action
FROM NAME_ADDRESS n 
LEFT OUTER JOIN PREV_NAME_ADDRESS o
ON n.id = o.id
WHERE n.city <> o.city
AND n.state <> o.state
AND n.zip <> o.zip;

3. Generate an extract of the NAME_ADDRESS_EXTRACT table once its built to a flat file
 SELECT * FROM NAME_ADDRESS_EXTRACT;

The logic for populating the "action" column I need to test.

Please note that the name_address table I have built here is just a subset of the table we have on the DB. My biggest concern is the comparison of the columns between name_address and prev_name_address.
Here I have shown 3 columns to be compared out of 8 columns. In my DB table I need to compare 18 columns out of about 700 columns.

Is there a better way like adding indices for faster comparison, hash function, or something else which will make the comparison more efficient and faster?

Thank you again for looking into it.
Re: Large volume data compare - NEED HELP [message #665648 is a reply to message #665647] Wed, 13 September 2017 23:14 Go to previous messageGo to next message
BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal
Senior Member
I admit I am just dense.
What is being compared to what?
What constitute a match?
What is a mis-match?
Which is more important & what should be done upon detection?

Please stop obsessing on "efficiency"!
If you can't do basic detection, you have nothing to "make faster".
Re: Large volume data compare - NEED HELP [message #665656 is a reply to message #665648] Thu, 14 September 2017 03:37 Go to previous messageGo to next message
cookiemonster
Messages: 13917
Registered: September 2008
Location: Rainy Manchester
Senior Member
It'd probably help if you described the business process you're trying to implement, rather than what this particular implementation is trying to do.
There may well be easier/ more efficient ways than creating tables on the fly every day.
Re: Large volume data compare - NEED HELP [message #665657 is a reply to message #665648] Thu, 14 September 2017 03:44 Go to previous messageGo to next message
quirks
Messages: 82
Registered: October 2014
Member
CREATE TABLE NAME_ADDRESS(
    ID VARCHAR2(100)
   ,FNAME VARCHAR2(100)
   ,LNAME VARCHAR2(200)
   ,ADDRESS1 VARCHAR2(500)
   ,ADDRESS2 VARCHAR2(500)
   ,CITY VARCHAR2(50)
   ,STATE VARCHAR2(50)
   ,ZIP VARCHAR2(50)
);

INSERT INTO NAME_ADDRESS
     VALUES (
                '01'
               ,'01 first name'
               ,'01 last_name'
               ,'01 address1'
               ,'01 address2'
               ,'01 city'
               ,'01 state'
               ,'01 zip'
            );

INSERT INTO NAME_ADDRESS
     VALUES (
                '02'
               ,'02 first name'
               ,'02 last_name'
               ,'02 address1'
               ,'02 address2'
               ,'02 city'
               ,'02 state'
               ,'02 zip'
            );

INSERT INTO NAME_ADDRESS
     VALUES (
                '03'
               ,'03 first name'
               ,'03 last_name'
               ,'03 address1'
               ,'03 address2'
               ,'03 city'
               ,'03 state'
               ,'03 zip'
            );

INSERT INTO NAME_ADDRESS
     VALUES (
                '04'
               ,'04 first name'
               ,'04 last_name'
               ,'04 address1'
               ,'04 address2'
               ,'04 city'
               ,'04 state'
               ,'04 zip'
            );

COMMIT;

CREATE TABLE PREV_NAME_ADDRESS
PARALLEL 4
AS
    SELECT NAME_ADDRESS.ID
          ,NAME_ADDRESS.FNAME
          ,NAME_ADDRESS.LNAME
          ,NAME_ADDRESS.ADDRESS1
          ,NAME_ADDRESS.ADDRESS2
          ,NAME_ADDRESS.CITY
          ,NAME_ADDRESS.STATE
          ,NAME_ADDRESS.ZIP
          ,'NONE' AS ACTION
      FROM NAME_ADDRESS;

DELETE FROM NAME_ADDRESS
      WHERE ID = '03';

UPDATE NAME_ADDRESS
   SET CITY = 'canged city'
 WHERE ID = '01';

UPDATE NAME_ADDRESS
   SET ADDRESS1 = 'changed address1'
 WHERE ID = '02';

INSERT INTO NAME_ADDRESS
     VALUES (
                '05'
               ,'05 first name'
               ,'05 last_name'
               ,'05 address1'
               ,'05 address2'
               ,'05 city'
               ,'05 state'
               ,'05 zip'
            );

COMMIT;

SELECT COALESCE(NEW_VALS.ID, OLD_VALS.ID) AS ID
      ,COALESCE(NEW_VALS.FNAME, OLD_VALS.FNAME) AS FNAME
      ,COALESCE(NEW_VALS.LNAME, OLD_VALS.LNAME) AS LNAME
      ,COALESCE(NEW_VALS.ADDRESS1, OLD_VALS.ADDRESS1) AS ADDRESS1
      ,COALESCE(NEW_VALS.ADDRESS2, OLD_VALS.ADDRESS2) AS ADDRESS2
      ,COALESCE(NEW_VALS.CITY, OLD_VALS.CITY) AS CITY
      ,COALESCE(NEW_VALS.STATE, OLD_VALS.STATE) AS STATE
      ,COALESCE(NEW_VALS.ZIP, OLD_VALS.ZIP) AS ZIP
      ,CASE
           WHEN NEW_VALS.ID IS NULL
           THEN
               'D'
           WHEN OLD_VALS.ID IS NULL
           THEN
               'I'
           WHEN    NEW_VALS.ID
--                || NEW_VALS.FNAME
--                || NEW_VALS.LNAME
--                || NEW_VALS.ADDRESS1
--                || NEW_VALS.ADDRESS2
                || NEW_VALS.CITY
                || NEW_VALS.STATE
                || NEW_VALS.ZIP =
                   OLD_VALS.ID
--                || OLD_VALS.FNAME
--                || OLD_VALS.LNAME
--                || OLD_VALS.ADDRESS1
--                || OLD_VALS.ADDRESS2
                || OLD_VALS.CITY
                || OLD_VALS.STATE
                || OLD_VALS.ZIP
           THEN
               'NONE'
           ELSE
               'U'
       END
           AS ACTION
  FROM NAME_ADDRESS NEW_VALS FULL OUTER JOIN PREV_NAME_ADDRESS OLD_VALS ON (NEW_VALS.ID = OLD_VALS.ID);

[Updated on: Thu, 14 September 2017 03:46]

Report message to a moderator

Re: Large volume data compare - NEED HELP [message #665659 is a reply to message #665657] Thu, 14 September 2017 05:33 Go to previous messageGo to next message
John Watson
Messages: 8922
Registered: January 2010
Location: Global Village
Senior Member
If all you want to do is capture changes, it might be simpler to write a trigger to do this. I'm no great fan of triggers, but this could be a reasonable usage case.
Re: Large volume data compare - NEED HELP [message #665672 is a reply to message #665659] Thu, 14 September 2017 13:14 Go to previous messageGo to next message
Member2014
Messages: 33
Registered: July 2014
Location: USA
Member
BlackSwan - May be i'm not doing a good job explaining here, but what i'm trying to compare two copies of the same table i.e. copy of the name_address table from yesterday (PREV_NAME_ADDRESS) with a copy of the name_address table (NAME_ADDRESS) built today.
I am not obsessing on effeciency but the way I see it thats one of the main concerns I have due to the size of the tables. If not the logic I have listed above will work just fine on a smaller table.

cookiemonster - The business case here is that we have a large table which gets built everyday (drop and create) using a bunch of underlying tables. We need to determine the data for which the address values have changed and generate an extract of those specific records.
This is a rather simple effort with the basic steps outlined above. I am trying to determine the best approach for the volume we will be handling. Hope this makes sense.

John Watson - I don't think triggers can be used here cos this table is a daily drop and create. If it were to be an update that would have been a good solution.

quirks - I had not thought about this approach. I will certainly give it a try.
Re: Large volume data compare - NEED HELP [message #665673 is a reply to message #665672] Thu, 14 September 2017 13:27 Go to previous messageGo to next message
BlackSwan
Messages: 26766
Registered: January 2009
Location: SoCal
Senior Member
Are the ID values really UNIQUE between the tables?

SELECT COL1, ..., COL18 FROM TODAYS_TABLE TT
MINUS
SELECT COL1, ..., COL18 FROM YESTERDAYS_TABLE YT;

results set from above is changed rows.
If all 18 columns are INDEXED, then the 2 tables themselves will NOT be read.
Re: Large volume data compare - NEED HELP [message #665674 is a reply to message #665673] Thu, 14 September 2017 21:25 Go to previous messageGo to next message
Member2014
Messages: 33
Registered: July 2014
Location: USA
Member
Thanks BlackSwan I will try this approach.
Re: Large volume data compare - NEED HELP [message #665678 is a reply to message #665674] Fri, 15 September 2017 02:31 Go to previous messageGo to next message
quirks
Messages: 82
Registered: October 2014
Member
@BlackSwan: You might not be able to identify the deleted rows.

@Member2014: I just found this promising approach: http://blog.sqlora.com/en/how-to-simplify-the-data-historization/
Below is an adaption to your case. I'd be glad if you could give me a feed back which of my versions is faster. This one should be (because of skipping the full outer join), but you never know.
WITH
    UNION_OLD_NEW
    AS
        (SELECT 'NEW' SOURCE
               ,NAME_ADDRESS.ID
               ,NAME_ADDRESS.FNAME
               ,NAME_ADDRESS.LNAME
               ,NAME_ADDRESS.ADDRESS1
               ,NAME_ADDRESS.ADDRESS2
               ,NAME_ADDRESS.CITY
               ,NAME_ADDRESS.STATE
               ,NAME_ADDRESS.ZIP
           FROM NAME_ADDRESS
         UNION ALL
         SELECT 'OLD' SOURCE
               ,PREV_NAME_ADDRESS.ID
               ,PREV_NAME_ADDRESS.FNAME
               ,PREV_NAME_ADDRESS.LNAME
               ,PREV_NAME_ADDRESS.ADDRESS1
               ,PREV_NAME_ADDRESS.ADDRESS2
               ,PREV_NAME_ADDRESS.CITY
               ,PREV_NAME_ADDRESS.STATE
               ,PREV_NAME_ADDRESS.ZIP
           FROM PREV_NAME_ADDRESS),
    PREPARE_4_ANALYZE
    AS
        (SELECT UNION_OLD_NEW.*
               ,COUNT(*)
                    OVER(
                        PARTITION BY ID
--                                    ,FNAME
--                                    ,LNAME
--                                    ,ADDRESS1
--                                    ,ADDRESS2
                                    ,CITY
                                    ,STATE
                                    ,ZIP
                    )
                    CNT  --  cnt = 2 - two versions are the same, otherwise INS/UPD/DEL
               ,COUNT(*) OVER (PARTITION BY ID)
                    CNT_KEY  -- the count of versions per ID
           FROM UNION_OLD_NEW),
    EXECUTE_ANALYZE
    AS
        (SELECT PREPARE_4_ANALYZE.*
               ,CASE
                    WHEN (CNT = 1 AND CNT_KEY = 2) THEN 'U'
                    WHEN (CNT = 2 AND CNT_KEY = 2) THEN 'NONE'
                    WHEN (CNT = 1 AND CNT_KEY = 1) THEN 
                         CASE 
                              WHEN SOURCE = 'OLD' THEN 'D' 
                              WHEN SOURCE = 'NEW' THEN 'I' 
                         END
                END
                    AS ACTION
           FROM PREPARE_4_ANALYZE)
SELECT EXECUTE_ANALYZE.ID
      ,EXECUTE_ANALYZE.FNAME
      ,EXECUTE_ANALYZE.LNAME
      ,EXECUTE_ANALYZE.ADDRESS1
      ,EXECUTE_ANALYZE.ADDRESS2
      ,EXECUTE_ANALYZE.CITY
      ,EXECUTE_ANALYZE.STATE
      ,EXECUTE_ANALYZE.ZIP
      ,EXECUTE_ANALYZE.ACTION
  FROM EXECUTE_ANALYZE
 WHERE 
      (SOURCE = 'NEW' AND ACTION IN ('U', 'I', 'NONE')) OR 
      (SOURCE = 'OLD' AND ACTION = 'D') -- filter values
Re: Large volume data compare - NEED HELP [message #665704 is a reply to message #665678] Fri, 15 September 2017 15:41 Go to previous message
Member2014
Messages: 33
Registered: July 2014
Location: USA
Member
quirks - This worked perfect.

Another step I added was to create work tables with just the columns I need to compare. The work table creation took about 3 minutes and data comparison for the full volume about 20 mins. Thanks so much for the help. Really appreciate it.
Previous Topic: Adding ax extra column to the existing constraint
Next Topic: LPX-00230: invalid character 0 (U+0000) found in a Name or Nmtoken
Goto Forum:
  


Current Time: Fri Mar 29 06:04:03 CDT 2024