Building a Dynamic Oracle ETL Procedure
If you are on Oracle 9i or higher, you have external tables and pipelined table functions available to meet your ETL needs. But in order to utilize these tools in a dynamic ETL environment, you need to design PL/SQL procedures that can support the loading of files dynamically. This article will provide you with one design that has proven to be very robust and scalable.
In order to understand the procedure design, you should have a basic understanding of how external tables and pipelined table functions are used. You can find more information and detailed examples at http://www.oracle-developer.com/oracle_etl.html. The result of both tools is the ability to select data from a file and insert it into a relational table.
The ETL procedure should accept the file name or path as an input parameter. If you need to capture any information from the file path into the relational table, then it is a good idea to use the file path as an input. For example if I had sales data that was uploaded from 3 different stores in 3 different incoming directories and I wanted to load all stores into the same table, I would probably want to include a column to identify the store in my table. However, if the file format and file name did not include the store information, I would need to capture the store information from the file path.
STORE A: /incoming/sales_data/storeA/20070720.dat
STORE B: /incoming/sales_data/storeB/20070720.dat
STORE C: /incoming/sales_data/storeC/20070720.dat
We would pass the entire path into the ETL procedure and the procedure would extract the store using a SQL query in the PL/SQL such as
SQL> with s as (select '/incoming/sales_data/storeA/20070720.dat' p_file_path from dual) 2 select substr(p_file_path, instr(p_file_path, '/', 1,3)+1, 3 instr(p_file_path, '/', 1,4)-instr(p_file_path, '/', 1,3)-1) STORE_NAME from s 4 / STORE_ ------ storeA
We could then extract the file name using a query such as
SQL> with s as (select '/incoming/sales_data/storeA/20070720.dat' p_file_path from dual) 2 select substr(p_file_path, instr(p_file_path,'/', -1)+1, 3 length(p_file_path)-instr(p_file_path,'/', -1)) FILE_NAME from s 4 / FILE_NAME ------------ 20070720.dat
If we want, we can get the file size as well. First create the following function:
CREATE OR REPLACE FUNCTION flength ( location_in IN VARCHAR2, file_in IN VARCHAR2 ) RETURN PLS_INTEGER IS TYPE fgetattr_t IS RECORD ( fexists BOOLEAN, file_length PLS_INTEGER, block_size PLS_INTEGER ); fgetattr_rec fgetattr_t; BEGIN UTL_FILE.fgetattr ( location => location_in, filename => file_in, fexists => fgetattr_rec.fexists, file_length => fgetattr_rec.file_length, block_size => fgetattr_rec.block_size ); RETURN fgetattr_rec.file_length; END flength;
And now, inside the PL/SQL procedure, we can capture the file size:
declare l_file_size number; begin l_file_size := flength ('DAT_DIR’,l_filename); end; /
Now that we have the file name, we are ready to dynamically redefine the external table’s data source.
From our original definition, the external table SAMPLE_EXT is pointing to a file called “sample.csv”. In order to load the different files from our incoming directories based on the input parameter, we will need to alter the table appropriately. After we have successfully stored the file name from the input parameter, we are ready to define our external table to point to the current data file.
execute immediate 'alter table SAMPLE_EXT location(''' || p_file_name || ''')';
Now that our external table is pointing at the file from our input parameter, we are ready to begin our insert from the external table to our relational table.
Since the external table is pointing at our new file, all we need to do is create the insert statement. We can pass any additional variables from the statement.
Insert into sample_tab Select * from sample_ext /
If there were some additional values defined in the destination table, we could include the variables in our SELECT. For example if we added a column, LOAD_DATE to the end of the relational table, SAMPLE_TAB, then we would just add SYSDATE at the end of the select fields as such…
Insert into sample_tab Select A.*, sysdate from sample_ext A /
If we need to capture the number of rows inserted, we can do so using (prior to the commit)
ln_rowcount := SQL%rowcount;
Here are the set up steps for the demonstration
Create directory DAT_DIR as ‘/usr/tmp’;
File: july19.csv saved to /usr/tmp
ID, BEGIN_DATE, FIRST_NAME, LAST_NAME, STATUS 1, 19-JUL-07, John, Adams, Active 2, 19-JUL-07, Tyler, Howell, Active 3, 19-JUL-07, Jim, Lopez, Active 4, 19-JUL-07, Carlos, White, Inactive 5, 19-JUL-07, Scott, Tiger, Active
File: july20.csv saved to /usr/tmp
ID, BEGIN_DATE, FIRST_NAME, LAST_NAME, STATUS 1, 20-JUL-07, John, Adams, Active 2, 20-JUL-07, Tyler, Howell, Active 3, 20-JUL-07, Jim, Lopez, Active 4, 20-JUL-07, Carlos, White, Inactive 5, 20-JUL-07, Scott, Tiger, Active
External and matching relational table definition
CREATE TABLE SAMPLE_EXT ( ID NUMBER, BEGIN_DATE VARCHAR2(10), FIRST_NAME VARCHAR2(30), LAST_NAME VARCHAR2(30), STATUS VARCHAR2(10) ) ORGANIZATION EXTERNAL ( TYPE ORACLE_LOADER DEFAULT DIRECTORY DAT_DIR ACCESS PARAMETERS ( records delimited by newline LOAD WHEN (ID != ‘ID’) badfile BAD_DIR:'SAMPLE_EXT%a_%p.bad' logfile LOG_DIR:'SAMPLE_EXT%a_%p.log' fields terminated by ',' optionally enclosed by '"' LRTRIM MISSING FIELD VALUES ARE NULL ) LOCATION (‘sample.csv') ) PARALLEL 4 REJECT LIMIT 1000; CREATE TABLE SAMPLE_TAB ( ID NUMBER, BEGIN_DATE VARCHAR2(10), FIRST_NAME VARCHAR2(30), LAST_NAME VARCHAR2(30), STATUS VARCHAR2(10) );
Now that the objects are in place, we are ready to define our procedure.
The Final Procedure
Including each of the steps above, we can create a PL/SQL procedure that loads any file from our incoming directory (DAT_DIR) based on the input parameter. This gives us a powerful tool for loading files in a large scale, dynamic environment. Based on the information above, the PL/SQL code might be…
SQL> CREATE OR REPLACE FUNCTION SAMPLE_ETL_FNC( P_FILE_PATH VARCHAR2 ) RETURN NUMBER AS 2 l_filename varchar2(100); 3 l_rowcount number; 4 l_sqlerrm varchar(2000); 5 l_filesize number; 6 BEGIN 7 8 DBMS_OUTPUT.PUT_LINE(to_char(sysdate, 'YYYY-MON-DD HH:MI:SS') || ' - BEGIN ETL PROCEDURE - Path : ' || p_file_path); 9 10 --Flexible to allow for file name or file path 11 if instr(p_file_path, '/') > 0 then 12 select substr(p_file_path, instr(p_file_path,'/', -1)+1, 13 length(p_file_path)-instr(p_file_path,'/', -1)) 14 INTO l_filename 15 from dual; 16 else 17 l_filename := p_file_path; 18 end if; 19 20 --Get file size if required 21 --l_filesize := flength ('DAT_DIR',l_filename); 22 DBMS_OUTPUT.PUT_LINE('Filename: ' || l_filename); 23 24 BEGIN 25 --Alter the Log and Bad file names for debugging 26 --Can modify the bad and log file names if required 27 DBMS_OUTPUT.PUT_LINE('ALTER EXTERNAL TABLE DATA SOURCE'); 28 execute immediate 'alter table SAMPLE_EXT location('''||l_filename||''')'; 29 --Prepare Insert SQL 30 DBMS_OUTPUT.PUT_LINE('BEGIN INSERT FROM EXTERNAL TO DATA TABLE'); 31 execute immediate 'Insert into sample_tab select * from sample_ext A'; 32 l_rowcount := SQL%rowcount; 33 DBMS_OUTPUT.PUT_LINE(to_char(sysdate,'DD-MON-YYYY HH:MI:SS AM') || ': INSERT SUCCESSFUL... IN SERTED ' || l_rowcount || ' ROWS'); 34 COMMIT; 35 36 RETURN 0; 37 exception 38 when others then 39 DBMS_OUTPUT.PUT_LINE('INSERT FAILED'); 40 l_sqlerrm:=sqlerrm; 41 DBMS_OUTPUT.PUT_LINE(l_sqlerrm); 42 ROLLBACK; 43 commit; 44 RETURN 1; 45 END; 46 47 exception 48 when others then 49 DBMS_OUTPUT.PUT_LINE(SQLERRM); 50 RETURN 1; 51 END; 52 / Function created.
Test the Oracle ETL function by placing the sample files described above in the DAT_DIR folder.
SQL> select * from sample_tab 2 / no rows selected SQL> set serveroutput on SQL> declare 2 l_return number; 3 begin 4 l_return := SAMPLE_ETL_FNC('july19.csv'); 5 dbms_output.put_line('RETURN STATUS: ' || l_return); 6 end; 7 / 2007-JUL-24 04:49:11 - BEGIN ETL PROCEDURE - Path: july19.csv Filename: july19.csv ALTER EXTERNAL TABLE DATA SOURCE BEGIN INSERT FROM EXTERNAL TO DATA TABLE 24-JUL-2007 04:49:11 PM: INSERT SUCCESSFUL... INSERTED 5 ROWS RETURN STATUS: 0 PL/SQL procedure successfully completed. SQL> select * from sample_tab 2 / ID BEGIN_DATE FIRST_NAME LAST_NAME STATUS ---------- ---------- ------------------------------ ------------------------------ -------- 1 19-JUL-07 John Adams Active 2 19-JUL-07 Tyler Howell Active 3 19-JUL-07 Jim Lopez Active 4 19-JUL-07 Carlos White Inactive 5 19-JUL-07 Scott Tiger Active 5 rows selected. SQL> declare 2 l_return number; 3 begin 4 l_return := SAMPLE_ETL_FNC('july20.csv'); 5 dbms_output.put_line('RETURN STATUS: ' || l_return); 6 end; 7 / 2007-JUL-24 04:50:16 - BEGIN ETL PROCEDURE - Path: july20.csv Filename: july20.csv ALTER EXTERNAL TABLE DATA SOURCE BEGIN INSERT FROM EXTERNAL TO DATA TABLE 24-JUL-2007 04:50:16 PM: INSERT SUCCESSFUL... INSERTED 5 ROWS RETURN STATUS: 0 PL/SQL procedure successfully completed. SQL> select * from sample_tab 2 / ID BEGIN_DATE FIRST_NAME LAST_NAME STATUS ---------- ---------- -------------------- ------------------------------ -------- 1 19-JUL-07 John Adams Active 2 19-JUL-07 Tyler Howell Active 3 19-JUL-07 Jim Lopez Active 4 19-JUL-07 Carlos White Inactive 5 19-JUL-07 Scott Tiger Active 1 20-JUL-07 John Adams Active 2 20-JUL-07 Tyler Howell Active 3 20-JUL-07 Jim Lopez Active 4 20-JUL-07 Carlos White Inactive 5 20-JUL-07 Scott Tiger Active 10 rows selected.
V.J. Jain is an Oracle Database and Applications Consultant and owner of Varun Jain, Inc. - Oracle Consulting. With over 12 years of experience with database systems, he specializes in database performance, development, interfaces, and high performance solutions. Based in Orange County, California, he actively explores Oracle's newest technologies and is a member of the Oracle Partner Network and Beta program. Additional material by him can be found at http://www.oracle-developer.com.