RE: Fastest way to count exact number of rows in a very large table

From: Mark W. Farnham <mwf_at_rsiz.com>
Date: Mon, 5 Oct 2020 18:48:13 -0400
Message-ID: <058b01d69b69$9bcec8e0$d36c5aa0$_at_rsiz.com>



Doing his approximate math, 12 x 20 is 240 partitions. My guess would be that that exceeds his optimal parallelism.  

I would have him guess his maximum parallelism, then generate that number of scripts, each containing a list of statements of the form select count(*) from tab partition (part_name);  

Run those n scripts concurrently, perhaps inserting the partition subtotals somewhere so you don't have to add them up by hand. Because each individual count runs serially, you avoid pairs of producers and consumers, so you should be able to run at higher logical parallelism. IF these partitions are stored in different tablespaces on different devices, organizing the threads so they pound on the devices a balanced number at a time may be useful if the speed of spinning rust is actually your limiting factor.  
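To make that concrete, here is a minimal sketch of a generator (my_counts is a made-up results table; the partition names come from user_tab_partitions):

  -- Spool this to a file, split the output across n scripts,
  -- and run the scripts concurrently.
  select 'insert into my_counts (partition_name, cnt) '
      || 'select ''' || partition_name || ''', count(*) '
      || 'from test_data partition (' || partition_name || ');'
    from user_tab_partitions
   where table_name = 'TEST_DATA'
   order by partition_position;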

Or, if you are on a sufficient version, generate the concurrent UNION ALL count of all the partitions and tune your parallelism. Probably each piece serial still wins. I think the example in the documentation is correct, but I can't really remember.  
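A minimal sketch of that shape, with made-up partition names (12c can run the UNION ALL branches concurrently; generate the branches from user_tab_partitions):

  select sum(cnt) from (
    select count(*) cnt from test_data partition (p_2001_01)
    union all
    select count(*) cnt from test_data partition (p_2001_02)
    union all
    -- ... one branch per partition ...
    select count(*) cnt from test_data partition (p_2020_12)
  );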

The point is to do the counting in logical parallel by partition so Oracle will stream down the individual partitions for sure without any chance of a weird hash merge assembling your row sources.  

I believe this is what the fellow showing the pseudo code for the pipelined function intended, but the point about consuming and holding down sessions is valid. If you generate the per-partition queries statically, for either separate queries or a concurrent UNION ALL (which has the advantage of computing the overall answer for you), it should work pretty darn fast.  

What parallel was actually used for the 42 minute version? (Is that 42 real or from the book?)  

From: oracle-l-bounce_at_freelists.org [mailto:oracle-l-bounce_at_freelists.org] On Behalf Of Andy Sayer Sent: Monday, October 05, 2020 5:33 PM
To: Reen, Elizabeth
Cc: Ashoke Mandal; ahmed.fikri_at_t-online.de; gogala.mladen_at_gmail.com; list, oracle Subject: Re: Fastest way to count exact number of rows in a very large table  

There’s no point doing it at all :)

If the XTTS succeeds and you don’t end up with a physically corrupt file, all the rows that were in the table before you made the tablespace read only will still be there.  

BTW, I have strong doubts that just sticking a query inside a pipelined function will make it any faster, assuming you aren't writing to the table at the same time (which would be a silly thing to do if the goal is to get the same count as on a different system). /*+parallel*/ is valid syntax for running a query in parallel. I'd guess that this specific code would be much slower, as it opens a parallel cursor to get the partition names from some table and needs that cursor open before it can query the actual table - you'll be wasting parallel threads doing nothing.  

Mladen, RE: APPROX_FOR_COUNT_DISTINCT, it reduces the CPU impact for count distinct, but you don't have that requirement for a standard count - there's no need to decide whether you've seen the value before. It is incredibly fast (and highly accurate) when compared to count distinct, but a standard count(*) would be quicker (and here the main time taken would probably be from IO).  
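To illustrate the distinction (using the UT_ID column from the thread, just for illustration):

  select approx_count_distinct(ut_id) from test_data;  -- fast NDV estimate
  select count(*) from test_data;                      -- exact row count, mostly IO-bound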

Thanks,

Andy  

On Mon, 5 Oct 2020 at 22:00, Reen, Elizabeth <elizabeth.reen_at_citi.com> wrote:

              I assumed that it would be done in parallel. Agreed that there was no sense doing it serially.    

Liz      

From: [External] ahmed.fikri at t-online.de <ahmed.fikri_at_t-online.de> Sent: Monday, October 5, 2020 4:35 PM
To: gogala.mladen_at_gmail.com; Reen, Elizabeth [ICG-IT]; Ashoke Mandal Cc: andysayer_at_gmail.com; list, oracle Subject: AW: Fastest way to count exact number of rows in a very large table  

If you are just interested in getting the exact number of rows in a large partitioned table, then the fastest way (in my opinion) is to use PL/SQL. Looping over the partitions serially will not help (it is even slower than select count(*) from huge_table). You have to create a pipelined function and use the parallel_enable feature.

Pseudo code (it only gives the idea; if you adjust it you can reduce the time significantly):

create or replace package pkg_test as

  type t_part_rec IS RECORD (column_value VARCHAR2(200));

  type t_parallel_test_ref_cursor IS REF CURSOR RETURN t_part_rec;

  type t_count_list IS TABLE OF NUMBER;

  function fu_count(p_cursor t_parallel_test_ref_cursor)
    RETURN t_count_list PIPELINED
    PARALLEL_ENABLE(PARTITION p_cursor BY HASH(column_value));

end pkg_test;
/

create or replace package body pkg_test as

function fu_count_part(p_part_name VARCHAR2) RETURN NUMBER IS

  v_count number := 0;
BEGIN

  -- Partition names cannot be bound in the PARTITION clause, so the name is
  -- concatenated in; each partition costs one hard parse, but the PARTITION
  -- clause guarantees partition pruning.
  EXECUTE IMMEDIATE
    'select count(*) from huge_table partition (' || p_part_name || ')'
    INTO v_count;
  RETURN v_count;

EXCEPTION
  when others then
    return 0;  -- swallows errors, as in the original sketch; log in real code
END fu_count_part;

function fu_count(p_cursor t_parallel_test_ref_cursor)
  RETURN t_count_list PIPELINED
  PARALLEL_ENABLE(PARTITION p_cursor BY HASH(column_value)) IS
  x VARCHAR2(200);
BEGIN
  LOOP
    FETCH p_cursor INTO x;
    exit when p_cursor%NOTFOUND;
    PIPE ROW (fu_count_part(x));
  END LOOP;
  RETURN;
END fu_count;

end pkg_test;
/

And then summing the piped partition counts returns the total:

SELECT SUM(column_value) AS total_rows
  FROM TABLE(pkg_test.fu_count(CURSOR(
         SELECT /*+ parallel(t 16) */ partition_name column_value
           FROM user_tab_partitions t
          WHERE table_name = 'HUGE_TABLE')));  

If you really want to cut the query time to less than 10 minutes, just try getting the code above working.  

Best regards

Ahmed    

-----Original Message-----

Subject: Re: Fastest way to count exact number of rows in a very large table

Date: 2020-10-05T21:58:55+0200

From: "Mladen Gogala" <gogala.mladen_at_gmail.com>

To: "elizabeth.reen_at_citi.com" <elizabeth.reen_at_citi.com>, "Ashoke Mandal" <ramukam1983_at_gmail.com>      

You can also try the following:

    SELECT /*+ PARALLEL(16) */ 100000*count(*) FROM test_data SAMPLE(0.001);

That would give an almost correct count of rows in the table (SAMPLE(0.001) reads 0.001 percent of the rows, hence the 100000 multiplier). Accuracy would be similar to SELECT NUM_ROWS FROM USER_TABLES WHERE TABLE_NAME = '<TABLE_NAME>';  
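A quick sanity check on how trustworthy that dictionary figure is:

    SELECT num_rows, sample_size, last_analyzed
      FROM user_tables
     WHERE table_name = 'TEST_DATA';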

BTW, has anyone played with APPROX_FOR_COUNT_DISTINCT parameter in 19c?

Regards  

On Mon, 2020-10-05 at 19:40 +0000, Reen, Elizabeth wrote:

              Since it is partitioned, why don't you just count each partition separately? Have you run stats on the old partitions? Assuming that prior years do not change, you should be able to get a count there. If you do an import, then you can get the number of rows from the log. A transportable tablespace will not mount if there is something wrong.  
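For the stable prior-year partitions, something like this is a sketch of that (exact only if stats were gathered after the last change to each partition):

    select partition_name, num_rows, last_analyzed
      from user_tab_partitions
     where table_name = 'TEST_DATA'
     order by partition_position;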

Liz  

Liz Reen
CPB Database Administration  

Dev Servers: Oracle | OraaS | Golden Gate | Sybase | MS-SQL | MSaaS | Neo4j | HBase - Prod Servers: Golden Gate | Neo4j | HBase - Team Email: *CPB Database - Hotline: 1 718 248 1416  

ServiceNow For: Oracle | OraaS | Golden Gate : ICG NA EX US CPB ORACLE APPLICATION DBAS – For Sybase | MS-SQL | Neo4j | HBase: ICG NA EX CPB SYBASE APPLICATION DBAS    

From: [External] Ashoke Mandal <ramukam1983_at_gmail.com> Sent: Monday, October 5, 2020 3:14 PM
To: Reen, Elizabeth [ICG-IT]
Cc: andysayer_at_gmail.com; ahmed.fikri_at_t-online.de; list, oracle Subject: Re: Fastest way to count exact number of rows in a very large table  

Hello Ahmed/Andy/Jackson/Mark/Gogala/Liz, Thanks for your response to my posting. Even though my question was how to improve the query time of a query to check the row count in a big table, you have also brought up many good points related to cross-platform migration.  

Here is some information regarding my table:

We have used transportable tablespace for data migration.

The primary key consists of three columns (UT_ID, UT_SEQ, TEST_DATE).

This table is partitioned by date and has one partition for every month. So there are 12 partitions for every calendar year, and the table has data for the last 20 years.

After adding a parallel hint as described below, the query time went down from 2.2 hours to 42 min. Let me know if you have any more recommendations to improve the query time for this select statement.  

select /*+ parallel */ to_char(count(*), '999,999,999,999') from test_data;  
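If the default DOP comes out conservative, a variant that names the table and pins the degree may be worth a try (the 32 here is just an illustrative guess; the useful ceiling is your IO bandwidth):

select /*+ full(t) parallel(t 32) */ to_char(count(*), '999,999,999,999') from test_data t;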

Ashoke  

On Mon, Oct 5, 2020 at 10:40 AM Reen, Elizabeth <elizabeth.reen_at_citi.com> wrote:

              We just completed such a transition. We kept the Oracle version the same so we could see the impact of Linux. Transportable tablespaces was how we did it. We were able to move a 17 terabyte database in under 10 hours.    

Liz    

From: [External] oracle-l-bounce_at_freelists.org <oracle-l-bounce_at_freelists.org> On Behalf Of [External] Andy Sayer Sent: Friday, October 2, 2020 3:09 PM
To: ahmed.fikri_at_t-online.de Cc: list, oracle; ramukam1983_at_gmail.com Subject: Re: Fastest way to count exact number of rows in a very large table  

Just because a table has the same number of rows, it doesn't mean it has the same data. With 108 billion rows, your data is going to be changing quickly; in order to get accurate counts at the right point in time, you're going to end up keeping your application offline for a window before and after your migration.  

What you need to do is determine where you expect data to go missing and work out a way to check.  

This will depend on how you're doing your migration. I would suggest you use Cross-Platform Transportable Tablespaces (Doc ID 371556.1), as that would allow you to do a physical import and just convert the files to the right endianness. This starts by making sure all data has been written to your data files (so they can be made read only on the source system). As you're working with the physical data files rather than the logical data (rows in tables), the only way you're going to lose rows is by corrupting your files. You can check for corruption using RMAN once you've imported the converted files. No need to count all your rows, and no need to hope that that's all you need to compare.  
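A minimal sketch of that check (the tablespace name is made up; run it against each transported tablespace, then inspect the corruption view):

RMAN> validate check logical tablespace test_data_ts;

SQL> select * from v$database_block_corruption;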

Hope that helps,

Andy      

On Fri, 2 Oct 2020 at 19:38, ahmed.fikri_at_t-online.de <ahmed.fikri_at_t-online.de> wrote:

Hi Ashoke,      

could you send the execution plan of the query too? I think there is no general approach for that; it depends on several factors: whether the table has indexes (normal/bitmap) and, if it does, the size of the table compared to the existing indexes... But generally parallel processing should help.      

Best regards  

Ahmed              

-----Original Message-----

Subject: Fastest way to count exact number of rows in a very large table  

Date: 2020-10-02T19:45:19+0200  

From: "Ashoke Mandal" <ramukam1983_at_gmail.com>  

To: "ORACLE-L" <oracle-l_at_freelists.org>                

Dear All,

I have a table with 108 billion rows and migrating this database from Oracle 11g on Solaris to Oracle 12c on Linux.      

After the migration I need to compare the row count of this table in both the source DB and the destination DB. It takes almost two hours to get the row count from this table.  

SQL> select to_char(count(*), '999,999,999,999') from test_data;

TO_CHAR(COUNT(*)
----------------
 108,424,262,144

Elapsed: 02:22:46.18      

Could you please suggest some tips to get the row count faster, so that it reduces the cut-over downtime?      

Thanks,  

Ashoke          

--

Mladen Gogala
Database Consultant
Tel: (347) 321-1217



--

http://www.freelists.org/webpage/oracle-l