Top-n query tuning in huge OLTP applications
Top-n Query is one of the more advanced SQL problems. How does one retrieve N first (or least) rows from a record set? For example, how does one find the top five highest-paid employees in a given department?
People with no experience in such situation usually try to solve the problem with a query like this:
SELECT e.* FROM emp e WHERE ROWNUM < 6 ORDER BY sal DESC
Of course, this is the incorrect solution. The 'where rownum<6' condition is executed BEFORE all the rows are sorted. This means that instead of the top five highest-paid employees, you get a random set of five employees sorted by their salaries. Below is the correct solution:
SELECT * FROM (SELECT ROWNUM, e.* FROM emp e ORDER BY sal DESC) WHERE ROWNUM < 6
Similar queries are often used in various reports and sometimes have much more complicated structure - for instance:
SELECT ename, sal, dname, manager FROM (SELECT ROWNUM, e.ename, e.sal, d.dname, e2.ename AS manager FROM emp e, dept d, emp e2 WHERE e.deptno = d.deptno AND e.mgrno = e2.empno AND e.job = 'SALESMAN' ORDER BY e.sal DESC) WHERE ROWNUM < 6;
In this case, the Oracle optimizer has to create a whole Cartesian product described by the inline view, then sort it out and finally retrieve the first five rows. This query is executed very quickly on SCOTT's tables, but the problem becomes complicated if the table EMP consists of one milion rows. The problem also becomes even more complicated if (for example) besides just the name of employee and his/her manager, you also need to get the employee's address, phone number, shop's address, etc. All these data must be read from another tables and joined with the inline view before sorting. It may cause the query to execute very slowly.
Is there any other way to do it?
Yes, there is. Just modify the starting query:
SELECT e.ename, e.sal, d.dname, e2.ename FROM (SELECT * FROM (SELECT empno, ename, sal, mgrno, deptno FROM emp WHERE job = 'SALESMAN' ORDER BY emp.sal DESC) WHERE ROWNUM < 6) e, dept d, emp e2 WHERE e.deptno = d.deptno AND e.mgrno = e2.empno ORDER BY sal DESC
At first glance it looks like an unnecessary complication. But in fact the structure of the query forces the optimizer to find the top five highest-paid employees, retrieve their data, and just after that join them with other tables. The benefit is obvious: the join operation is executed for only five rows, instead of hundreds of thousands. The proper use of top-n queries may also increase efficiency when the application retrieves many rows for the given conditions, but the user is actually interested only in a few of them. For example, the application user might need to look through the invoices (with all the details, products, prices, etc.) paid by a selected customer. The user is usually more interested in the most recent invoices, so the data must be sorted in descending order by invoice date - for example:
SELECT c.custno, custname, i.invno, invdate, detno, p.prodno, proddesc, detamount, prodprice FROM customers c, invoices i, details d, products p WHERE c.custno = i.custno AND i.invno = d.invno AND d.prodno = p.prodno AND c.custno = 123 ORDER BY i.invdate DESC;
The real application using a query like this worked very fast at first, but after several months it slowed down, and users began to complain. Searching for the data of 'heavy users' with thousands of invoices took a lot of time and became very irritating. After a few meetings with the application users it was found that, in 95% of the cases, information about the last 20 invoices was sufficient. This data is easy to retrieve by the query:
SELECT c.custno, custname, i.invno, invdate, detno, p.prodno, proddesc, detamount, prodprice FROM (SELECT * FROM (SELECT custno, invno, invdate FROM invoices WHERE custno = 123 ORDER BY invdate DESC) WHERE ROWNUM < 21) i, details d, products p, customers c WHERE c.custno = i.custno AND i.invno = d.invno AND p.prodno = d.prodno;
This query sorts data only in the INVOICES table, retrieves just 20 rows, and joins them with other tables. Thanks to this, the query is always executed in less than 0,1 seconds. Of course, the application user has also the opportunity to search more than the last 20
invoices. In such a case, the user has to check the proper check box on the form. This causes the query to execute in the original, full (and slow) version - but at least the user is aware of this.
The examples described above may seem very trivial to many users, but I could give many similar cases: sorting customers by the amount of bought goods, by the number of executed money transfers, by the value
of paid bills; printing 20 operations per bank statement, and so on. In all these situations, moving the conditions and "rownum" clause to the inner query may increase efficiency significantly. For these who are not yet convinced, here are two simple scripts that illustrate the described problems. Both scripts were tested on Oracle9i Enterprise Edition Release 9.2.0.6.0 on HP-UX.
Example 1: Setup
CREATE TABLE emp ( empno NUMBER(10), ename VARCHAR2(100), deptno NUMBER(10), job VARCHAR2(20), sal NUMBER(10,2), mgrno NUMBER(10)); CREATE TABLE dept ( deptno NUMBER(10), dname VARCHAR2(100)); CREATE SEQUENCE emp_seq START WITH 1; CREATE SEQUENCE dept_seq START WITH 1; CREATE INDEX emp_idx1 ON emp ( empno); CREATE INDEX emp_idx2 ON emp ( job, sal); CREATE INDEX dept_idx1 ON dept ( deptno); CREATE OR REPLACE PROCEDURE test1 AS salary NUMBER(10,2); manager NUMBER(10,2); BEGIN dbms_random.Initialize(1000); -- departments FOR i IN 1.. 100 LOOP INSERT INTO dept (deptno,dname ) VALUES (dept_seq.nextval, 'DEPT' || To_char(dept_seq.currval)); END LOOP; -- employees in every department FOR j IN 1.. 100 LOOP FOR i IN 1.. 1000 LOOP salary := Abs(Mod(dbms_random.random, 100000)); manager := Abs(Mod(dbms_random.random, 100000)); INSERT INTO emp (empno,ename,deptno,job,sal,mgrno ) VALUES (emp_seq.nextval, 'SCOTT' || To_char(emp_seq.nextval), j, 'CLERK', salary, manager); INSERT INTO emp (empno,ename,deptno,job,sal,mgrno ) VALUES (emp_seq.nextval, 'SCOTT' || To_char(emp_seq.nextval), j, 'ANALYST', salary, manager); INSERT INTO emp (empno,ename,deptno,job,sal,mgrno ) VALUES (emp_seq.nextval, 'SCOTT' || To_char(emp_seq.nextval), j, 'CLERK', salary, manager); INSERT INTO emp (empno,ename,deptno,job,sal,mgrno ) VALUES (emp_seq.nextval, 'SCOTT' || To_char(emp_seq.nextval), j, 'SALESMAN', salary, manager); INSERT INTO emp (empno,ename,deptno,job,sal,mgrno ) VALUES (emp_seq.nextval, 'SCOTT' || To_char(emp_seq.nextval), j, 'MANAGER', salary, manager); END LOOP; COMMIT; END LOOP; END; / EXEC test1; ANALYZE TABLE dept COMPUTE STATISTICS; ANALYZE TABLE emp COMPUTE STATISTICS;
Query 1: Slow
SELECT ename,
sal,
dname,
manager
FROM (SELECT ROWNUM,
e.ename,
e.sal,
d.dname,
e2.ename AS manager
FROM emp e,
dept d,
emp e2
WHERE e.deptno = d.deptno
AND e.mgrno = e2.empno
AND e.job = 'SALESMAN'
ORDER BY e.sal DESC)
WHERE ROWNUM < 6;
Time: 6,125 s
Query 2: Fast
SELECT e.ename,
e.sal,
d.dname,
e2.ename
FROM (SELECT *
FROM (SELECT empno,
ename,
sal,
mgrno,
deptno
FROM emp
WHERE job = 'SALESMAN'
ORDER BY emp.sal DESC)
WHERE ROWNUM < 6) e,
dept d,
emp e2
WHERE e.deptno = d.deptno
AND e.mgrno = e2.empno
ORDER BY sal DESC
Time: 0,65 s
Example 2: Setup
CREATE TABLE customers ( custno NUMBER(10), custname VARCHAR2(20)); CREATE TABLE invoices ( invno NUMBER(10), custno NUMBER(10), invdate DATE); CREATE TABLE details ( detno NUMBER(10), invno NUMBER(10), prodno NUMBER(10), detamount NUMBER(10)); CREATE TABLE products ( prodno NUMBER(10), proddesc VARCHAR2(20), prodprice NUMBER(10,2)); CREATE SEQUENCE cust_seq START WITH 1; CREATE SEQUENCE inv_seq START WITH 1; CREATE SEQUENCE det_seq START WITH 1; CREATE SEQUENCE prod_seq START WITH 1; CREATE INDEX cust_idx1 ON customers ( custno); CREATE INDEX inv_idx1 ON invoices ( invno); CREATE INDEX inv_idx2 ON invoices ( custno, invdate); CREATE INDEX det_idx1 ON details ( detno); CREATE INDEX det_idx2 ON details ( invno); CREATE INDEX prod_idx1 ON products ( prodno); CREATE OR REPLACE PROCEDURE test2 AS customer NUMBER(10); product NUMBER(10); BEGIN dbms_random.Initialize(1000); -- customers FOR i IN 1.. 1000 LOOP INSERT INTO customers VALUES (cust_seq.nextval, 'CUSTOMER ' || To_char(cust_seq.currval)); END LOOP; -- products FOR i IN 1.. 1000 LOOP INSERT INTO products VALUES (prod_seq.nextval, 'PRODUCT' || To_char(prod_seq.currval), prod_seq.currval); END LOOP; -- invoices FOR i IN 1.. 100000 LOOP customer := Abs(Mod(dbms_random.random, 1000)); INSERT INTO invoices VALUES (inv_seq.nextval, customer, SYSDATE + Abs(Mod(inv_seq.currval, 100))); -- details FOR j IN 1.. 10 LOOP product := Abs(Mod(dbms_random.random, 1000)); INSERT INTO details VALUES (det_seq.nextval, inv_seq.currval, product, Mod(j, 5)); END LOOP; COMMIT; END LOOP; END; / EXEC test2; ANALYZE TABLE customers COMPUTE STATISTICS; ANALYZE TABLE products COMPUTE STATISTICS; ANALYZE TABLE invoices COMPUTE STATISTICS; ANALYZE TABLE details COMPUTE STATISTICS;
Query 1: Slow
SELECT c.custno,
custname,
i.invno,
invdate,
detno,
p.prodno,
proddesc,
detamount,
prodprice
FROM customers c,
invoices i,
details d,
products p
WHERE c.custno = i.custno
AND i.invno = d.invno
AND d.prodno = p.prodno
AND c.custno = 123
ORDER BY i.invdate DESC;
Time: 0,578 s
Query 2: Fast
SELECT c.custno,
custname,
i.invno,
invdate,
detno,
p.prodno,
proddesc,
detamount,
prodprice
FROM (SELECT *
FROM (SELECT custno,
invno,
invdate
FROM invoices
WHERE custno = 123
ORDER BY invdate DESC)
WHERE ROWNUM < 21) i,
details d,
products p,
customers c
WHERE c.custno = i.custno
AND i.invno = d.invno
AND p.prodno = d.prodno;
Time: 0,093 s
- Marcin Miller's blog
- Log in to post comments
Comments
What's the difference
Why not use standard SQL's "ROW_NUMBER() OVER" window-function when Oracle supports it?
Finally: Beware of tie conditions. In some situations - probably more than one may think at first - the "RANK() OVER" function is what's really needed. I've written about such things at http://troels.arvin.dk/db/rdbms/ (section "Limiting result sets").
Hi, Good up to very good
Hi,
Good up to very good article.
Since I was involved in performance issues regarding top n queries recently I would like to remark
1. The importance of an index on the sort column created with desc attribute.
(create index emp_idx on emp (sal desc) )
2. The importance of optimizer_mode first_rows_n
3. As suggested here above the rank and the dense rank functions can be used as well, execution plan is often better with a WINDOW SORT PUSHED RANK in the explain plan.
Recently I faced this query, table t1 contained 1.000.000 rows
select * from (select * from t1 order by col1 desc, col2 desc) where rownum < 21
A full table scan on t1 was shown with set autotrace on
I created an index
create index t1_idx1 on t1 (col1 desc, col2 desc)
I gathered statistics with dbms_stats
I ran the same query with optimizer_mode = first_rows_10
I reduces IO with factor 15.000 !
Regards
Guy Lambregts