Retrieve Rows Set by Set from the Database—Without Duplicates

Since the beginning of the year I’ve received multiple emails from DevX readers regarding a discussion titled “How can I retrieve set by set records from a table?” on the DevX Oracle discussion group. The discussion revolves around a reader who wants to retrieve rows in sets of 10, where each new set must exclude previously obtained rows (i.e., the first time the reader gets rows 1 to 10; the second time, rows 11 to 20 without getting rows 1 to 10 again; the next time, rows 21 to 30; and so on).

Reader interest has been so high, and the number of erroneous approaches so prevalent, that I have decided to dedicate this 10-Minute Solution to describing a solution to the reader’s request.

How do I retrieve rows set by set from a database without getting previously obtained rows in subsequent sets?

Use Oracle’s package structure as a holder for session-persistent data in combination with the new Oracle 8i BULK COLLECT feature and LIMIT keyword.

Use the EMP Table
In this 10-Minute Solution, I am going to use the EMP table in the query examples. This table belongs to the SCOTT schema of the Oracle starter database (provided with Oracle database server versions 7.3 to 9i).

Here is the table’s structure:

Name         Null?     Type
-----------  --------  ------------
EMPNO        NOT NULL  NUMBER(4)
ENAME                  VARCHAR2(10)
JOB                    VARCHAR2(9)
MGR                    NUMBER(4)
HIREDATE               DATE
SAL                    NUMBER(7,2)
COMM                   NUMBER(7,2)
DEPTNO                 NUMBER(2)

Run the following query to get a look at the data in the EMP table:

select Empno, Ename, Job, Mgr, Hiredate, Sal
from   Emp;

The following listing shows the output of the above code:

 EMPNO ENAME      JOB              MGR HIREDATE         SAL
------ ---------- --------- ---------- --------- ----------
  7369 SMITH      CLERK           7902 17-DEC-80        800
  7499 ALLEN      SALESMAN        7698 20-FEB-81       1600
  7521 WARD       SALESMAN        7698 22-FEB-81       1250
  7566 JONES      MANAGER         7839 02-APR-81       2975
  7654 MARTIN     SALESMAN        7698 28-SEP-81       1250
  7698 BLAKE      MANAGER         7839 01-MAY-81       2850
  7782 CLARK      MANAGER         7839 09-JUN-81       2450
  7788 SCOTT      ANALYST         7566 19-APR-87       3000
  7839 KING       PRESIDENT            17-NOV-81       5000
  7844 TURNER     SALESMAN        7698 08-SEP-81       1500
  7876 ADAMS      CLERK           7788 23-MAY-87       1100
  7900 JAMES      CLERK           7698 03-DEC-81        950
  7902 FORD       ANALYST         7566 03-DEC-81       3000
  7934 MILLER     CLERK           7782 23-JAN-82       1300

14 rows selected.

The query returns only 14 rows, but that is enough if you want to retrieve records in sets of three (i.e., if you choose the value of N = 3). For all the code examples, you will retrieve the names and salaries (ENAME and SAL columns) of employees whose salaries are greater than or equal to $750.

The following query returns all the results in one batch:

select Ename,
       Sal
from   Emp
where  Sal >= 750;

The following listing shows the output from the above code:

ENAME             SAL
---------- ----------
SMITH             800
ALLEN            1600
WARD             1250
JONES            2975
MARTIN           1250
BLAKE            2850
CLARK            2450
SCOTT            3000
KING             5000
TURNER           1500
ADAMS            1100
JAMES             950
FORD             3000
MILLER           1300

14 rows selected.

The query ran almost instantly and the results look great. Again, you have only 14 rows.

Erroneous Solutions: What You Should Not Do and Why
A common mistake among the erroneous solutions for retrieving rows set by set is trying to utilize the Oracle pseudocolumn ROWNUM. For each row returned by a query, the ROWNUM pseudocolumn returns a number that indicates the order in which Oracle selects the row (or set of joined rows) from a table. The first row selected has a ROWNUM of 1, the second has 2, and so on.

The first erroneous solution posted in the discussion used the following syntax:

select Ename, Sal
from   Emp
where  sal    >= 750
and    rownum >= 1
and    rownum <= 3;

The following listing shows the output:

ENAME             SAL
---------- ----------
SMITH             800
ALLEN            1600
WARD             1250

The obvious question is how do you then get the next N (three) rows? The author suggested that you just need to keep adding N to the lower and upper boundaries, like this:

select Ename, Sal
from   Emp
where  sal    >= 750
and    rownum >= 4
and    rownum <= 6;

If the author had run his suggested solution even once, he would have noticed that such a query returns nothing. The reason is right in ROWNUM’s definition: ROWNUM is assigned to a row only as the row passes the WHERE clause, so the first row that satisfies the other conditions always gets ROWNUM 1. Since 1 fails the condition rownum >= 4, that row is rejected; the next candidate again gets ROWNUM 1, and so on. No row can ever reach ROWNUM 4, so the query returns no rows.
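For the record, ROWNUM ranges can be made to work by wrapping the query in a subquery, so that ROWNUM is materialized before the outer filter applies. This pattern was not part of the original discussion; the alias rn is mine:

select Ename, Sal
from (
   select Ename, Sal, rownum as rn   -- rownum is frozen inside the subquery
   from   Emp
   where  Sal >= 750
)
where rn >= 4
and   rn <= 6;

Note, however, that this approach still re-runs the whole inner query for every batch, so it shares the inefficiency problems discussed in the next paragraphs.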

Solution 2 offered a slightly more complex syntax that aimed to correct the ROWNUM problem by moving ROWNUM filtering from the WHERE clause to the HAVING clause:

select rownum,
       max(Ename) as Ename,
       max(Sal) as Sal
from   Emp
where  sal >= 750
group by rownum
having rownum >= 1
and    rownum <= 3;

The following listing shows the output:

ROWNUM ENAME             SAL
------ ---------- ----------
     1 SMITH             800
     2 ALLEN            1600
     3 WARD             1250

To get the next set, the solution called for the user to run the following query:

select rownum,
       max(Ename) as Ename,
       max(Sal) as Sal
from   Emp
where  sal >= 750
group by rownum
having rownum >= 4
and    rownum <= 6;

The following listing shows the output of the above code:

ROWNUM ENAME             SAL
------ ---------- ----------
     4 JONES            2975
     5 MARTIN           1250
     6 BLAKE            2850

These results look much better than those for Solution 1, but they are far from perfect. Why?

First of all, this solution uses a GROUP BY clause just so the SQL engine will accept the HAVING clause syntax. But using GROUP BY necessitates wrapping each of the columns in the select list in the MAX() function. That is totally unnecessary and inefficient because it leads to an additional SORT GROUP BY operation. Most importantly, filtering in the HAVING clause applies only after all of the matching rows are retrieved and sorted. So if 100,000 rows match the criteria in the WHERE clause, every time you ask for the next 10 rows, the database will retrieve and sort all 100,000 and then filter them by row number. Extremely inefficient!

In addition, the solution assumes that the database will return rows in the same order for the same SQL query every time. That’s a dangerous assumption! The user cannot guarantee that the database will retrieve rows in the same sequence unless it is enforced by the ORDER BY clause. The ORDER BY clause would force sorting of the whole set of retrieved rows, which is time-consuming and may not be necessary at all. In the example of 100,000 rows, every time you ask for the next 10, the database will retrieve all 100,000 rows, sort them (SORT ORDER BY operation), and then return a batch of 10 rows. That does not sound good at all!

Sort order is not important in certain situations, and an index could be used to avoid using the ORDER BY clause. For example, if you create a composite index on the SAL and ENAME columns, you would get rows sorted by salary and then by name without using the ORDER BY clause and the SORT ORDER BY operation.

Try using an index in your queries:

create index Emp_Sal_Ename
on emp(sal, ename)
compute statistics;

Even with the same sequence enforced in the retrieved rows, this solution has a potential problem. Let’s say you’re searching multiple Web sites for a new house. After all, the housing market is still strong, unlike the NASDAQ. You set your preferences to sort houses by price and display 10 per page. While you are browsing through the first 10, the database gets updated with three new inexpensive offers. (Living in the San Francisco Bay Area has erased the word cheap from my vocabulary.) That would shift some of the houses you’re viewing down. The rows on your screen are no longer rows 1 through 10 of the result set: they now occupy positions 1, 3, 4, and 7 through 13, because the newly inserted bargains take the second, fifth, and sixth positions.

This should not be a problem at all. After all, databases have to be up-to-date. Plus, the Web site cannot lock the database while you view the data; you are one of many users and the database cannot work exclusively for you. However, what would happen when you click the Next button to get the next 10 records? You would see the first three (11, 12, and 13) repeated again. That would either imply software problems or duplicates (or dupes) in that Web site’s listings.

Packages to the Rescue
Clearly, the problem with the previously discussed solutions is that they try to retrieve results again and again from the beginning. Wouldn’t making the database remember what it had shown in previous batches be better? It would completely eliminate the inefficiency problem that unnecessary sorting causes, and it would even take care of the dupes problem.

What would enable such a mechanism? Packages. I am always surprised at how many Oracle programmers move from SQL to PL/SQL, learn how to write procedures and functions, and yet stop one step short of packages. The author of my favorite Oracle book (“Oracle PL/SQL Programming”), Steven Feuerstein, wrote “Packages are among the least understood and most underutilized features of PL/SQL. That is a shame, because the package structure is also one of the most useful constructs for building well-designed PL/SQL-based applications.”

Packages enable you to keep data in memory (for the quickest access) and share it across all your application components. The data persists for the duration of the session. Unless you want to persist data to tables, permanent or temporary, and worry about how to maintain and clean it up later, packages are exactly what you need.
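As a minimal illustration of session-persistent package state (the SessionCounter package here is hypothetical, not part of the solution that follows):

create or replace package SessionCounter as
   function NextValue return pls_integer;
end SessionCounter;
/
create or replace package body SessionCounter as
   -- Package-level variable: initialized once per session and
   -- keeps its value between calls until the session ends.
   vCounter pls_integer := 0;

   function NextValue return pls_integer is
   begin
      vCounter := vCounter + 1;
      return vCounter;
   end NextValue;
end SessionCounter;
/

Within one session, successive calls to SessionCounter.NextValue return 1, 2, 3, and so on; a new session starts over at 1. The same mechanism lets a package-level cursor remember its position between calls.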

Develop an API to Perform Retrieval of the Next N Rows
The API I propose will be based on a package with one cursor and multiple procedures to open, close, and fetch rows from the cursor. The cursor stays open throughout the session, and it remembers the fetch history because it automatically sets its pointer to the next available row: fetching the first 10 rows would move the pointer to row 11, the next 10 to row 21, etc.

The package specification consists of five procedures:

  1. OpenCursor
  2. CloseCursor
  3. RetrieveRows
  4. SetRows – to set the default value of rows to fetch
  5. SetSalary – to set the default value of the salary parameter that will be passed to the cursor
And two types:
  1. tName – PL/SQL table of Emp.Ename type
  2. tSal – PL/SQL table of Emp.Sal type
The following listing shows the GetNextSet package specification:

create or replace package GetNextSet as

   type tName is
      table of Emp.Ename%type
      index by binary_integer;

   type tSal is
      table of Emp.Sal%type
      index by binary_integer;

   procedure CloseCursor;

   procedure OpenCursor (pSalary pls_integer := null);

   procedure RetrieveRows (pRows   pls_integer := null,
                           pSalary pls_integer := null);

   procedure SetRows (pRows pls_integer);

   procedure SetSalary (pSalary pls_integer);

end GetNextSet;
/

The RetrieveRows procedure does the actual work of retrieving the next N rows. It uses the new Oracle 8i BULK COLLECT clause of the FETCH INTO command with the LIMIT keyword. The BULK COLLECT clause allows us to bulk-fetch rows from the cursor into the vName collection of employees’ names and the vSal collection of salaries, while the LIMIT keyword sets the size of the batch:

fetch csrEmployees bulk collect into vName, vSal limit vRows;

That’s it! That one line of code does all the work! Passing the batch size as a parameter to the RetrieveRows procedure allows you to change it from one fetch to the next (i.e., get the first 10 rows, then the next 25 rows, etc.).

Listing 1 shows the GetNextSet package body.
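Since Listing 1 is not reproduced here, the following is a minimal sketch of what a package body consistent with the specification and the SQL*Plus output below might look like. The internal names (csrEmployees, vName, vSal, vRows, vSalary) and the exact formatting logic are assumptions, not the author’s actual code:

create or replace package body GetNextSet as

   -- The cursor can be satisfied entirely from the Emp_Sal_Ename index,
   -- which returns rows ordered by salary, then by name.
   cursor csrEmployees (pSalary pls_integer) is
      select Ename, Sal
      from   Emp
      where  Sal >= pSalary;

   vName   tName;                 -- collections filled by BULK COLLECT
   vSal    tSal;
   vRows   pls_integer := 3;      -- assumed default batch size
   vSalary pls_integer := 0;      -- assumed default salary filter

   procedure OpenCursor (pSalary pls_integer := null) is
   begin
      if not csrEmployees%isopen then
         dbms_output.put_line('Opening the cursor.');
         open csrEmployees(nvl(pSalary, vSalary));
      end if;
   end OpenCursor;

   procedure CloseCursor is
   begin
      if csrEmployees%isopen then
         close csrEmployees;
      end if;
   end CloseCursor;

   procedure RetrieveRows (pRows   pls_integer := null,
                           pSalary pls_integer := null) is
   begin
      OpenCursor(pSalary);
      -- The one line that does all the work:
      fetch csrEmployees bulk collect into vName, vSal
         limit nvl(pRows, vRows);
      for i in 1 .. vName.count loop
         dbms_output.put_line(rpad(vName(i), 15) || lpad(vSal(i), 10));
      end loop;
      -- %notfound is true when the last fetch returned fewer than LIMIT rows
      if csrEmployees%notfound then
         dbms_output.put_line('All rows retrieved. Closing the cursor.');
         CloseCursor;
      end if;
   end RetrieveRows;

   procedure SetRows (pRows pls_integer) is
   begin
      vRows := pRows;
   end SetRows;

   procedure SetSalary (pSalary pls_integer) is
   begin
      vSalary := pSalary;
   end SetSalary;

end GetNextSet;
/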

Use the GetNextSet Package
You use the GetNextSet package by calling its procedures from SQL*Plus. First, you need to enable screen output by setting the SERVEROUTPUT value to ON:

set serveroutput on

Now, start retrieving rows of all employees with salaries greater than or equal to 800 in chunks of three (remember, you have 14 total rows in the Emp table):

begin
   GetNextSet.RetrieveRows(3, 800);
end;
/

Here is the output from the above code:

Opening the cursor.
SMITH                 800
JAMES                 950
ADAMS                1100

The first three rows come back sorted by salary and then by name, because the cursor definition uses the Emp_Sal_Ename index as you planned. Now retrieve the next set:

begin
   GetNextSet.RetrieveRows;
end;
/

Here is the output of the above code:

MARTIN               1250
WARD                 1250
MILLER               1300

Only 14 rows are returned and you’re probably getting a little impatient. Change the size of the set from three to six:

begin
   GetNextSet.RetrieveRows(6);
end;
/

Now you’re rolling:

TURNER               1500
ALLEN                1600
CLARK                2450
BLAKE                2850
JONES                2975
FORD                 3000

Now, perform the last fetch:

begin
   GetNextSet.RetrieveRows;
end;
/

Here are the results:

SCOTT                3000
KING                 5000
All rows retrieved. Closing the cursor.

To try out other procedures in the GetNextSet package, execute the following PL/SQL block, which changes your default settings for set size to five and for salary to 2000:

begin
   GetNextSet.SetRows(5);
   GetNextSet.SetSalary(2000);
end;
/

Now you will start retrieving all high-salaried employees in chunks of five:

begin
   GetNextSet.RetrieveRows;
end;
/

Here is the output:

Opening the cursor.
CLARK                2450
BLAKE                2850
JONES                2975
FORD                 3000
SCOTT                3000

The next fetch:

begin
   GetNextSet.RetrieveRows;
end;
/

appears to be your last one:

KING                 5000
All rows retrieved. Closing the cursor.

Looks good and runs instantly! Of course, the table is tiny: only 14 rows. What would happen in a table with hundreds of thousands of rows? I conducted simple performance testing to find out.

As I mentioned previously, the packages solution eliminates the need to retrieve or sort rows again for subsequent requests. The cursor keeps its pointer on the next available row, so you get just the next N rows from the cursor every time. Easy and simple!

Performance Testing Results
I populated a table with 200,000 rows and retrieved rows in chunks of 10, with a total of 100,000 matching my filter criteria. Opening the cursor took 0.2 seconds, and fetching the next 10 rows took 0.15 seconds, regardless of whether it was the first 10 rows or a set in the middle of the 100,000 rows. I consider these good results, particularly in light of Solution #2 from the Erroneous Solutions section. Executing Solution #2 code got slower with each subsequent request, and it became really painful as I got to the middle of the 100,000 rows: it took 15-17 seconds, which is completely unacceptable for retrieving just 10 rows.

What to Improve If You Are on Oracle 9i
Oracle 8i imposes a restriction on the usage of the BULK COLLECT clause: the target cannot be a collection of records; it has to be separate collections, one per column. That’s why I declared two types (tName and tSal) and initialized two collections (vName and vSal). Oracle lifted this restriction in 9i: you just declare one record-based type and initialize a single collection of that type.
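On 9i, a sketch of the record-based version might look like this (tEmpRec and vEmps are illustrative names):

declare
   -- One collection of records replaces the per-column collections
   type tEmpRec is table of Emp%rowtype
      index by binary_integer;
   vEmps tEmpRec;

   cursor csrEmployees is
      select * from Emp where Sal >= 750;
begin
   open csrEmployees;
   -- Bulk-fetching into a collection of records is legal in 9i
   fetch csrEmployees bulk collect into vEmps limit 3;
   close csrEmployees;
end;
/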

Two new PL/SQL features in Oracle 8i that I like are Native Dynamic SQL (see New PL/SQL Features in Oracle 8i: Part I) and Bulk Binds (see New PL/SQL Features in Oracle 8i: Part II). The problem is that you cannot use them together in 8i: if you open a cursor with Native Dynamic SQL, you cannot perform bulk-fetching from it. That’s why I couldn’t use Native Dynamic SQL and had to hardcode the cursor in my package body. The current code works only against the Emp table and retrieves only two fields, name and salary.

But if you are on Oracle 9i, you could make the cursor definition more generic by constructing a SQL string from passed-in parameters: table name(s), list of fields, WHERE or ORDER BY clauses; all of it could be parameterized. Once your package is more generic, you can use it in many places. (Just to clarify: building a cursor’s SQL dynamically can be done in Oracle 8i as well, but then you cannot use the BULK COLLECT feature, and you have to settle for fetching rows one at a time, which is not nearly as efficient as bulk-fetching.)
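A rough sketch of such a generic version on 9i, using a cursor variable opened with Native Dynamic SQL. The parameter names are assumptions, and vCur, vName, vSal, and vRows must be declared at the package level so the cursor stays open between calls:

-- fragments of a more generic package body (Oracle 9i)
type tCur is ref cursor;
vCur tCur;

procedure OpenCursor (pTable varchar2,
                      pWhere varchar2 := null) is
begin
   -- Native Dynamic SQL: the cursor's SQL is built at run time
   open vCur for
      'select Ename, Sal from ' || pTable ||
      case when pWhere is not null
           then ' where ' || pWhere
           else null
      end;
end OpenCursor;

procedure RetrieveRows (pRows pls_integer := null) is
begin
   -- In 9i, bulk-fetching from a ref cursor is allowed
   fetch vCur bulk collect into vName, vSal limit nvl(pRows, vRows);
end RetrieveRows;

In real code you would also validate pTable and pWhere before concatenating them into the SQL string, to guard against SQL injection.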

The good thing about packages is that the existing package structure allows you to add more procedures to support additional functionality (e.g., skipping forward or backward, getting the total number of sets (pages) for a specified set size, and more). That’s just another example of why packages are the way to go in PL/SQL development.
