Remove duplicate rows with Oracle SQL ROW_NUMBER – intelligent and efficient data cleansing
I am quite sure I have addressed this before. But I could not find it readily and I needed it today. So here it goes:
I am loading data from some external source. Well actually, from several external sources. The loading of the data is done in a better safe than sorry way: it is very important that all records come in. At the risk of loading duplicates – we will get them all!
However, once they are all in, there are more in than we want: I have loaded quite a few duplicate records. That will hurt the later on, so the duplicates have to go. With Oracle SQL Analytical Functions, that is quite easy. In two steps:
1) identify the duplicates
2) remove the duplicates
The data I am dealing with is Olympic Medals – over the period 1984-2008. I have found several sources on the internet. I have written a simple Java program that uses JSoup for screenscraping and subsequently invokes a PL/SQL package to create database records.
The key table – the one that is potentially loaded with the duplicate records – is called MEDAL_SCORE. It records for a country (cty_id) how many medals of a certain metal (gold, silver, bronze; column medal) it has won during a specific edition of the Summer Olympics (oly_id).
The query to identify duplicates is fairly easy: when the partition defined by oly_id, cty_id and medal has more than one record – we have a duplicate. We need to say only once how many medals of a medal type a country has one in an Olympiad. Therefore the duplicate counting query becomes:
select count(*) from ( select id , row_number() over ( partition by oly_id, evt_id, cty_id,medal order by id) rn from medal_score ) where rn>1
When I execute this query, I discover that I am currently at 1098 duplicates – out of a total of 2544 records. Well, as long as the important ones are all there, that is no issue. Especially since removing the duplicates is as simple as:
delete from medal_score where rowid in ( select rwid from ( select rowid rwid , row_number() over ( partition by oly_id, evt_id, cty_id,medal order by id) rn from medal_score ) where rn>1 )
- Have MERGE remove records from Target that are not in the Source – Oracle 10g
- Oracle & JDBC – Passing a 'table of custom object type' to a stored procedure – implementing efficient single round trip data exchange (part two of an ADF application on a PL/SQL API)
- The Oracle 11g UNPIVOT operator – turning columns into rows
- Courtesy of Tom Kyte: Generating rows in SQL with the CUBE statement – no dummy table or table function required
- Oracle BPEL Process Manager: Most efficient way to manage dynamic process level (cross instance) data