Remove duplicate rows with Oracle SQL ROW_NUMBER – intelligent and efficient data cleansing

Lucas Jellema 1
0 0
Read Time:1 Minute, 57 Second

I am quite sure I have addressed this before. But I could not find it readily and I needed it today. So here it goes:

I am loading data from some external source. Well actually, from several external sources. The loading of the data is done in a better safe than sorry way: it is very important that all records come in. At the risk of loading duplicates – we will get them all!

However, once they are all in, there are more in than we want: I have loaded quite a few duplicate records. That will hurt the later on, so the duplicates have to go. With Oracle SQL Analytical Functions, that is quite easy. In two steps:

1) identify the duplicates

2) remove the duplicates

The data I am dealing with is Olympic Medals – over the period 1984-2008. I have found several sources on the internet. I have written a simple Java program that uses JSoup for screenscraping and subsequently invokes a PL/SQL package to create database records.

The key table – the one that is potentially loaded with the duplicate records – is called MEDAL_SCORE. It records for a country (cty_id) how many medals of a certain metal (gold, silver, bronze; column medal) it has won during a specific edition of the Summer Olympics (oly_id).

Image

The query to identify duplicates is fairly easy: when the partition defined by oly_id, cty_id and medal has more than one record – we have a duplicate. We need to say only once how many medals of a medal type a country has one in an Olympiad. Therefore the duplicate counting query becomes:

select count(*)
from   ( select id
         ,      row_number() over ( partition by oly_id, evt_id, cty_id,medal
                                    order by id) rn
         from   medal_score
       )
where  rn>1

When I execute this query, I discover that I am currently at 1098 duplicates – out of a total of 2544 records. Well, as long as the important ones are all there, that is no issue. Especially since removing the duplicates is as simple as:

delete from medal_score
where  rowid in
( select rwid
  from ( select rowid rwid
         ,      row_number() over ( partition by oly_id, evt_id, cty_id,medal
                                    order by id) rn
         from   medal_score
       )
  where  rn>1
)

Image

About Post Author

Lucas Jellema

Lucas Jellema, active in IT (and with Oracle) since 1994. Oracle ACE Director and Oracle Developer Champion. Solution architect and developer on diverse areas including SQL, JavaScript, Kubernetes & Docker, Machine Learning, Java, SOA and microservices, events in various shapes and forms and many other things. Author of the Oracle Press book Oracle SOA Suite 12c Handbook. Frequent presenter on user groups and community events and conferences such as JavaOne, Oracle Code, CodeOne, NLJUG JFall and Oracle OpenWorld.
Happy
Happy
0 %
Sad
Sad
0 %
Excited
Excited
0 %
Sleepy
Sleepy
0 %
Angry
Angry
0 %
Surprise
Surprise
0 %

Average Rating

5 Star
0%
4 Star
0%
3 Star
0%
2 Star
0%
1 Star
0%

One thought on “Remove duplicate rows with Oracle SQL ROW_NUMBER – intelligent and efficient data cleansing

  1. Alternative without analytic functions:
    delete from medal_score
    where rowid in (
    select rowid
    from medal_score t
    where exists (
    select 1
    from medal_score
    where oly_id = t.oly_id
    and cty_id = t.cty_id
    and medal = t.medal
    and rowid > t.rowid
    )
    );
     
     

Comments are closed.

Next Post

Screenscraping from Java using jsoup - effective data gathering from websites

In a recent article I discussed screenscraping in a in hindsight fairly clumsy way (https://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I had need of statistics regarding the Olympic Games – more specifically: the overall medal count per country during the 2008 Bejing Olympic Games. This […]
%d bloggers like this: