Remove duplicate rows with Oracle SQL ROW_NUMBER – intelligent and efficient data cleansing

Lucas Jellema 1

I am quite sure I have addressed this before. But I could not find it readily and I needed it today. So here it goes:

I am loading data from some external source. Well actually, from several external sources. The loading of the data is done in a better safe than sorry way: it is very important that all records come in. At the risk of loading duplicates – we will get them all!

However, once they are all in, there are more in than we want: I have loaded quite a few duplicate records. That will hurt the later on, so the duplicates have to go. With Oracle SQL Analytical Functions, that is quite easy. In two steps:

1) identify the duplicates

2) remove the duplicates

The data I am dealing with is Olympic Medals – over the period 1984-2008. I have found several sources on the internet. I have written a simple Java program that uses JSoup for screenscraping and subsequently invokes a PL/SQL package to create database records.

The key table – the one that is potentially loaded with the duplicate records – is called MEDAL_SCORE. It records for a country (cty_id) how many medals of a certain metal (gold, silver, bronze; column medal) it has won during a specific edition of the Summer Olympics (oly_id).

Image

The query to identify duplicates is fairly easy: when the partition defined by oly_id, cty_id and medal has more than one record – we have a duplicate. We need to say only once how many medals of a medal type a country has one in an Olympiad. Therefore the duplicate counting query becomes:

select count(*)
from   ( select id
         ,      row_number() over ( partition by oly_id, evt_id, cty_id,medal
                                    order by id) rn
         from   medal_score
       )
where  rn>1

When I execute this query, I discover that I am currently at 1098 duplicates – out of a total of 2544 records. Well, as long as the important ones are all there, that is no issue. Especially since removing the duplicates is as simple as:

delete from medal_score
where  rowid in
( select rwid
  from ( select rowid rwid
         ,      row_number() over ( partition by oly_id, evt_id, cty_id,medal
                                    order by id) rn
         from   medal_score
       )
  where  rn>1
)

Image

One thought on “Remove duplicate rows with Oracle SQL ROW_NUMBER – intelligent and efficient data cleansing

  1. Alternative without analytic functions:
    delete from medal_score
    where rowid in (
    select rowid
    from medal_score t
    where exists (
    select 1
    from medal_score
    where oly_id = t.oly_id
    and cty_id = t.cty_id
    and medal = t.medal
    and rowid > t.rowid
    )
    );
     
     

Comments are closed.

Next Post

Screenscraping from Java using jsoup - effective data gathering from websites

Facebook0TwitterLinkedinIn a recent article I discussed screenscraping in a in hindsight fairly clumsy way (https://technology.amis.nl/blog/12786/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities). While preparing for a series of articles on data visualizations, I had need of statistics regarding the Olympic Games – more specifically: the overall medal count per country during the 2008 Bejing Olympic Games. This […]