Remove duplicates rows with SQL

2020-01-14 #sql

Last week I made a small update error on my application and I ended up with duplicates values in a table. Of course, this would not have happened if I had a unique key, but as I check before inserting, I thought I was safe.

the-lady-from-shangai
The Lady from Shanghai - Rita Hayworth

Unfortunately, as I couldn’t delete everything and just start updating data again, I had to figure out how to delete duplicates rows.

As a first step, I run a simple query to find out how much I was in trouble.

SELECT Place_ID, Event_ID, StartDate, COUNT(*)
FROM   Showings
GROUP BY Place_ID, Event_ID, StartDate
HAVING COUNT(*) > 1

Good news first: there are no triplets :)

Less good news: I have more than a thousand rows to delete. So no way to do this by running one request after the other…

Good thing: since my table has a primary key, I can identify duplicate data:

SELECT Place_ID, Event_ID, StartDate, MAX(Showing_ID) AS ID
FROM   Showings
GROUP BY Place_ID, Event_ID, StartDate
HAVING COUNT(*) > 1

This way, I find the IDs of all the rows added when there was already a record with the same data (Place_ID, Event_ID and StartDate). I only have to delete these useless values (since the others were there first) :

DELETE
FROM   Showings
WHERE  Showing_ID IN (
                       SELECT MAX(Showing_ID)
                       FROM   Showings
                       GROUP BY Place_ID, Event_ID, StartDate
                       HAVING COUNT(*) > 1
                      )

Sometimes, IT is not that complicated.

Version en français : Supprimer des doublons en SQL.