Deleting Duplicate rows the smart way using OVER PARTITION in MS SQL
Tags: General, Coldfusion, ColdFusion 9, Database
A little while ago I had read an interesting blog about deleting duplicate records by Kumar Shah, now he used CTE to do his magic or Common Table Expressions, but failed to understand or deliver the real magic. As you can see by the comments on his blog, others don't seem to fully understand this magic either.
The other thing to point out before I go into depth, I started using CTE's a lot more and the OVER PARTITION and have to say it is really a great thing that has been added to MS SQL 2005.
So why is so great?
Well before we answer that question, lets first build a table and populate it with some data so you can get a feel for it yourself.
( Emp_Name varchar(10)
, Company varchar(15)
, Join_Date datetime
, Resigned_Date datetime
)
Now lets put some data into the table where we have at least one duplicate and one triplicate, so that we can show how we can achieve more than you would normally, than not using over partition.
values ('John', 'Software', '20060101', '20061231')
,('John', 'Software', '20060101', '20061231')
,('John', 'Software', '20060101', '20061231')
,('John', 'SuperSoft', '20070101', '20071231')
,('John', 'UltraSoft', '20070201', '20080131')
,('John', 'ImproSoft', '20080201', '20081231')
,('John', 'ImproSoft', '20080201', '20081231')
,('Mary', 'Software', '20060101', '20081231')
,('Mary', 'SuperSoft', '20090101', '20090531')
,('Mary', 'SuperSoft', '20090101', '20090531')
,('Mary', 'UltraSoft', '20090601', '20100531')
,('Mary', 'UltraSoft', '20090601', '20100531')
Now although there are 12 rows that will be inserted into the table, only 7 of these records are actually considered to be good records. So if we where to run a query the old fashioned was we would have to write the query like this.
group by Emp_Name, Company, Join_Date, Resigned_Date
If we ran this sql we would see that we get 12 results as expected, and we can see which ones has more than one occurrence. But how do we decide on which occurrence we should delete?
One could assume that the first should be the one we leave and we remove the rest, but your rules could be different in what ever application you are writing. So for this example we shall take into consideration that we want to keep the first record, and everything else should be deleted.
Now you will also notice that in this example there is also no PK, this was also by design to show that we can do further magic to do what we need to do. But as always one should always backup their database before playing around with queries like this, just in case.
So lets modify the first select query to now look like this.
,Company
,Join_Date
,Resigned_Date
,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
,Resigned_Date
order by Emp_Name, Company, Join_Date
,Resigned_Date) RowNumber
from Emp_Details
Now this is the interesting bit, because we have now added a row number to our data. And we can sort on any of the criteria that we so desire, and we can see how this looks below.
And as you can see that the row is reset for each duplicate telling us which records have more than one row, and we can now take this an remove the data with a query with something like this.
from
(select Emp_Name, Company, Join_Date, Resigned_Date
,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date
,Resigned_Date
order by Emp_Name, Company, Join_Date
,Resigned_Date) RowNumber
from Emp_Details) a
where a.RowNumber > 1
Which is basically telling us that we are going to remove anything from the database, where the RowNumber is greater than 1. But you will also notice something else here, we no longer have any group by or having clauses to worry about. And that has got to be one of the best things that I have seen added in a long time, and granted there is a lot more to MS SQL 2005 / 2008 that I am still discovering.
So in conclusion why would you use this method over other previous methods, like using the group by clause? The answer is efficiency, as the table is only scanned once, we only have one derived table as we don't need to join on anything and last but not least we are using less CPU and Memory utilization.
There are no comments for this entry.



TweetBacks