info@techdevops.com | 437-991-3573 | Data Engineering Services
TechDevOps.com
Resources Tools
Experts in Microsoft SQL Server on Windows, Linux, Containers | Clusters, Always On, FCI | Migrations, Cloud, Performance



T-SQL Delete Duplicate Records
by BF (Principal Consultant; Architecture; Engineering)
2016-12-06









Some 3rd party apps do not have the best schema design and further to that it is sometimes not possible to modify it w/o breaking
changes to future releases of the Application/DB. This can lead to duplicate records entering the System. Below is a method to remedy
that situation and delete duplicate records. Ideally, the schema get's modified also and uniqueness is enforced at the data tier.



Data Prep: Create a Table & Insert Duplicate Records (No PK; No Uniqueness)

Create Table City (InsertedDate varchar(100), CityName varchar(100), StateName varchar(100), Zip varchar(100))

Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101'
Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101'
Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101'
Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101'
Insert into City Select '2016-07-01', 'Los Angeles', 'California', '90001'
Insert into City Select '2016-07-01', 'Los Angeles', 'California', '90001'
Insert into City Select '2016-07-01', 'Los Angeles', 'California', '90001'
Insert into City Select '2016-07-01', 'Boston', 'Massachusetts', '02116'
Insert into City Select '2016-07-01', 'Boston', 'Massachusetts', '02116'


Select * from dbo.City






T-SQL: Create a CTE (Common Table Expression) with Row_Number Over Partition By


For duplicate records, each gets a unique Row Number starting at 1 and increment. Then you can delete all records
where RN is not equal to 1 and this removes the duplicate records in City table!!!


With CTEDeleteAllDuplicates
AS
(Select
Row_Number() Over (Partition by InsertedDate, CityName, StateName, Zip Order By InsertedDate) as RN,
InsertedDate,
CityName,
StateName,
Zip
From
dbo.City
)

Select * from CTEDeleteAllDuplicates
--Delete from CTEDeleteAllDuplicates where RN <> 1







Re-run the whole CTE again with the Delete Statement to remove the duplicate records

--Select * from CTEDeleteAllDuplicates
Delete from CTEDeleteAllDuplicates where RN <> 1






Check that all duplicates are now deleted

Select * from dbo.City







Notes:

Common Table Expression (CTE) is a temporary named result set that you can reference within a SELECT/INSERT/UPDATE/DELETE statement.
Deleting from the CTE actually deletes records from the table!!!
You can reference the CTE in the statement that immediately follows the CTE Definition. (Since it is the same Query)
Partition by is similar to a Group By



Resources:

WITH common_table_expression (Transact-SQL)

ROW_NUMBER (Transact-SQL)

OVER Clause (Transact-SQL)
...A window function