Some 3rd party apps do not have the best schema design and further to that it is sometimes not possible to modify it w/o breaking changes to future releases of the Application/DB. This can lead to duplicate records entering the System. Below is a method to remedy that situation and delete duplicate records. Ideally, the schema get's modified also and uniqueness is enforced at the data tier.
Data Prep: Create a Table & Insert Duplicate Records (No PK; No Uniqueness)
Create Table City (InsertedDate varchar(100), CityName varchar(100), StateName varchar(100), Zip varchar(100))
Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101' Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101' Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101' Insert into City Select '2016-07-01', 'Miami', 'Florida', '33101' Insert into City Select '2016-07-01', 'Los Angeles', 'California', '90001' Insert into City Select '2016-07-01', 'Los Angeles', 'California', '90001' Insert into City Select '2016-07-01', 'Los Angeles', 'California', '90001' Insert into City Select '2016-07-01', 'Boston', 'Massachusetts', '02116' Insert into City Select '2016-07-01', 'Boston', 'Massachusetts', '02116'
Select * from dbo.City
T-SQL: Create a CTE (Common Table Expression) with Row_Number Over Partition By
For duplicate records, each gets a unique Row Number starting at 1 and increment. Then you can delete all records where RN is not equal to 1 and this removes the duplicate records in City table!!!
With CTEDeleteAllDuplicates AS (Select Row_Number() Over (Partition by InsertedDate, CityName, StateName, Zip Order By InsertedDate) as RN, InsertedDate, CityName, StateName, Zip From dbo.City )
Select * from CTEDeleteAllDuplicates --Delete from CTEDeleteAllDuplicates where RN <> 1
Re-run the whole CTE again with the Delete Statement to remove the duplicate records
--Select * from CTEDeleteAllDuplicates Delete from CTEDeleteAllDuplicates where RN <> 1
Check that all duplicates are now deleted
Select * from dbo.City
Notes:
Common Table Expression (CTE) is a temporary named result set that you can reference within a SELECT/INSERT/UPDATE/DELETE statement. Deleting from the CTE actually deletes records from the table!!! You can reference the CTE in the statement that immediately follows the CTE Definition. (Since it is the same Query) Partition by is similar to a Group By
Resources:
WITH common_table_expression (Transact-SQL)
ROW_NUMBER (Transact-SQL)
OVER Clause (Transact-SQL) ...A window function
|
|
|
|
|