How to remove duplicates in SQL?

Removing duplicates in SQL is a common task that helps maintain data integrity and ensures the accuracy of your database. Duplicates can arise due to various reasons, such as data entry errors, import issues, or lack of proper constraints. This guide will walk you through identifying and removing duplicate records using different methods in SQL, complete with examples to illustrate each approach.

1. Understanding Duplicates in SQL

Duplicates in a database refer to records where certain columns have identical values across multiple rows. The definition of a duplicate depends on which columns you consider for comparison. For example, two rows might have the same Email but different EmployeeIDs, or they might have identical values across all columns except for a unique identifier.

2. Identifying Duplicates

Before removing duplicates, it's essential to identify them. Here's how you can find duplicate records based on specific columns.

Example Scenario:

Consider a table Employees with the following structure:

EmployeeID  FirstName  LastName  Email                   Department
1           John       Doe       john.doe@example.com    Sales
2           Jane       Smith     jane.smith@example.com  Marketing
3           John       Doe       john.doe@example.com    Sales
4           Alice      Johnson   alice.j@example.com     IT
5           John       Doe       john.doe@example.com    Sales

Identifying Duplicate Emails:

SELECT Email, COUNT(*)
FROM Employees
GROUP BY Email
HAVING COUNT(*) > 1;

Result:

Email                 COUNT(*)
john.doe@example.com  3

This query shows that the email john.doe@example.com appears three times in the Employees table, indicating duplicates.
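To try the identification query end to end, here is a minimal sketch that loads the sample rows into an in-memory SQLite database using Python's standard sqlite3 module. The choice of SQLite and the Occurrences alias are assumptions made for illustration; the query itself is the one shown above, plus a second query that lists the individual rows behind each duplicated email.

```python
import sqlite3

# In-memory SQLite database holding the sample Employees rows from above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, "
    "Email TEXT, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Doe", "john.doe@example.com", "Sales"),
        (2, "Jane", "Smith", "jane.smith@example.com", "Marketing"),
        (3, "John", "Doe", "john.doe@example.com", "Sales"),
        (4, "Alice", "Johnson", "alice.j@example.com", "IT"),
        (5, "John", "Doe", "john.doe@example.com", "Sales"),
    ],
)

# Group by the column that defines a duplicate and keep groups larger than one.
duplicate_emails = conn.execute(
    "SELECT Email, COUNT(*) AS Occurrences "
    "FROM Employees GROUP BY Email HAVING COUNT(*) > 1"
).fetchall()

# List the individual rows behind each duplicated email.
duplicate_rows = conn.execute(
    "SELECT EmployeeID, Email FROM Employees "
    "WHERE Email IN ("
    "  SELECT Email FROM Employees GROUP BY Email HAVING COUNT(*) > 1"
    ") ORDER BY EmployeeID"
).fetchall()

print(duplicate_emails)  # one (Email, count) tuple per duplicated email
print(duplicate_rows)    # every row that shares a duplicated email
```

Seeing the individual rows, not just the counts, makes it easier to decide which copy to keep before deleting anything.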

3. Methods to Remove Duplicates

There are several methods to remove duplicates in SQL. Below are the most common and effective approaches:

a. Using Common Table Expressions (CTEs) with ROW_NUMBER()

This method assigns a unique sequential number to each row within a partition of specified columns. Rows with a ROW_NUMBER greater than 1 are considered duplicates and can be deleted.

Steps:

  1. Use a CTE to select all records and assign row numbers partitioned by the columns that define duplicates.
  2. Delete records where the row number is greater than 1.

Example:

WITH CTE_Duplicates AS (
    SELECT
        EmployeeID, FirstName, LastName, Email, Department,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID FROM CTE_Duplicates WHERE rn > 1
);

Explanation:

  • The CTE CTE_Duplicates partitions the Employees table by Email and assigns a row number (rn) to each record within the partition.
  • The DELETE statement removes all records from Employees where the EmployeeID is in the set of duplicates (rn > 1), effectively keeping only the first occurrence.
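The CTE approach can be exercised with a short script. This is a sketch under assumptions: it uses SQLite through Python's sqlite3 module, which needs SQLite 3.25 or newer for window functions, and it trims the table to the two columns that matter for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Email TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [
        (1, "john.doe@example.com"),
        (2, "jane.smith@example.com"),
        (3, "john.doe@example.com"),
        (4, "alice.j@example.com"),
        (5, "john.doe@example.com"),
    ],
)

# Number the rows within each Email group, then delete everything after the first.
conn.execute(
    """
    WITH CTE_Duplicates AS (
        SELECT EmployeeID,
               ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
        FROM Employees
    )
    DELETE FROM Employees
    WHERE EmployeeID IN (SELECT EmployeeID FROM CTE_Duplicates WHERE rn > 1)
    """
)

remaining = conn.execute(
    "SELECT EmployeeID, Email FROM Employees ORDER BY EmployeeID"
).fetchall()
print(remaining)
```

Changing the ORDER BY inside OVER (...) changes which copy survives: ORDER BY EmployeeID DESC would keep the highest EmployeeID per email instead of the lowest.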

b. Using a Subquery with ROW_NUMBER()

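Similar to the CTE method, but implemented directly with a subquery. The DELETE ... JOIN syntax used here is accepted by SQL Server and MySQL; other systems need a different form, such as DELETE ... WHERE EmployeeID IN (subquery).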

Example:

DELETE e
FROM Employees e
INNER JOIN (
    SELECT
        EmployeeID,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
) dup ON e.EmployeeID = dup.EmployeeID
WHERE dup.rn > 1;

Explanation:

  • The subquery assigns a row number to each record partitioned by Email.
  • The INNER JOIN matches the original table with the subquery on EmployeeID.
  • The WHERE clause restricts the delete to rows where rn > 1, so the first row for each Email is kept.

c. Using Self-Joins

This method involves joining the table to itself to identify duplicates.

Example:

DELETE e1
FROM Employees e1
INNER JOIN Employees e2
    ON e1.Email = e2.Email
   AND e1.EmployeeID > e2.EmployeeID;

Explanation:

  • The table Employees is joined to itself (e1 and e2) based on the Email column.
  • The condition e1.EmployeeID > e2.EmployeeID matches every row that has another row with the same Email and a lower EmployeeID, so those rows are deleted.
  • The row with the lowest EmployeeID for each Email has no such match and is kept.
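Not every engine accepts DELETE with a JOIN. As a hedged alternative, the same self-comparison can be written as a correlated EXISTS subquery. The sketch below runs it on SQLite through Python's sqlite3 module; the use of SQLite and the trimmed two-column table are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Email TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [
        (1, "john.doe@example.com"),
        (2, "jane.smith@example.com"),
        (3, "john.doe@example.com"),
        (4, "alice.j@example.com"),
        (5, "john.doe@example.com"),
    ],
)

# Delete a row when another row with the same Email and a lower EmployeeID exists.
# The lowest EmployeeID per Email never satisfies the condition, so it survives.
conn.execute(
    """
    DELETE FROM Employees
    WHERE EXISTS (
        SELECT 1
        FROM Employees AS e2
        WHERE e2.Email = Employees.Email
          AND e2.EmployeeID < Employees.EmployeeID
    )
    """
)

remaining_ids = [
    row[0]
    for row in conn.execute("SELECT EmployeeID FROM Employees ORDER BY EmployeeID")
]
print(remaining_ids)
```

The logic mirrors the self-join: e2 plays the role of the "earlier" row, and the outer Employees row is the one being considered for deletion.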

d. Creating a Temporary Table with Distinct Records

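This approach builds a new table containing only unique records and then swaps it in for the original. The example uses a regular table as the staging area rather than a true temporary table, because it ends up replacing the original.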

Steps:

  1. Create a new table with the same structure as the original.
  2. Insert distinct records into the new table.
  3. Drop the original table.
  4. Rename the new table to the original name.

Example:

-- Step 1: Create a new table
CREATE TABLE Employees_Temp (
    EmployeeID INT PRIMARY KEY,
    FirstName VARCHAR(50),
    LastName VARCHAR(50),
    Email VARCHAR(100) UNIQUE,
    Department VARCHAR(50)
);

-- Step 2: Insert distinct records
INSERT INTO Employees_Temp (EmployeeID, FirstName, LastName, Email, Department)
SELECT MIN(EmployeeID), FirstName, LastName, Email, Department
FROM Employees
GROUP BY FirstName, LastName, Email, Department;

-- Step 3: Drop the original table
DROP TABLE Employees;

-- Step 4: Rename the new table
EXEC sp_rename 'Employees_Temp', 'Employees';

Explanation:

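  • Employees_Temp is created to hold unique records.
  • The INSERT statement selects the minimum EmployeeID for each group of rows that match on FirstName, LastName, Email, and Department. Rows that share an Email but differ in another column fall into separate groups and would violate the UNIQUE constraint on Email, so resolve those first.
  • The original Employees table is dropped, and Employees_Temp is renamed to Employees. sp_rename is specific to SQL Server; other systems use a form of ALTER TABLE ... RENAME.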

Note: This method can be more involved and may not be suitable for large tables or production environments without proper backups and transaction handling.
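The four steps can be wrapped in a single transaction so that a failure leaves the original table untouched. The sketch below assumes SQLite, where DDL statements are transactional and the rename is written as ALTER TABLE ... RENAME TO; other systems differ on both points, so treat this as an illustration rather than a portable recipe.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are exactly the BEGIN/COMMIT we issue.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, "
    "Email TEXT, Department TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [
        (1, "John", "Doe", "john.doe@example.com", "Sales"),
        (2, "Jane", "Smith", "jane.smith@example.com", "Marketing"),
        (3, "John", "Doe", "john.doe@example.com", "Sales"),
        (4, "Alice", "Johnson", "alice.j@example.com", "IT"),
        (5, "John", "Doe", "john.doe@example.com", "Sales"),
    ],
)

conn.execute("BEGIN")
try:
    # Step 1: new table with the constraints the cleaned data should satisfy.
    conn.execute(
        "CREATE TABLE Employees_Temp ("
        "EmployeeID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT, "
        "Email TEXT UNIQUE, Department TEXT)"
    )
    # Step 2: one row per distinct combination, keeping the lowest EmployeeID.
    conn.execute(
        "INSERT INTO Employees_Temp "
        "SELECT MIN(EmployeeID), FirstName, LastName, Email, Department "
        "FROM Employees GROUP BY FirstName, LastName, Email, Department"
    )
    # Steps 3 and 4: swap the cleaned table in for the original.
    conn.execute("DROP TABLE Employees")
    conn.execute("ALTER TABLE Employees_Temp RENAME TO Employees")
    conn.execute("COMMIT")
except sqlite3.Error:
    conn.execute("ROLLBACK")  # the original Employees table is left as it was
    raise

kept_ids = [
    row[0]
    for row in conn.execute("SELECT EmployeeID FROM Employees ORDER BY EmployeeID")
]
tables = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
]
print(kept_ids, tables)
```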

e. Using GROUP BY and Aggregate Functions to Identify Duplicates

GROUP BY is primarily used to identify duplicates, but combined with an aggregate such as MIN() it can also drive a DELETE.

Example:

DELETE FROM Employees
WHERE EmployeeID NOT IN (
    SELECT MIN(EmployeeID)
    FROM Employees
    GROUP BY Email
);

Explanation:

  • The subquery selects the minimum EmployeeID for each unique Email.
  • The DELETE statement removes records whose EmployeeID is not in the list of minimum EmployeeIDs, effectively removing duplicates.
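One detail worth checking before running this delete is how missing emails are handled: GROUP BY places all NULL values in a single group, so rows without an email are treated as duplicates of one another. The sketch below, run on SQLite through Python's sqlite3 module, adds two hypothetical rows with no email (IDs 6 and 7, not part of the sample data) and shows a guard that leaves them alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Email TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [
        (1, "john.doe@example.com"),
        (2, "jane.smith@example.com"),
        (3, "john.doe@example.com"),
        (6, None),  # hypothetical row with no email
        (7, None),  # hypothetical row with no email
    ],
)

KEEPERS = "SELECT MIN(EmployeeID) FROM Employees GROUP BY Email"

# Rows the unguarded delete would remove: note that a NULL-email row is included.
unguarded_ids = [
    row[0]
    for row in conn.execute(
        f"SELECT EmployeeID FROM Employees "
        f"WHERE EmployeeID NOT IN ({KEEPERS}) ORDER BY EmployeeID"
    )
]

# Guarded delete: only rows that actually have an email are candidates.
conn.execute(
    f"DELETE FROM Employees "
    f"WHERE Email IS NOT NULL AND EmployeeID NOT IN ({KEEPERS})"
)
remaining_ids = [
    row[0]
    for row in conn.execute("SELECT EmployeeID FROM Employees ORDER BY EmployeeID")
]
print(unguarded_ids, remaining_ids)
```

Whether rows with missing emails should count as duplicates is a data-modelling decision; the guard simply makes that decision explicit.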

4. Choosing the Right Method

The best method to remove duplicates depends on several factors, including:

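  • Database Size: On large tables, window-function methods (ROW_NUMBER() in a CTE or subquery) often scale better than a self-join, which pairs each row with every earlier row that shares the same key. Check the query plan on your own data to confirm.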
  • Complexity of Duplicates: If duplicates are based on multiple columns, ensure your method accounts for all relevant columns.
  • Database System: Some SQL dialects have specific syntax or features that can optimize duplicate removal.
  • Performance Considerations: Always test your queries to assess their performance, especially on large tables.
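The point about multi-column duplicates can be made concrete by comparing what two different PARTITION BY lists would delete. The rows below are hypothetical (the same email appearing in two departments), ids_to_delete is an illustrative helper name, and SQLite via Python's sqlite3 module is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "EmployeeID INTEGER PRIMARY KEY, Email TEXT, Department TEXT)"
)
# Hypothetical rows: the same email appears in two departments.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        (1, "john.doe@example.com", "Sales"),
        (2, "john.doe@example.com", "Sales"),
        (3, "john.doe@example.com", "IT"),
    ],
)


def ids_to_delete(partition_columns):
    """Return the EmployeeIDs a ROW_NUMBER() delete would remove.

    partition_columns is a fixed, trusted column list written in this
    script; it must never come from user input.
    """
    query = (
        "SELECT EmployeeID FROM ("
        "  SELECT EmployeeID, ROW_NUMBER() OVER ("
        f"    PARTITION BY {partition_columns} ORDER BY EmployeeID) AS rn"
        "  FROM Employees"
        ") WHERE rn > 1 ORDER BY EmployeeID"
    )
    return [row[0] for row in conn.execute(query)]


by_email = ids_to_delete("Email")
by_email_and_department = ids_to_delete("Email, Department")
print(by_email, by_email_and_department)
```

Partitioning by Email alone would also remove the IT row, while partitioning by Email and Department keeps it. Which result is correct depends on what counts as a duplicate in your data.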

5. Best Practices

  • Backup Your Data: Before performing delete operations, always back up your data to prevent accidental loss.

  • Test Queries: Use SELECT statements with your conditions to verify which records will be affected before executing DELETE.

    Example:

    WITH CTE_Duplicates AS (
        SELECT
            EmployeeID,
            ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
        FROM Employees
    )
    SELECT * FROM CTE_Duplicates WHERE rn > 1;

  • Use Transactions: Wrap your delete operations within transactions to allow rollback in case of errors.

    Example:

    BEGIN TRANSACTION;

    WITH CTE_Duplicates AS (
        SELECT
            EmployeeID,
            ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
        FROM Employees
    )
    DELETE FROM Employees
    WHERE EmployeeID IN (
        SELECT EmployeeID FROM CTE_Duplicates WHERE rn > 1
    );

    -- Verify the changes
    SELECT * FROM Employees;

    -- If everything is correct, commit
    COMMIT TRANSACTION;

    -- If there are issues, rollback
    -- ROLLBACK TRANSACTION;

  • Maintain Primary Keys and Unique Constraints: Properly defined primary keys and unique constraints prevent duplicates from occurring in the first place.
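The preview and transaction practices above can be combined in application code: count the rows the delete should remove, run the delete inside a transaction, and commit only if the outcome matches. This is a minimal sketch assuming SQLite and Python's sqlite3 module; the specific checks are illustrative, and a real cleanup would pick checks suited to its own data.

```python
import sqlite3

# Autocommit mode, so the explicit BEGIN/COMMIT/ROLLBACK below are the only
# transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Email TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [
        (1, "john.doe@example.com"),
        (2, "jane.smith@example.com"),
        (3, "john.doe@example.com"),
        (4, "alice.j@example.com"),
        (5, "john.doe@example.com"),
    ],
)

# IDs of every row after the first within each Email group.
DUPLICATE_IDS = (
    "SELECT EmployeeID FROM ("
    "  SELECT EmployeeID, ROW_NUMBER() OVER ("
    "    PARTITION BY Email ORDER BY EmployeeID) AS rn"
    "  FROM Employees"
    ") WHERE rn > 1"
)


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


conn.execute("BEGIN")
rows_before = scalar("SELECT COUNT(*) FROM Employees")
expected_deletes = scalar(f"SELECT COUNT(*) FROM ({DUPLICATE_IDS})")

conn.execute(f"DELETE FROM Employees WHERE EmployeeID IN ({DUPLICATE_IDS})")

rows_after = scalar("SELECT COUNT(*) FROM Employees")
still_duplicated = scalar(
    "SELECT COUNT(*) FROM ("
    "  SELECT Email FROM Employees GROUP BY Email HAVING COUNT(*) > 1"
    ")"
)

# Commit only if the delete removed what the preview predicted and no
# duplicated email is left; otherwise undo everything.
if rows_before - rows_after == expected_deletes and still_duplicated == 0:
    conn.execute("COMMIT")
    outcome = "committed"
else:
    conn.execute("ROLLBACK")
    outcome = "rolled back"

print(outcome, rows_after)
```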

6. Preventing Duplicates

While removing duplicates is essential, preventing them is even better. Here are some strategies:

  • Define Primary Keys: Ensure each table has a primary key to uniquely identify records.

    Example:

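    CREATE TABLE Employees (
        EmployeeID INT PRIMARY KEY,
        FirstName VARCHAR(50),
        LastName VARCHAR(50),
        Email VARCHAR(100) UNIQUE,
        Department VARCHAR(50)
    );

  • Use Unique Constraints: Apply UNIQUE constraints on columns that should not have duplicate values. On an existing table, remove the duplicates first, because the constraint cannot be added while duplicate values are present.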

    Example:

    ALTER TABLE Employees ADD CONSTRAINT UC_Email UNIQUE (Email);
  • Implement Indexes: Unique indexes can enforce uniqueness at the database level.

    Example:

    CREATE UNIQUE INDEX idx_unique_email ON Employees (Email);
  • Data Validation in Applications: Ensure that the application layer validates data before inserting it into the database to prevent duplicates.
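To see database-level enforcement in action, the sketch below creates the unique index from the example above and then attempts a duplicate insert. It assumes SQLite through Python's sqlite3 module, where the violation surfaces as sqlite3.IntegrityError; other drivers raise their own exception types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, Email TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_unique_email ON Employees (Email)")
conn.execute("INSERT INTO Employees VALUES (1, 'john.doe@example.com')")

# A second row with the same email violates the unique index and is rejected.
try:
    conn.execute("INSERT INTO Employees VALUES (2, 'john.doe@example.com')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM Employees").fetchone()[0]
print(rejected, row_count)
```

Application-level validation can still give users a friendlier message, but the index is what guarantees the rule holds even when two writers race or a bulk import bypasses the application.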

7. Conclusion

Removing duplicates in SQL is crucial for maintaining the integrity and reliability of your data. By leveraging methods such as Common Table Expressions with ROW_NUMBER(), self-joins, subqueries, and temporary tables, you can effectively identify and eliminate duplicate records based on your specific criteria. Additionally, implementing preventive measures like primary keys and unique constraints can help avoid the occurrence of duplicates, ensuring a cleaner and more efficient database environment.

Remember:

  • Always back up your data before performing bulk delete operations.
  • Test your duplicate identification queries to ensure accuracy.
  • Use transactions to safeguard against unintended data loss.
  • Implement preventive constraints to maintain data integrity moving forward.

By following these practices and understanding the various methods available, you'll be well-equipped to manage duplicates in your SQL databases efficiently.

CONTRIBUTOR
Design Gurus Team

Copyright © 2024 Designgurus, Inc. All rights reserved.