How to find duplicates in SQL?

Free Coding Questions Catalog
Boost your coding skills with our essential coding questions catalog. Take a step towards a better tech career now!

Finding duplicates in SQL is a common task that involves identifying records in a table that share the same values in one or more columns. Duplicates can lead to data inconsistencies and inaccuracies, so it's essential to detect and handle them appropriately. Below are several methods to find duplicates in SQL, along with examples to illustrate each approach.

1. Using GROUP BY and HAVING Clauses

This is the most straightforward method to identify duplicate records based on specific columns.

Steps:

  1. Select the columns you want to check for duplicates.
  2. Group the results by those columns.
  3. Use the HAVING clause to filter groups that have a count greater than one.

Example: Suppose you have a table named Employees with the following columns: EmployeeID, FirstName, LastName, Email, and Department. You want to find duplicate email addresses.

SELECT Email, COUNT(*) FROM Employees GROUP BY Email HAVING COUNT(*) > 1;

Explanation:

  • SELECT Email, COUNT(*): Selects the Email column and counts the number of occurrences for each email.
  • FROM Employees: Specifies the table to query.
  • GROUP BY Email: Groups the results by the Email column.
  • HAVING COUNT(*) > 1: Filters the groups to include only those with more than one occurrence, indicating duplicates.

2. Using Window Functions (ROW_NUMBER)

Window functions provide a more flexible approach, especially when you need to retrieve complete duplicate records, not just the duplicate values.

Steps:

  1. Use the ROW_NUMBER() function to assign a unique sequential number to rows within a partition of specified columns.
  2. Filter the results to include only rows where the row number is greater than one, indicating duplicates.

Example: Find all duplicate records based on FirstName and LastName.

SELECT * FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY FirstName, LastName ORDER BY EmployeeID) AS rn FROM Employees ) AS sub WHERE rn > 1;

Explanation:

  • ROW_NUMBER() OVER (PARTITION BY FirstName, LastName ORDER BY EmployeeID): Assigns a unique row number to each record within the partition of FirstName and LastName, ordered by EmployeeID.
  • WHERE rn > 1: Filters out the first occurrence, leaving only duplicates.

3. Using Self-Joins

Self-joins can be used to compare rows within the same table to identify duplicates.

Steps:

  1. Join the table to itself based on the columns you want to check for duplicates.
  2. Ensure that you exclude identical rows by using a condition like t1.EmployeeID <> t2.EmployeeID.

Example: Identify duplicate email addresses using a self-join.

SELECT t1.* FROM Employees t1 JOIN Employees t2 ON t1.Email = t2.Email WHERE t1.EmployeeID <> t2.EmployeeID;

Explanation:

  • JOIN Employees t2 ON t1.Email = t2.Email: Joins the table Employees to itself where the Email values match.
  • WHERE t1.EmployeeID <> t2.EmployeeID: Ensures that the joined rows are different records, effectively identifying duplicates.

4. Using COUNT(*) with GROUP BY for Multiple Columns

To find duplicates based on multiple columns, extend the GROUP BY clause accordingly.

Example: Find duplicates based on FirstName, LastName, and Department.

SELECT FirstName, LastName, Department, COUNT(*) FROM Employees GROUP BY FirstName, LastName, Department HAVING COUNT(*) > 1;

Explanation:

  • GROUP BY FirstName, LastName, Department: Groups records by the combination of FirstName, LastName, and Department.
  • HAVING COUNT(*) > 1: Identifies groups with more than one record, indicating duplicates.

5. Using EXISTS Clause

The EXISTS clause can help identify duplicate records by checking the existence of another record with the same values.

Example: Find duplicate email addresses.

SELECT * FROM Employees e1 WHERE EXISTS ( SELECT 1 FROM Employees e2 WHERE e1.Email = e2.Email AND e1.EmployeeID <> e2.EmployeeID );

Explanation:

  • EXISTS: Checks if there is at least one other record with the same Email.
  • e1.EmployeeID <> e2.EmployeeID: Ensures that the duplicate is not the same record.

6. Using DISTINCT and COUNT to Compare Unique Counts

This method helps in understanding the extent of duplication by comparing total records with unique records.

Example: Determine how many duplicate email addresses exist.

SELECT COUNT(*) - COUNT(DISTINCT Email) AS DuplicateCount FROM Employees;

Explanation:

  • COUNT(*): Counts all records.
  • COUNT(DISTINCT Email): Counts unique email addresses.
  • COUNT(*) - COUNT(DISTINCT Email): Calculates the number of duplicate email addresses.

7. Using Common Table Expressions (CTEs) for Clarity

CTEs can make complex queries more readable and manageable.

Example: Find duplicate FirstName and LastName using a CTE.

WITH DuplicateNames AS ( SELECT FirstName, LastName, COUNT(*) AS NameCount FROM Employees GROUP BY FirstName, LastName HAVING COUNT(*) > 1 ) SELECT e.* FROM Employees e JOIN DuplicateNames dn ON e.FirstName = dn.FirstName AND e.LastName = dn.LastName;

Explanation:

  • WITH DuplicateNames AS (...): Defines a CTE that identifies duplicate names.
  • JOIN DuplicateNames dn ON ...: Retrieves all records that have duplicate FirstName and LastName.

Considerations When Finding Duplicates

  1. Defining Duplicates:
    Clearly define which columns should be unique. Sometimes, duplicates are based on a single column (e.g., email), while other times, a combination of columns defines uniqueness (e.g., FirstName + LastName).

  2. Performance:
    For large datasets, methods using GROUP BY and HAVING are generally more efficient. However, window functions can offer more flexibility for complex scenarios.

  3. Handling NULLs:
    Depending on the SQL dialect, NULL values can affect duplicate detection. For example, in some systems, multiple NULL values are considered duplicates, while others treat them as distinct.

  4. Case Sensitivity:
    Be aware of case sensitivity in your database. For instance, 'John.Doe@example.com' and 'john.doe@example.com' might be treated as duplicates or distinct based on the collation settings.

  5. Removing Duplicates:
    While identifying duplicates is crucial, you might also need to remove them. Common approaches include using DELETE statements with subqueries or window functions to retain only the first occurrence of each duplicate.

Example: Removing Duplicate Emails While Retaining the First Entry

WITH RankedEmployees AS ( SELECT EmployeeID, Email, ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn FROM Employees ) DELETE FROM Employees WHERE EmployeeID IN ( SELECT EmployeeID FROM RankedEmployees WHERE rn > 1 );

Explanation:

  • ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID): Assigns a unique row number within each partition of Email.
  • WHERE rn > 1: Identifies all duplicate records beyond the first occurrence.
  • DELETE: Removes the duplicate records from the Employees table.

Conclusion

Finding duplicates in SQL is essential for maintaining data integrity and ensuring accurate analyses. Depending on your specific requirements and the complexity of your data, you can choose from various methods such as using GROUP BY with HAVING, window functions like ROW_NUMBER(), self-joins, or CTEs for more readable queries. Always consider the performance implications and ensure that your duplicate detection logic aligns with your data's unique constraints and business rules.

By mastering these techniques, you can efficiently identify and manage duplicate records, contributing to cleaner and more reliable databases.

TAGS
Coding Interview
System Design Interview
CONTRIBUTOR
Design Gurus Team

GET YOUR FREE

Coding Questions Catalog

Design Gurus Newsletter - Latest from our Blog
Boost your coding skills with our essential coding questions catalog.
Take a step towards a better tech career now!
Explore Answers
What is Tesla's motto?
What is Apple's motto?
What is a software engineer trainee?
Related Courses
Image
Grokking the Coding Interview: Patterns for Coding Questions
Image
Grokking Data Structures & Algorithms for Coding Interviews
Image
Grokking Advanced Coding Patterns for Interviews
Image
One-Stop Portal For Tech Interviews.
Copyright © 2024 Designgurus, Inc. All rights reserved.