How to find duplicates in SQL?

Free Coding Questions Catalog

Boost your coding skills with our essential coding questions catalog. Take a step towards a better tech career now!

Finding duplicates in SQL is a common task that involves identifying records in a table that share the same values in one or more columns. Duplicates can lead to data inconsistencies and inaccuracies, so it's essential to detect and handle them appropriately. Below are several methods to find duplicates in SQL, along with examples to illustrate each approach.

1. Using `GROUP BY` and `HAVING` Clauses

This is the most straightforward method to identify duplicate records based on specific columns.

Steps:

Select the columns you want to check for duplicates.
Group the results by those columns.
Use the HAVING clause to filter groups that have a count greater than one.

Example: Suppose you have a table named Employees with the following columns: EmployeeID, FirstName, LastName, Email, and Department. You want to find duplicate email addresses.

SELECT Email, COUNT(*)
FROM Employees
GROUP BY Email
HAVING COUNT(*) > 1;

Explanation:

SELECT Email, COUNT(*): Selects the Email column and counts the number of occurrences for each email.
FROM Employees: Specifies the table to query.
GROUP BY Email: Groups the results by the Email column.
HAVING COUNT(*) > 1: Filters the groups to include only those with more than one occurrence, indicating duplicates.

2. Using Window Functions (`ROW_NUMBER`)

Window functions provide a more flexible approach, especially when you need to retrieve complete duplicate records, not just the duplicate values.

Steps:

Use the ROW_NUMBER() function to assign a unique sequential number to rows within a partition of specified columns.
Filter the results to include only rows where the row number is greater than one, indicating duplicates.

Example: Find all duplicate records based on FirstName and LastName.

SELECT *
FROM (
    SELECT 
        *,
        ROW_NUMBER() OVER (PARTITION BY FirstName, LastName ORDER BY EmployeeID) AS rn
    FROM Employees
) AS sub
WHERE rn > 1;

Explanation:

ROW_NUMBER() OVER (PARTITION BY FirstName, LastName ORDER BY EmployeeID): Assigns a unique row number to each record within the partition of FirstName and LastName, ordered by EmployeeID.
WHERE rn > 1: Filters out the first occurrence, leaving only duplicates.

3. Using Self-Joins

Self-joins can be used to compare rows within the same table to identify duplicates.

Steps:

Join the table to itself based on the columns you want to check for duplicates.
Ensure that you exclude identical rows by using a condition like t1.EmployeeID <> t2.EmployeeID.

Example: Identify duplicate email addresses using a self-join.

SELECT t1.*
FROM Employees t1
JOIN Employees t2 ON t1.Email = t2.Email
WHERE t1.EmployeeID <> t2.EmployeeID;

Explanation:

JOIN Employees t2 ON t1.Email = t2.Email: Joins the table Employees to itself where the Email values match.
WHERE t1.EmployeeID <> t2.EmployeeID: Ensures that the joined rows are different records, effectively identifying duplicates.

**4. Using `COUNT(*)` with `GROUP BY` for Multiple Columns**

To find duplicates based on multiple columns, extend the GROUP BY clause accordingly.

Example: Find duplicates based on FirstName, LastName, and Department.

SELECT FirstName, LastName, Department, COUNT(*)
FROM Employees
GROUP BY FirstName, LastName, Department
HAVING COUNT(*) > 1;

Explanation:

GROUP BY FirstName, LastName, Department: Groups records by the combination of FirstName, LastName, and Department.
HAVING COUNT(*) > 1: Identifies groups with more than one record, indicating duplicates.

5. Using `EXISTS` Clause

The EXISTS clause can help identify duplicate records by checking the existence of another record with the same values.

Example: Find duplicate email addresses.

SELECT *
FROM Employees e1
WHERE EXISTS (
    SELECT 1
    FROM Employees e2
    WHERE e1.Email = e2.Email
    AND e1.EmployeeID <> e2.EmployeeID
);

Explanation:

EXISTS: Checks if there is at least one other record with the same Email.
e1.EmployeeID <> e2.EmployeeID: Ensures that the duplicate is not the same record.

6. Using `DISTINCT` and `COUNT` to Compare Unique Counts

This method helps in understanding the extent of duplication by comparing total records with unique records.

Example: Determine how many duplicate email addresses exist.

SELECT 
    COUNT(*) - COUNT(DISTINCT Email) AS DuplicateCount
FROM Employees;

Explanation:

COUNT(*): Counts all records.
COUNT(DISTINCT Email): Counts unique email addresses.
COUNT(*) - COUNT(DISTINCT Email): Calculates the number of duplicate email addresses.

7. Using Common Table Expressions (CTEs) for Clarity

CTEs can make complex queries more readable and manageable.

Example: Find duplicate FirstName and LastName using a CTE.

WITH DuplicateNames AS (
    SELECT 
        FirstName, 
        LastName, 
        COUNT(*) AS NameCount
    FROM Employees
    GROUP BY FirstName, LastName
    HAVING COUNT(*) > 1
)
SELECT e.*
FROM Employees e
JOIN DuplicateNames dn
ON e.FirstName = dn.FirstName
AND e.LastName = dn.LastName;

Explanation:

WITH DuplicateNames AS (...): Defines a CTE that identifies duplicate names.
JOIN DuplicateNames dn ON ...: Retrieves all records that have duplicate FirstName and LastName.

Considerations When Finding Duplicates

Defining Duplicates:
Clearly define which columns should be unique. Sometimes, duplicates are based on a single column (e.g., email), while other times, a combination of columns defines uniqueness (e.g., FirstName + LastName).
Performance:
For large datasets, methods using GROUP BY and HAVING are generally more efficient. However, window functions can offer more flexibility for complex scenarios.
Handling NULLs:
Depending on the SQL dialect, NULL values can affect duplicate detection. For example, in some systems, multiple NULL values are considered duplicates, while others treat them as distinct.
Case Sensitivity:
Be aware of case sensitivity in your database. For instance, 'John.Doe@example.com' and 'john.doe@example.com' might be treated as duplicates or distinct based on the collation settings.
Removing Duplicates:
While identifying duplicates is crucial, you might also need to remove them. Common approaches include using DELETE statements with subqueries or window functions to retain only the first occurrence of each duplicate.

Example: Removing Duplicate Emails While Retaining the First Entry

WITH RankedEmployees AS (
    SELECT 
        EmployeeID,
        Email,
        ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM RankedEmployees
    WHERE rn > 1
);

Explanation:

ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID): Assigns a unique row number within each partition of Email.
WHERE rn > 1: Identifies all duplicate records beyond the first occurrence.
DELETE: Removes the duplicate records from the Employees table.

Conclusion

Finding duplicates in SQL is essential for maintaining data integrity and ensuring accurate analyses. Depending on your specific requirements and the complexity of your data, you can choose from various methods such as using GROUP BY with HAVING, window functions like ROW_NUMBER(), self-joins, or CTEs for more readable queries. Always consider the performance implications and ensure that your duplicate detection logic aligns with your data's unique constraints and business rules.

By mastering these techniques, you can efficiently identify and manage duplicate records, contributing to cleaner and more reliable databases.

How to find duplicates in SQL?

1. Using GROUP BY and HAVING Clauses

2. Using Window Functions (ROW_NUMBER)

3. Using Self-Joins

4. Using COUNT(*) with GROUP BY for Multiple Columns

5. Using EXISTS Clause

6. Using DISTINCT and COUNT to Compare Unique Counts

7. Using Common Table Expressions (CTEs) for Clarity

Considerations When Finding Duplicates

Conclusion

1. Using `GROUP BY` and `HAVING` Clauses

2. Using Window Functions (`ROW_NUMBER`)

**4. Using `COUNT(*)` with `GROUP BY` for Multiple Columns**

5. Using `EXISTS` Clause

6. Using `DISTINCT` and `COUNT` to Compare Unique Counts