How to skip duplicate records in SQL?


Skipping duplicate records in SQL is a common requirement to ensure data integrity, optimize query performance, and maintain accurate results. Whether you're retrieving data, inserting new records, or performing updates, handling duplicates effectively is essential. This comprehensive guide explores various methods to skip duplicate records in SQL, providing detailed explanations and practical examples for each approach.


1. Understanding Duplicate Records

Duplicate records are rows in a table that have identical values across one or more columns. The definition of a duplicate depends on which columns you consider for comparison. For instance:

  • Exact Duplicates: All columns in the row are identical.
  • Partial Duplicates: Only specific columns have identical values.

Example Scenario:

Consider the Employees table:

EmployeeID | FirstName | LastName | Email | Department
1 | John | Doe | john.doe@example.com | Sales
2 | Jane | Smith | jane.smith@example.com | Marketing
3 | John | Doe | john.doe@example.com | Sales
4 | Alice | Johnson | alice.j@example.com | IT
5 | John | Doe | john.doe@example.com | Sales

In this table:

  • Exact Duplicates: Rows with EmployeeID 1, 3, and 5 have identical values in every column except the EmployeeID key itself.
  • Partial Duplicates: Rows that share the same Email (or the same FirstName and LastName) but differ in other columns.

2. Methods to Skip Duplicate Records in SQL

Depending on your specific requirements—whether you're retrieving data without duplicates, inserting new records while avoiding duplicates, or updating records—you can employ different strategies. Below are the most effective methods categorized based on the operation you intend to perform.

A. Retrieving Data Without Duplicates

When querying data, you might want to skip duplicate records to ensure each record in your result set is unique based on certain criteria.

1. Using the DISTINCT Keyword

The DISTINCT keyword eliminates duplicate rows in the result set. It considers all selected columns; thus, rows are considered duplicates only if all selected columns have identical values.

Syntax:

SELECT DISTINCT column1, column2, ... FROM table_name;

Example: Retrieve unique combinations of FirstName and LastName:

SELECT DISTINCT FirstName, LastName FROM Employees;

Result:

FirstName | LastName
John | Doe
Jane | Smith
Alice | Johnson

Note: EmployeeID and other unique columns are left out of the SELECT so that duplicates are identified by name alone.
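
For contrast, selecting a unique column alongside the names defeats the de-duplication; this sketch against the sample table returns all five rows because every EmployeeID value is different:

SELECT DISTINCT EmployeeID, FirstName, LastName
FROM Employees;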

2. Using GROUP BY Clause

The GROUP BY clause groups rows that have the same values in specified columns. It is often used in conjunction with aggregate functions like COUNT, SUM, etc., but can also serve to retrieve unique records.

Syntax:

SELECT column1, column2, ... FROM table_name GROUP BY column1, column2, ...;

Example: Retrieve unique Email addresses:

SELECT Email FROM Employees GROUP BY Email;

Result:

Email
john.doe@example.com
jane.smith@example.com
alice.j@example.com

Advantages:

  • Flexible for aggregations.
  • Can be combined with HAVING to filter groups based on aggregate conditions, as shown in the sketch after this list.
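
For example, to list only the Email values that occur more than once in the sample table, HAVING can filter on the aggregate count:

SELECT Email, COUNT(*) AS Occurrences
FROM Employees
GROUP BY Email
HAVING COUNT(*) > 1;
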
3. Using Window Functions (ROW_NUMBER())

Window functions provide advanced capabilities to handle duplicates, especially when you need to retrieve complete records and decide which duplicates to skip based on specific criteria.

Syntax:

WITH CTE AS (
    SELECT column1, column2, ...,
           ROW_NUMBER() OVER (PARTITION BY duplicate_column1, duplicate_column2, ... ORDER BY sort_column) AS rn
    FROM table_name
)
SELECT *
FROM CTE
WHERE rn = 1;

Example: Retrieve the first occurrence of each Email:

WITH CTE_Duplicates AS (
    SELECT EmployeeID, FirstName, LastName, Email, Department,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
SELECT EmployeeID, FirstName, LastName, Email, Department
FROM CTE_Duplicates
WHERE rn = 1;

Result:

EmployeeID | FirstName | LastName | Email | Department
1 | John | Doe | john.doe@example.com | Sales
2 | Jane | Smith | jane.smith@example.com | Marketing
4 | Alice | Johnson | alice.j@example.com | IT

Explanation:

  • The ROW_NUMBER() function assigns a unique sequential number to each row within the partition of Email.
  • By selecting rows where rn = 1, only the first occurrence of each Email is retrieved.
4. Using DISTINCT ON (PostgreSQL Specific)

PostgreSQL offers the DISTINCT ON clause, allowing you to retrieve the first row of each set of duplicates based on specified columns.

Syntax:

SELECT DISTINCT ON (duplicate_column1, duplicate_column2, ...)
       column1, column2, ...
FROM table_name
ORDER BY duplicate_column1, duplicate_column2, ..., sort_column;

Example: Retrieve the first EmployeeID for each Email:

SELECT DISTINCT ON (Email)
       EmployeeID, FirstName, LastName, Email, Department
FROM Employees
ORDER BY Email, EmployeeID;

Result:

EmployeeID | FirstName | LastName | Email | Department
4 | Alice | Johnson | alice.j@example.com | IT
2 | Jane | Smith | jane.smith@example.com | Marketing
1 | John | Doe | john.doe@example.com | Sales

Note: The ORDER BY clause determines which row is selected within each group of duplicates.
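
For instance, sorting EmployeeID in descending order keeps the highest EmployeeID per Email instead of the lowest; a hedged variation of the query above (PostgreSQL only):

SELECT DISTINCT ON (Email)
       EmployeeID, FirstName, LastName, Email, Department
FROM Employees
ORDER BY Email, EmployeeID DESC;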

5. Using SELECT DISTINCT with Subqueries

When you need to retrieve specific columns while maintaining the uniqueness based on other columns, subqueries can be effective.

Example: Retrieve all employee details where the Email is unique:

SELECT *
FROM Employees e1
WHERE EmployeeID = (
    SELECT MIN(EmployeeID)
    FROM Employees e2
    WHERE e2.Email = e1.Email
);

Explanation:

  • For each employee, the subquery finds the minimum EmployeeID with the same Email.
  • Only the employee with the smallest EmployeeID for each Email is selected, effectively skipping duplicates.

B. Inserting Data While Skipping Duplicates

When inserting new records, you might want to skip inserting duplicates to maintain data integrity without generating errors.

1. Using INSERT IGNORE (MySQL Specific)

The INSERT IGNORE statement attempts to insert each row but ignores rows that would cause duplicate key violations.

Syntax:

INSERT IGNORE INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...),
       (value3, value4, ...),
       ...;

Example: Insert new employees, skipping those with a duplicate Email (this relies on a UNIQUE constraint or index on Email):

INSERT IGNORE INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
VALUES (6, 'Emily', 'Clark', 'emily.clark@example.com', 'HR'),
       (7, 'John', 'Doe', 'john.doe@example.com', 'Sales');  -- duplicate Email

Result:

  • Employee with EmployeeID 6 is inserted.
  • Employee with EmployeeID 7 is skipped due to duplicate Email.
2. Using ON CONFLICT Clause (PostgreSQL Specific)

PostgreSQL allows handling conflicts using the ON CONFLICT clause, specifying actions like doing nothing or updating existing records.

Syntax:

INSERT INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...)
ON CONFLICT (conflict_column) DO NOTHING;

Example: Insert new employees, skipping those with a duplicate Email (the column named in ON CONFLICT must carry a UNIQUE constraint or index):

INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
VALUES (6, 'Emily', 'Clark', 'emily.clark@example.com', 'HR'),
       (7, 'John', 'Doe', 'john.doe@example.com', 'Sales')
ON CONFLICT (Email) DO NOTHING;

Result:

  • Employee with EmployeeID 6 is inserted.
  • Employee with EmployeeID 7 is skipped due to duplicate Email.
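
Because ON CONFLICT can also update the existing row rather than skip it, an upsert variant is possible too. A minimal sketch, assuming the same Employees table and a UNIQUE constraint on Email:

INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
VALUES (7, 'John', 'Doe', 'john.doe@example.com', 'Sales')
ON CONFLICT (Email) DO UPDATE
SET Department = EXCLUDED.Department;  -- existing row keeps its EmployeeID; only Department changes
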
3. Using MERGE Statement (SQL Server and Oracle)

The MERGE statement allows conditional insertion or updating based on whether a record exists.

Syntax:

MERGE INTO target_table AS target
USING source_table AS source
ON (target.conflict_column = source.conflict_column)
WHEN NOT MATCHED THEN
    INSERT (column1, column2, ...)
    VALUES (source.column1, source.column2, ...);

Example: Insert new employees only if the Email does not already exist (SQL Server syntax shown; Oracle's MERGE uses slightly different alias and source syntax):

MERGE INTO Employees AS target
USING (VALUES
        (6, 'Emily', 'Clark', 'emily.clark@example.com', 'HR'),
        (7, 'John', 'Doe', 'john.doe@example.com', 'Sales')
      ) AS source (EmployeeID, FirstName, LastName, Email, Department)
ON target.Email = source.Email
WHEN NOT MATCHED THEN
    INSERT (EmployeeID, FirstName, LastName, Email, Department)
    VALUES (source.EmployeeID, source.FirstName, source.LastName, source.Email, source.Department);

Result:

  • Employee with EmployeeID 6 is inserted.
  • Employee with EmployeeID 7 is skipped due to duplicate Email.

C. Updating Records While Skipping Duplicates

When updating records, you might need to skip updating certain records to avoid conflicts or maintain data integrity.

1. Using Conditional Updates

Apply conditions in the WHERE clause to skip duplicates.

Example: Update the Department of employee 8 only if no employee already uses the Email new.email@example.com.

UPDATE Employees e1
SET Department = 'Research'
WHERE e1.EmployeeID = 8
  AND NOT EXISTS (
      SELECT 1
      FROM Employees e2
      WHERE e2.Email = 'new.email@example.com'
  );

Explanation:

  • The update occurs only if no employee currently has the Email new.email@example.com. A variant that guards the Email change itself is sketched below.
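
A closely related variant guards the Email change itself, so the update is skipped whenever the new address is already in use. A hedged sketch (note that MySQL typically rejects a subquery that reads the table being updated, so there the subquery would need to be wrapped in a derived table):

UPDATE Employees
SET Email = 'new.email@example.com'
WHERE EmployeeID = 8
  AND NOT EXISTS (
      SELECT 1
      FROM Employees e2
      WHERE e2.Email = 'new.email@example.com'
  );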

3. Removing Duplicate Rows While Keeping One Instance

In scenarios where duplicate rows already exist in the table and you want to remove duplicates while retaining one unique instance, various methods can be employed.

1. Using CTE with ROW_NUMBER()

Example: Remove duplicate Email records, keeping the one with the smallest EmployeeID.

WITH CTE_Duplicates AS (
    SELECT EmployeeID,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM CTE_Duplicates
    WHERE rn > 1
);

Explanation:

  • The CTE assigns a row number to each record partitioned by Email.
  • The DELETE statement removes all records where rn > 1, keeping only the first occurrence.
2. Using Self-Joins

Example: Delete duplicate Email records, keeping the one with the lowest EmployeeID.

DELETE e1
FROM Employees e1
INNER JOIN Employees e2
    ON e1.Email = e2.Email
   AND e1.EmployeeID > e2.EmployeeID;

Explanation:

  • The table joins itself on Email.
  • Records with a higher EmployeeID are deleted, preserving the record with the lowest EmployeeID.
3. Using GROUP BY with Aggregate Functions

Example: Delete duplicates based on Email, keeping the record with the minimum EmployeeID.

DELETE FROM Employees
WHERE EmployeeID NOT IN (
    SELECT MIN(EmployeeID)
    FROM Employees
    GROUP BY Email
);

Explanation:

  • The subquery selects the minimum EmployeeID for each unique Email.
  • The DELETE statement removes records not in this list, effectively removing duplicates.
4. Using Temporary Tables

Example: Create a temporary table with unique records and replace the original table.

-- Step 1: Create a temporary table containing one row per unique record
CREATE TABLE Employees_Temp AS
SELECT MIN(EmployeeID) AS EmployeeID, FirstName, LastName, Email, Department
FROM Employees
GROUP BY FirstName, LastName, Email, Department;

-- Step 2: Drop the original table
DROP TABLE Employees;

-- Step 3: Rename the temporary table to the original name
ALTER TABLE Employees_Temp RENAME TO Employees;

Note: This method is more involved and should be used with caution, especially in production environments; CREATE TABLE ... AS does not copy indexes, constraints, or permissions, so those must be recreated on the new table.


4. Best Practices for Handling Duplicates

  1. Implement Proper Constraints:
    • Primary Keys and Unique Constraints: Enforce uniqueness at the database level to prevent duplicates.
      ALTER TABLE Employees ADD CONSTRAINT UC_Email UNIQUE (Email);
  2. Use Transactions:
    • Group related operations within transactions to maintain atomicity.
      BEGIN TRANSACTION;
      -- operations to remove duplicates
      COMMIT TRANSACTION;
  3. Backup Data:
    • Always back up your database before performing bulk delete operations.
      BACKUP DATABASE YourDatabase TO DISK = 'backup.bak';
  4. Validate Data Before Insertion:
    • Implement checks in your application or database to validate data before inserting it (a portable sketch follows this list).
  5. Regularly Audit and Clean Data:
    • Periodically check for and remove duplicates to maintain data integrity.
  6. Optimize Queries:
    • Use indexes on columns frequently involved in duplicate checks to enhance performance.
      CREATE INDEX idx_email ON Employees (Email);
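
For item 4, one portable way to validate inside the database itself is to guard the insert with NOT EXISTS. A sketch with hypothetical values (some databases, e.g. Oracle and older MySQL versions, require a dummy FROM dual clause in a SELECT without a table):

INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
SELECT 8, 'Raj', 'Patel', 'raj.patel@example.com', 'Finance'
WHERE NOT EXISTS (
    SELECT 1 FROM Employees WHERE Email = 'raj.patel@example.com'
);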

5. Considerations and Caveats

  • Performance Impact:
    • Methods involving ROW_NUMBER() and CTEs can be resource-intensive on large datasets. Ensure that appropriate indexes are in place to optimize performance.
  • Database-Specific Features:
    • Some SQL dialects offer specific functions or clauses (e.g., DISTINCT ON in PostgreSQL) that can simplify duplicate handling. Familiarize yourself with your DBMS's capabilities.
  • Data Integrity:
    • Ensure that removing duplicates doesn't inadvertently delete essential data. Always review which records will be removed; one way to do this is sketched below.
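
A simple safeguard is to run the SELECT form of a cleanup statement first and inspect exactly which rows it would remove; for the ROW_NUMBER() approach used earlier, a sketch looks like:

WITH CTE_Duplicates AS (
    SELECT EmployeeID, Email,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
SELECT *
FROM CTE_Duplicates
WHERE rn > 1;  -- these are the rows a DELETE ... WHERE rn > 1 would remove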

6. Example Across Different SQL Dialects

A. SQL Server

Using CTE with ROW_NUMBER():

WITH CTE_Duplicates AS (
    SELECT EmployeeID,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM CTE_Duplicates
    WHERE rn > 1
);

B. MySQL

Using Self-Joins:

DELETE e1
FROM Employees e1
INNER JOIN Employees e2
    ON e1.Email = e2.Email
   AND e1.EmployeeID > e2.EmployeeID;

Note: In MySQL versions prior to 8.0, window functions like ROW_NUMBER() are not supported, making self-joins a viable option.
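
In MySQL 8.0 and later, ROW_NUMBER() is available, so a window-function cleanup works there as well. A hedged sketch; the ranking query is wrapped in a derived table because MySQL does not allow a DELETE to read the target table directly in a subquery:

DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM (
        SELECT EmployeeID,
               ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
        FROM Employees
    ) AS ranked
    WHERE rn > 1
);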

C. PostgreSQL

Using DISTINCT ON:

CREATE TABLE Employees_Temp AS
SELECT DISTINCT ON (Email) *
FROM Employees
ORDER BY Email, EmployeeID;

DROP TABLE Employees;

ALTER TABLE Employees_Temp RENAME TO Employees;

7. Conclusion

Skipping duplicate records in SQL is vital for maintaining a clean, efficient, and reliable database. Whether you're retrieving unique data, inserting new records without causing duplicates, or cleaning existing data, understanding and utilizing the appropriate methods ensures data integrity and optimal performance.

Key Takeaways:

  • Use DISTINCT and GROUP BY to retrieve unique records based on specific columns.
  • Employ Window Functions like ROW_NUMBER() for advanced duplicate handling.
  • Leverage Database-Specific Features such as DISTINCT ON in PostgreSQL for streamlined operations.
  • Implement Preventive Measures including primary keys, unique constraints, and proper data validation to avoid duplicates.
  • Always Backup and Test before performing bulk delete operations to safeguard against accidental data loss.

By mastering these techniques and adhering to best practices, you can effectively manage duplicate records in your SQL databases, ensuring data accuracy and system reliability.
