How to skip duplicate records in SQL?


Skipping duplicate records in SQL is a common requirement to ensure data integrity, optimize query performance, and maintain accurate results. Whether you're retrieving data, inserting new records, or performing updates, handling duplicates effectively is essential. This comprehensive guide explores various methods to skip duplicate records in SQL, providing detailed explanations and practical examples for each approach.


1. Understanding Duplicate Records

Duplicate records are rows in a table that have identical values across one or more columns. The definition of a duplicate depends on which columns you consider for comparison. For instance:

  • Exact Duplicates: All columns in the row are identical.
  • Partial Duplicates: Only specific columns have identical values.

Example Scenario:

Consider the Employees table:

EmployeeID | FirstName | LastName | Email | Department
1 | John | Doe | john.doe@example.com | Sales
2 | Jane | Smith | jane.smith@example.com | Marketing
3 | John | Doe | john.doe@example.com | Sales
4 | Alice | Johnson | alice.j@example.com | IT
5 | John | Doe | john.doe@example.com | Sales

In this table:

  • Exact Duplicates: Rows with EmployeeID 1, 3, and 5 have identical values in every column except the EmployeeID key itself.
  • Partial Duplicates: Rows that share the same Email (or the same FirstName and LastName) but differ in other columns.

2. Methods to Skip Duplicate Records in SQL

Depending on your specific requirements—whether you're retrieving data without duplicates, inserting new records while avoiding duplicates, or updating records—you can employ different strategies. Below are the most effective methods categorized based on the operation you intend to perform.

A. Retrieving Data Without Duplicates

When querying data, you might want to skip duplicate records to ensure each record in your result set is unique based on certain criteria.

1. Using the DISTINCT Keyword

The DISTINCT keyword eliminates duplicate rows in the result set. It considers all selected columns; thus, rows are considered duplicates only if all selected columns have identical values.

Syntax:

SELECT DISTINCT column1, column2, ... FROM table_name;

Example: Retrieve unique combinations of FirstName and LastName:

SELECT DISTINCT FirstName, LastName FROM Employees;

Result:

FirstName | LastName
John | Doe
Jane | Smith
Alice | Johnson

Note: EmployeeID and other unique columns are left out of the SELECT so that duplicates are identified by name alone.
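
For contrast, selecting a unique column alongside the names defeats the de-duplication; this sketch against the sample table returns all five rows because every EmployeeID value is different:

SELECT DISTINCT EmployeeID, FirstName, LastName
FROM Employees;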

2. Using GROUP BY Clause

The GROUP BY clause groups rows that have the same values in specified columns. It is often used in conjunction with aggregate functions like COUNT, SUM, etc., but can also serve to retrieve unique records.

Syntax:

SELECT column1, column2, ... FROM table_name GROUP BY column1, column2, ...;

Example: Retrieve unique Email addresses:

SELECT Email FROM Employees GROUP BY Email;

Result:

Email
john.doe@example.com
jane.smith@example.com
alice.j@example.com

Advantages:

  • Flexible for aggregations.
  • Can be combined with HAVING to filter groups based on aggregate conditions, as shown in the sketch after this list.
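
For example, to list only the Email values that occur more than once in the sample table, HAVING can filter on the aggregate count:

SELECT Email, COUNT(*) AS Occurrences
FROM Employees
GROUP BY Email
HAVING COUNT(*) > 1;
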
3. Using Window Functions (ROW_NUMBER())

Window functions provide advanced capabilities to handle duplicates, especially when you need to retrieve complete records and decide which duplicates to skip based on specific criteria.

Syntax:

WITH CTE AS (
    SELECT column1, column2, ...,
           ROW_NUMBER() OVER (PARTITION BY duplicate_column1, duplicate_column2, ... ORDER BY sort_column) AS rn
    FROM table_name
)
SELECT *
FROM CTE
WHERE rn = 1;

Example: Retrieve the first occurrence of each Email:

WITH CTE_Duplicates AS (
    SELECT EmployeeID, FirstName, LastName, Email, Department,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
SELECT EmployeeID, FirstName, LastName, Email, Department
FROM CTE_Duplicates
WHERE rn = 1;

Result:

EmployeeID | FirstName | LastName | Email | Department
1 | John | Doe | john.doe@example.com | Sales
2 | Jane | Smith | jane.smith@example.com | Marketing
4 | Alice | Johnson | alice.j@example.com | IT

Explanation:

  • The ROW_NUMBER() function assigns a unique sequential number to each row within the partition of Email.
  • By selecting rows where rn = 1, only the first occurrence of each Email is retrieved.
4. Using DISTINCT ON (PostgreSQL Specific)

PostgreSQL offers the DISTINCT ON clause, allowing you to retrieve the first row of each set of duplicates based on specified columns.

Syntax:

SELECT DISTINCT ON (duplicate_column1, duplicate_column2, ...)
       column1, column2, ...
FROM table_name
ORDER BY duplicate_column1, duplicate_column2, ..., sort_column;

Example: Retrieve the first EmployeeID for each Email:

SELECT DISTINCT ON (Email)
       EmployeeID, FirstName, LastName, Email, Department
FROM Employees
ORDER BY Email, EmployeeID;

Result:

EmployeeID | FirstName | LastName | Email | Department
4 | Alice | Johnson | alice.j@example.com | IT
2 | Jane | Smith | jane.smith@example.com | Marketing
1 | John | Doe | john.doe@example.com | Sales

Note: The ORDER BY clause determines which row is selected within each group of duplicates.
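
For instance, sorting EmployeeID in descending order keeps the highest EmployeeID per Email instead of the lowest; a hedged variation of the query above (PostgreSQL only):

SELECT DISTINCT ON (Email)
       EmployeeID, FirstName, LastName, Email, Department
FROM Employees
ORDER BY Email, EmployeeID DESC;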

5. Using SELECT DISTINCT with Subqueries

When you need to retrieve specific columns while maintaining the uniqueness based on other columns, subqueries can be effective.

Example: Retrieve all employee details where the Email is unique:

SELECT *
FROM Employees e1
WHERE EmployeeID = (
    SELECT MIN(EmployeeID)
    FROM Employees e2
    WHERE e2.Email = e1.Email
);

Explanation:

  • For each employee, the subquery finds the minimum EmployeeID with the same Email.
  • Only the employee with the smallest EmployeeID for each Email is selected, effectively skipping duplicates.

B. Inserting Data While Skipping Duplicates

When inserting new records, you might want to skip inserting duplicates to maintain data integrity without generating errors.

1. Using INSERT IGNORE (MySQL Specific)

The INSERT IGNORE statement attempts to insert each row but ignores rows that would cause duplicate key violations.

Syntax:

INSERT IGNORE INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...),
       (value3, value4, ...),
       ...;

Example: Insert new employees, skipping those with a duplicate Email (this relies on a UNIQUE constraint or index on Email):

INSERT IGNORE INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
VALUES (6, 'Emily', 'Clark', 'emily.clark@example.com', 'HR'),
       (7, 'John', 'Doe', 'john.doe@example.com', 'Sales');  -- duplicate Email

Result:

  • Employee with EmployeeID 6 is inserted.
  • Employee with EmployeeID 7 is skipped due to duplicate Email.
2. Using ON CONFLICT Clause (PostgreSQL Specific)

PostgreSQL allows handling conflicts using the ON CONFLICT clause, specifying actions like doing nothing or updating existing records.

Syntax:

INSERT INTO table_name (column1, column2, ...)
VALUES (value1, value2, ...)
ON CONFLICT (conflict_column) DO NOTHING;

Example: Insert new employees, skipping those with a duplicate Email (the column named in ON CONFLICT must carry a UNIQUE constraint or index):

INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
VALUES (6, 'Emily', 'Clark', 'emily.clark@example.com', 'HR'),
       (7, 'John', 'Doe', 'john.doe@example.com', 'Sales')
ON CONFLICT (Email) DO NOTHING;

Result:

  • Employee with EmployeeID 6 is inserted.
  • Employee with EmployeeID 7 is skipped due to duplicate Email.
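
Because ON CONFLICT can also update the existing row rather than skip it, an upsert variant is possible too. A minimal sketch, assuming the same Employees table and a UNIQUE constraint on Email:

INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
VALUES (7, 'John', 'Doe', 'john.doe@example.com', 'Sales')
ON CONFLICT (Email) DO UPDATE
SET Department = EXCLUDED.Department;  -- existing row keeps its EmployeeID; only Department changes
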
3. Using MERGE Statement (SQL Server and Oracle)

The MERGE statement allows conditional insertion or updating based on whether a record exists.

Syntax:

MERGE INTO target_table AS target
USING source_table AS source
ON (target.conflict_column = source.conflict_column)
WHEN NOT MATCHED THEN
    INSERT (column1, column2, ...)
    VALUES (source.column1, source.column2, ...);

Example: Insert new employees only if the Email does not already exist (SQL Server syntax shown; Oracle's MERGE uses slightly different alias and source syntax):

MERGE INTO Employees AS target
USING (VALUES
        (6, 'Emily', 'Clark', 'emily.clark@example.com', 'HR'),
        (7, 'John', 'Doe', 'john.doe@example.com', 'Sales')
      ) AS source (EmployeeID, FirstName, LastName, Email, Department)
ON target.Email = source.Email
WHEN NOT MATCHED THEN
    INSERT (EmployeeID, FirstName, LastName, Email, Department)
    VALUES (source.EmployeeID, source.FirstName, source.LastName, source.Email, source.Department);

Result:

  • Employee with EmployeeID 6 is inserted.
  • Employee with EmployeeID 7 is skipped due to duplicate Email.

C. Updating Records While Skipping Duplicates

When updating records, you might need to skip updating certain records to avoid conflicts or maintain data integrity.

1. Using Conditional Updates

Apply conditions in the WHERE clause to skip duplicates.

Example: Update the Department of employee 8 only if no employee already uses the Email new.email@example.com.

UPDATE Employees e1
SET Department = 'Research'
WHERE e1.EmployeeID = 8
  AND NOT EXISTS (
      SELECT 1
      FROM Employees e2
      WHERE e2.Email = 'new.email@example.com'
  );

Explanation:

  • The update occurs only if no employee currently has the Email new.email@example.com. A variant that guards the Email change itself is sketched below.
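
A closely related variant guards the Email change itself, so the update is skipped whenever the new address is already in use. A hedged sketch (note that MySQL typically rejects a subquery that reads the table being updated, so there the subquery would need to be wrapped in a derived table):

UPDATE Employees
SET Email = 'new.email@example.com'
WHERE EmployeeID = 8
  AND NOT EXISTS (
      SELECT 1
      FROM Employees e2
      WHERE e2.Email = 'new.email@example.com'
  );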

3. Removing Duplicate Rows While Keeping One Instance

In scenarios where duplicate rows already exist in the table and you want to remove duplicates while retaining one unique instance, various methods can be employed.

1. Using CTE with ROW_NUMBER()

Example: Remove duplicate Email records, keeping the one with the smallest EmployeeID.

WITH CTE_Duplicates AS (
    SELECT EmployeeID,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM CTE_Duplicates
    WHERE rn > 1
);

Explanation:

  • The CTE assigns a row number to each record partitioned by Email.
  • The DELETE statement removes all records where rn > 1, keeping only the first occurrence.
2. Using Self-Joins

Example: Delete duplicate Email records, keeping the one with the lowest EmployeeID.

DELETE e1
FROM Employees e1
INNER JOIN Employees e2
    ON e1.Email = e2.Email
   AND e1.EmployeeID > e2.EmployeeID;

Explanation:

  • The table joins itself on Email.
  • Records with a higher EmployeeID are deleted, preserving the record with the lowest EmployeeID.
3. Using GROUP BY with Aggregate Functions

Example: Delete duplicates based on Email, keeping the record with the minimum EmployeeID.

DELETE FROM Employees
WHERE EmployeeID NOT IN (
    SELECT MIN(EmployeeID)
    FROM Employees
    GROUP BY Email
);

Explanation:

  • The subquery selects the minimum EmployeeID for each unique Email.
  • The DELETE statement removes records not in this list, effectively removing duplicates.
4. Using Temporary Tables

Example: Create a temporary table with unique records and replace the original table.

-- Step 1: Create a temporary table containing one row per unique record
CREATE TABLE Employees_Temp AS
SELECT MIN(EmployeeID) AS EmployeeID, FirstName, LastName, Email, Department
FROM Employees
GROUP BY FirstName, LastName, Email, Department;

-- Step 2: Drop the original table
DROP TABLE Employees;

-- Step 3: Rename the temporary table to the original name
ALTER TABLE Employees_Temp RENAME TO Employees;

Note: This method is more involved and should be used with caution, especially in production environments; CREATE TABLE ... AS does not copy indexes, constraints, or permissions, so those must be recreated on the new table.


4. Best Practices for Handling Duplicates

  1. Implement Proper Constraints:
    • Primary Keys and Unique Constraints: Enforce uniqueness at the database level to prevent duplicates.
      ALTER TABLE Employees ADD CONSTRAINT UC_Email UNIQUE (Email);
  2. Use Transactions:
    • Group related operations within transactions to maintain atomicity.
      BEGIN TRANSACTION;
      -- operations to remove duplicates
      COMMIT TRANSACTION;
  3. Backup Data:
    • Always back up your database before performing bulk delete operations.
      BACKUP DATABASE YourDatabase TO DISK = 'backup.bak';
  4. Validate Data Before Insertion:
    • Implement checks in your application or database to validate data before inserting it (a portable sketch follows this list).
  5. Regularly Audit and Clean Data:
    • Periodically check for and remove duplicates to maintain data integrity.
  6. Optimize Queries:
    • Use indexes on columns frequently involved in duplicate checks to enhance performance.
      CREATE INDEX idx_email ON Employees (Email);
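
For item 4, one portable way to validate inside the database itself is to guard the insert with NOT EXISTS. A sketch with hypothetical values (some databases, e.g. Oracle and older MySQL versions, require a dummy FROM dual clause in a SELECT without a table):

INSERT INTO Employees (EmployeeID, FirstName, LastName, Email, Department)
SELECT 8, 'Raj', 'Patel', 'raj.patel@example.com', 'Finance'
WHERE NOT EXISTS (
    SELECT 1 FROM Employees WHERE Email = 'raj.patel@example.com'
);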

5. Considerations and Caveats

  • Performance Impact:
    • Methods involving ROW_NUMBER() and CTEs can be resource-intensive on large datasets. Ensure that appropriate indexes are in place to optimize performance.
  • Database-Specific Features:
    • Some SQL dialects offer specific functions or clauses (e.g., DISTINCT ON in PostgreSQL) that can simplify duplicate handling. Familiarize yourself with your DBMS's capabilities.
  • Data Integrity:
    • Ensure that removing duplicates doesn't inadvertently delete essential data. Always review which records will be removed; one way to do this is sketched below.
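
A simple safeguard is to run the SELECT form of a cleanup statement first and inspect exactly which rows it would remove; for the ROW_NUMBER() approach used earlier, a sketch looks like:

WITH CTE_Duplicates AS (
    SELECT EmployeeID, Email,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
SELECT *
FROM CTE_Duplicates
WHERE rn > 1;  -- these are the rows a DELETE ... WHERE rn > 1 would remove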

6. Example Across Different SQL Dialects

A. SQL Server

Using CTE with ROW_NUMBER():

WITH CTE_Duplicates AS (
    SELECT EmployeeID,
           ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
    FROM Employees
)
DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM CTE_Duplicates
    WHERE rn > 1
);

B. MySQL

Using Self-Joins:

DELETE e1
FROM Employees e1
INNER JOIN Employees e2
    ON e1.Email = e2.Email
   AND e1.EmployeeID > e2.EmployeeID;

Note: In MySQL versions prior to 8.0, window functions like ROW_NUMBER() are not supported, making self-joins a viable option.
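
In MySQL 8.0 and later, ROW_NUMBER() is available, so a window-function cleanup works there as well. A hedged sketch; the ranking query is wrapped in a derived table because MySQL does not allow a DELETE to read the target table directly in a subquery:

DELETE FROM Employees
WHERE EmployeeID IN (
    SELECT EmployeeID
    FROM (
        SELECT EmployeeID,
               ROW_NUMBER() OVER (PARTITION BY Email ORDER BY EmployeeID) AS rn
        FROM Employees
    ) AS ranked
    WHERE rn > 1
);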

C. PostgreSQL

Using DISTINCT ON:

CREATE TABLE Employees_Temp AS
SELECT DISTINCT ON (Email) *
FROM Employees
ORDER BY Email, EmployeeID;

DROP TABLE Employees;

ALTER TABLE Employees_Temp RENAME TO Employees;

7. Conclusion

Skipping duplicate records in SQL is vital for maintaining a clean, efficient, and reliable database. Whether you're retrieving unique data, inserting new records without causing duplicates, or cleaning existing data, understanding and utilizing the appropriate methods ensures data integrity and optimal performance.

Key Takeaways:

  • Use DISTINCT and GROUP BY to retrieve unique records based on specific columns.
  • Employ Window Functions like ROW_NUMBER() for advanced duplicate handling.
  • Leverage Database-Specific Features such as DISTINCT ON in PostgreSQL for streamlined operations.
  • Implement Preventive Measures including primary keys, unique constraints, and proper data validation to avoid duplicates.
  • Always Backup and Test before performing bulk delete operations to safeguard against accidental data loss.

By mastering these techniques and adhering to best practices, you can effectively manage duplicate records in your SQL databases, ensuring data accuracy and system reliability.
