187. Repeated DNA Sequences - Detailed Explanation

Problem Statement

You are given a string s that represents a DNA sequence, where each character is one of 'A', 'C', 'G', or 'T'. Your task is to find all the 10-letter-long sequences (substrings) that occur more than once in this DNA molecule. You can return the answer in any order.

Example 1
Input: "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Explanation:
The substring "AAAAACCCCC" appears twice, and the substring "CCCCCAAAAA" appears twice. These are the 10-letter sequences that repeat.

Example 2
Input: "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Explanation:
The 10-letter substring "AAAAAAAAAA" appears multiple times in overlapping fashion.

Constraints:

  • (1 \leq s.length \leq 10^5)
  • s[i] is one of 'A', 'C', 'G', or 'T'.

Hints

  1. Sliding Window: Think about how you can use a sliding window of length 10 to iterate over every 10-letter substring of the input string.

  2. Hashing/Counting: Consider using a hash set or a hash map (dictionary) to count how many times each 10-letter substring appears as you slide over the string.

Approach 1: Hash Map Counting

Explanation

In this straightforward approach, you slide a window of size 10 over the string and extract every substring of length 10. For each substring, you use a hash map (or dictionary) to count its occurrences.

Step-by-Step:

  1. Initialize Data Structures:

    • A dictionary (or hash map) to store each 10-letter substring and its count.
    • A list to collect the substrings that have been seen more than once.
  2. Sliding Window:

    • Iterate from the start of the string until ( \text{len}(s) - 10 + 1 ).
    • For each position, extract the substring of length 10.
  3. Counting:

    • If the substring is not in the dictionary, add it with a count of 1.
    • If it is already present and its count becomes 2 (indicating it has been seen more than once), add it to the result list.
  4. Return the Result:

    • Return the list of repeated 10-letter sequences.

Complexity Analysis

  • Time Complexity: (O(n)) — You make one pass over the string using a sliding window, where (n) is the length of the string.

  • Space Complexity: (O(n)) — In the worst case, you might store almost every substring of length 10 in the dictionary.

Approach 2: Bit Manipulation (Rolling Hash)

Explanation

Since the DNA sequence contains only 4 characters, you can map each character to 2 bits (for example, A → 00, C → 01, G → 10, T → 11). This allows you to represent each 10-letter sequence as a 20-bit integer, which can significantly improve the efficiency of comparing and hashing these sequences.

Step-by-Step:

  1. Mapping:

    • Create a mapping from character to its bit representation.
  2. Initial Hash Calculation:

    • Calculate the integer value for the first 10-letter substring.
  3. Rolling Hash:

    • As you slide the window, update the hash value by:
      • Removing the contribution of the leftmost character.
      • Adding the contribution of the new rightmost character.
    • Use bitwise operations to update the hash value efficiently.
  4. Counting:

    • Use a hash map or two sets to record sequences that have been seen and those that are repeated.
  5. Return the Result:

    • Convert the integer representations back to the substring if needed or track the original substring along with the hash.

Complexity Analysis

  • Time Complexity: (O(n)) — Each substring is processed in constant time using bitwise operations.

  • Space Complexity: (O(n)) — You need additional space for storing hash values or seen sequences.

Code Implementations

Python Code

Python3
Python3

. . . .

Java Code

Java
Java

. . . .

Common Mistakes and Edge Cases

  • Edge Case – String Length Less Than 10:
    Always check if the string length is less than 10. In such cases, there can be no 10-letter-long substring.

  • Overlapping Sequences:
    Remember that the substrings may overlap. For example, in "AAAAAAAAAAAAA", the repeated sequence "AAAAAAAAAA" overlaps and should still be counted.

  • Incorrect Data Structures:
    Using inefficient data structures may lead to timeouts or high memory usage, especially when processing long DNA strings.

Alternative Variations

  • Different Substring Lengths:
    A variation of the problem could ask for repeated sequences of a different fixed length.

  • Finding All Anagrams:
    Instead of repeated sequences, you might be asked to find all starting indices of substrings that are anagrams of a given pattern.

  • Generalized Pattern Matching:
    Another variation might involve more complex pattern matching within the string rather than fixed-length substrings.

TAGS
leetcode
CONTRIBUTOR
Design Gurus Team
-

GET YOUR FREE

Coding Questions Catalog

Design Gurus Newsletter - Latest from our Blog
Boost your coding skills with our essential coding questions catalog.
Take a step towards a better tech career now!
Explore Answers
Related Courses
Grokking the Coding Interview: Patterns for Coding Questions
Grokking the Coding Interview Patterns in Java, Python, JS, C++, C#, and Go. The most comprehensive course with 476 Lessons.
4.6
(69,299 learners)
New
Grokking Modern AI Fundamentals
Master the fundamentals of AI today to lead the tech revolution of tomorrow.
3.9
(1,107 learners)
Grokking Data Structures & Algorithms for Coding Interviews
Unlock Coding Interview Success: Dive Deep into Data Structures and Algorithms.
4
(26,683 learners)
Image
One-Stop Portal For Tech Interviews.
Copyright © 2025 Design Gurus, LLC. All rights reserved.