187. Repeated DNA Sequences - Detailed Explanation

Problem Statement

You are given a string s that represents a DNA sequence, where each character is one of 'A', 'C', 'G', or 'T'. Your task is to find all the 10-letter-long sequences (substrings) that occur more than once in this DNA molecule. You can return the answer in any order.

Example 1
Input: "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Explanation:
The substring "AAAAACCCCC" appears twice, and the substring "CCCCCAAAAA" appears twice. These are the 10-letter sequences that repeat.

Example 2
Input: "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Explanation:
The 10-letter substring "AAAAAAAAAA" appears multiple times in overlapping fashion.

Constraints:

$(1 \leq s.length \leq 10^5)$
s[i] is one of 'A', 'C', 'G', or 'T'.

Hints

Sliding Window: Think about how you can use a sliding window of length 10 to iterate over every 10-letter substring of the input string.
Hashing/Counting: Consider using a hash set or a hash map (dictionary) to count how many times each 10-letter substring appears as you slide over the string.

Approach 1: Hash Map Counting

Explanation

In this straightforward approach, you slide a window of size 10 over the string and extract every substring of length 10. For each substring, you use a hash map (or dictionary) to count its occurrences.

Step-by-Step:

Initialize Data Structures:
- A dictionary (or hash map) to store each 10-letter substring and its count.
- A list to collect the substrings that have been seen more than once.
Sliding Window:
- Iterate from the start of the string until $( \text{len}(s) - 10 + 1 )$ .
- For each position, extract the substring of length 10.
Counting:
- If the substring is not in the dictionary, add it with a count of 1.
- If it is already present and its count becomes 2 (indicating it has been seen more than once), add it to the result list.
Return the Result:
- Return the list of repeated 10-letter sequences.

Complexity Analysis

Time Complexity: (O(n)) — You make one pass over the string using a sliding window, where (n) is the length of the string.
Space Complexity: (O(n)) — In the worst case, you might store almost every substring of length 10 in the dictionary.

Approach 2: Bit Manipulation (Rolling Hash)

Explanation

Since the DNA sequence contains only 4 characters, you can map each character to 2 bits (for example, A → 00, C → 01, G → 10, T → 11). This allows you to represent each 10-letter sequence as a 20-bit integer, which can significantly improve the efficiency of comparing and hashing these sequences.

Step-by-Step:

Mapping:
- Create a mapping from character to its bit representation.
Initial Hash Calculation:
- Calculate the integer value for the first 10-letter substring.
Rolling Hash:
- As you slide the window, update the hash value by:
  - Removing the contribution of the leftmost character.
  - Adding the contribution of the new rightmost character.
- Use bitwise operations to update the hash value efficiently.
Counting:
- Use a hash map or two sets to record sequences that have been seen and those that are repeated.
Return the Result:
- Convert the integer representations back to the substring if needed or track the original substring along with the hash.

Complexity Analysis

Time Complexity: (O(n)) — Each substring is processed in constant time using bitwise operations.
Space Complexity: (O(n)) — You need additional space for storing hash values or seen sequences.

Code Implementations

Python Code

Python3

. . . .

Java Code

Java

. . . .

Common Mistakes and Edge Cases

Edge Case – String Length Less Than 10:
Always check if the string length is less than 10. In such cases, there can be no 10-letter-long substring.
Overlapping Sequences:
Remember that the substrings may overlap. For example, in "AAAAAAAAAAAAA", the repeated sequence "AAAAAAAAAA" overlaps and should still be counted.
Incorrect Data Structures:
Using inefficient data structures may lead to timeouts or high memory usage, especially when processing long DNA strings.

Alternative Variations

Different Substring Lengths:
A variation of the problem could ask for repeated sequences of a different fixed length.
Finding All Anagrams:
Instead of repeated sequences, you might be asked to find all starting indices of substrings that are anagrams of a given pattern.
Generalized Pattern Matching:
Another variation might involve more complex pattern matching within the string rather than fixed-length substrings.

Find All Anagrams in a String (LeetCode 438)

187. Repeated DNA Sequences - Detailed Explanation

Problem Statement

Hints

Approach 1: Hash Map Counting

Explanation

Step-by-Step:

Complexity Analysis

Approach 2: Bit Manipulation (Rolling Hash)

Explanation

Step-by-Step:

Complexity Analysis

Code Implementations

Python Code

Java Code

Common Mistakes and Edge Cases

Alternative Variations

Related Problems