187. Repeated DNA Sequences - Detailed Explanation
Problem Statement
You are given a string s
that represents a DNA sequence, where each character is one of 'A'
, 'C'
, 'G'
, or 'T'
. Your task is to find all the 10-letter-long sequences (substrings) that occur more than once in this DNA molecule. You can return the answer in any order.
Example 1
Input: "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Explanation:
The substring "AAAAACCCCC"
appears twice, and the substring "CCCCCAAAAA"
appears twice. These are the 10-letter sequences that repeat.
Example 2
Input: "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Explanation:
The 10-letter substring "AAAAAAAAAA"
appears multiple times in overlapping fashion.
Constraints:
- (1 \leq s.length \leq 10^5)
s[i]
is one of'A'
,'C'
,'G'
, or'T'
.
Hints
-
Sliding Window: Think about how you can use a sliding window of length 10 to iterate over every 10-letter substring of the input string.
-
Hashing/Counting: Consider using a hash set or a hash map (dictionary) to count how many times each 10-letter substring appears as you slide over the string.
Approach 1: Hash Map Counting
Explanation
In this straightforward approach, you slide a window of size 10 over the string and extract every substring of length 10. For each substring, you use a hash map (or dictionary) to count its occurrences.
Step-by-Step:
-
Initialize Data Structures:
- A dictionary (or hash map) to store each 10-letter substring and its count.
- A list to collect the substrings that have been seen more than once.
-
Sliding Window:
- Iterate from the start of the string until ( \text{len}(s) - 10 + 1 ).
- For each position, extract the substring of length 10.
-
Counting:
- If the substring is not in the dictionary, add it with a count of 1.
- If it is already present and its count becomes 2 (indicating it has been seen more than once), add it to the result list.
-
Return the Result:
- Return the list of repeated 10-letter sequences.
Complexity Analysis
-
Time Complexity: (O(n)) — You make one pass over the string using a sliding window, where (n) is the length of the string.
-
Space Complexity: (O(n)) — In the worst case, you might store almost every substring of length 10 in the dictionary.
Approach 2: Bit Manipulation (Rolling Hash)
Explanation
Since the DNA sequence contains only 4 characters, you can map each character to 2 bits (for example, A → 00, C → 01, G → 10, T → 11). This allows you to represent each 10-letter sequence as a 20-bit integer, which can significantly improve the efficiency of comparing and hashing these sequences.
Step-by-Step:
-
Mapping:
- Create a mapping from character to its bit representation.
-
Initial Hash Calculation:
- Calculate the integer value for the first 10-letter substring.
-
Rolling Hash:
- As you slide the window, update the hash value by:
- Removing the contribution of the leftmost character.
- Adding the contribution of the new rightmost character.
- Use bitwise operations to update the hash value efficiently.
- As you slide the window, update the hash value by:
-
Counting:
- Use a hash map or two sets to record sequences that have been seen and those that are repeated.
-
Return the Result:
- Convert the integer representations back to the substring if needed or track the original substring along with the hash.
Complexity Analysis
-
Time Complexity: (O(n)) — Each substring is processed in constant time using bitwise operations.
-
Space Complexity: (O(n)) — You need additional space for storing hash values or seen sequences.
Code Implementations
Python Code
Java Code
Common Mistakes and Edge Cases
-
Edge Case – String Length Less Than 10:
Always check if the string length is less than 10. In such cases, there can be no 10-letter-long substring. -
Overlapping Sequences:
Remember that the substrings may overlap. For example, in"AAAAAAAAAAAAA"
, the repeated sequence"AAAAAAAAAA"
overlaps and should still be counted. -
Incorrect Data Structures:
Using inefficient data structures may lead to timeouts or high memory usage, especially when processing long DNA strings.
Alternative Variations
-
Different Substring Lengths:
A variation of the problem could ask for repeated sequences of a different fixed length. -
Finding All Anagrams:
Instead of repeated sequences, you might be asked to find all starting indices of substrings that are anagrams of a given pattern. -
Generalized Pattern Matching:
Another variation might involve more complex pattern matching within the string rather than fixed-length substrings.
Related Problems
GET YOUR FREE
Coding Questions Catalog