187. Repeated DNA Sequences - Detailed Explanation

Free Coding Questions Catalog
Boost your coding skills with our essential coding questions catalog. Take a step towards a better tech career now!

Problem Statement

You are given a string s that represents a DNA sequence, where each character is one of 'A', 'C', 'G', or 'T'. Your task is to find all the 10-letter-long sequences (substrings) that occur more than once in this DNA molecule. You can return the answer in any order.

Example 1
Input: "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC", "CCCCCAAAAA"]
Explanation:
The substring "AAAAACCCCC" appears twice, and the substring "CCCCCAAAAA" appears twice. These are the 10-letter sequences that repeat.

Example 2
Input: "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Explanation:
The 10-letter substring "AAAAAAAAAA" appears multiple times in overlapping fashion.

Constraints:

  • (1 \leq s.length \leq 10^5)
  • s[i] is one of 'A', 'C', 'G', or 'T'.

Hints

  1. Sliding Window: Think about how you can use a sliding window of length 10 to iterate over every 10-letter substring of the input string.

  2. Hashing/Counting: Consider using a hash set or a hash map (dictionary) to count how many times each 10-letter substring appears as you slide over the string.

Approach 1: Hash Map Counting

Explanation

In this straightforward approach, you slide a window of size 10 over the string and extract every substring of length 10. For each substring, you use a hash map (or dictionary) to count its occurrences.

Step-by-Step:

  1. Initialize Data Structures:

    • A dictionary (or hash map) to store each 10-letter substring and its count.
    • A list to collect the substrings that have been seen more than once.
  2. Sliding Window:

    • Iterate from the start of the string until ( \text{len}(s) - 10 + 1 ).
    • For each position, extract the substring of length 10.
  3. Counting:

    • If the substring is not in the dictionary, add it with a count of 1.
    • If it is already present and its count becomes 2 (indicating it has been seen more than once), add it to the result list.
  4. Return the Result:

    • Return the list of repeated 10-letter sequences.

Complexity Analysis

  • Time Complexity: (O(n)) — You make one pass over the string using a sliding window, where (n) is the length of the string.

  • Space Complexity: (O(n)) — In the worst case, you might store almost every substring of length 10 in the dictionary.

Approach 2: Bit Manipulation (Rolling Hash)

Explanation

Since the DNA sequence contains only 4 characters, you can map each character to 2 bits (for example, A → 00, C → 01, G → 10, T → 11). This allows you to represent each 10-letter sequence as a 20-bit integer, which can significantly improve the efficiency of comparing and hashing these sequences.

Step-by-Step:

  1. Mapping:

    • Create a mapping from character to its bit representation.
  2. Initial Hash Calculation:

    • Calculate the integer value for the first 10-letter substring.
  3. Rolling Hash:

    • As you slide the window, update the hash value by:
      • Removing the contribution of the leftmost character.
      • Adding the contribution of the new rightmost character.
    • Use bitwise operations to update the hash value efficiently.
  4. Counting:

    • Use a hash map or two sets to record sequences that have been seen and those that are repeated.
  5. Return the Result:

    • Convert the integer representations back to the substring if needed or track the original substring along with the hash.

Complexity Analysis

  • Time Complexity: (O(n)) — Each substring is processed in constant time using bitwise operations.

  • Space Complexity: (O(n)) — You need additional space for storing hash values or seen sequences.

Code Implementations

Python Code

Python3
Python3

. . . .

Java Code

Java
Java

. . . .

Common Mistakes and Edge Cases

  • Edge Case – String Length Less Than 10:
    Always check if the string length is less than 10. In such cases, there can be no 10-letter-long substring.

  • Overlapping Sequences:
    Remember that the substrings may overlap. For example, in "AAAAAAAAAAAAA", the repeated sequence "AAAAAAAAAA" overlaps and should still be counted.

  • Incorrect Data Structures:
    Using inefficient data structures may lead to timeouts or high memory usage, especially when processing long DNA strings.

Alternative Variations

  • Different Substring Lengths:
    A variation of the problem could ask for repeated sequences of a different fixed length.

  • Finding All Anagrams:
    Instead of repeated sequences, you might be asked to find all starting indices of substrings that are anagrams of a given pattern.

  • Generalized Pattern Matching:
    Another variation might involve more complex pattern matching within the string rather than fixed-length substrings.

TAGS
leetcode
CONTRIBUTOR
Design Gurus Team
-

GET YOUR FREE

Coding Questions Catalog

Design Gurus Newsletter - Latest from our Blog
Boost your coding skills with our essential coding questions catalog.
Take a step towards a better tech career now!
Explore Answers
Related Courses
Grokking the Coding Interview: Patterns for Coding Questions
Grokking the Coding Interview Patterns in Java, Python, JS, C++, C#, and Go. The most comprehensive course with 476 Lessons.
Grokking Modern AI Fundamentals
Master the fundamentals of AI today to lead the tech revolution of tomorrow.
Grokking Data Structures & Algorithms for Coding Interviews
Unlock Coding Interview Success: Dive Deep into Data Structures and Algorithms.
Image
One-Stop Portal For Tech Interviews.
Copyright © 2025 Design Gurus, LLC. All rights reserved.
;