287. Find the Duplicate Number - Detailed Explanation

Problem Statement

Given an array of integers nums containing n + 1 elements where each element is an integer in the range [1, n] inclusive, find the single integer that is repeated in the array. There is exactly one distinct number that appears more than once (it could appear twice or multiple times), and we need to return that duplicate number.

  • Goal: Identify the duplicate number in the array.
  • Important: Solve it without modifying the input array and using only constant extra space (O(1) space).

Example 1:

  • Input: nums = [1, 3, 4, 2, 2]
  • Output: 2
  • Explanation: The array has numbers from 1 to 4. The number 2 appears twice, so 2 is the duplicate.

Example 2:

  • Input: nums = [3, 1, 3, 4, 2]
  • Output: 3
  • Explanation: The numbers are in the range 1 to 4. The number 3 appears twice (positions 0 and 2). Hence, 3 is the duplicate.

Example 3:

  • Input: nums = [1, 1]
  • Output: 1
  • Explanation: With only two elements (n=1, array length 2), the number 1 appears twice, making it the duplicate.

Constraints:

  • 1 <= n <= 100,000 (The array contains n+1 elements, so length is at most 100,001).
  • nums.length == n + 1 (array length is one more than the maximum number value).
  • 1 <= nums[i] <= n for all valid indices i (every value is between 1 and n).
  • All integers in nums appear only once except for one integer which appears two or more times.
  • You must find a solution that does not modify the array and uses O(1) extra space (only a constant amount of additional memory).

These constraints imply that by the Pigeonhole Principle, at least one number is repeated. We have to find that number efficiently.

Hints

  • Hint 1: Brute force or naive methods (like checking every pair of elements) will work for small arrays but are too slow for large n. Think about what the constraints (up to 100k elements) imply for the time complexity of your solution.

  • Hint 2: If modifying the array or using extra memory were allowed, the problem becomes easier (sorting or using a hash set would quickly find the duplicate). However, since we cannot modify the array and want to use constant extra space, consider alternative strategies.

  • Hint 3: The range of values is 1 to n for an array of length n+1. This relationship is key. Can you leverage this to find the duplicate? For example, is there a way to count or compare the distribution of numbers to narrow down the search?

  • Hint 4: One efficient approach involves using two pointers moving through the array at different speeds (this is known as Floyd’s Tortoise and Hare algorithm). This technique is usually used to detect cycles in linked lists, but it can be applied here by mapping the array’s indices and values to a linked structure.

  • Hint 5: Another clever approach is to use binary search on the range of values rather than the indices. Think about how you could check a guess for the duplicate by counting elements and adjust the guess based on the count.

Use these hints to brainstorm possible solutions before reading the detailed approaches below.

Approaches

We will discuss multiple approaches to solve the problem, starting from the simplest (but not always feasible) methods and moving to more optimal ones. For each approach, we outline the logic, walk through a step-by-step example, sketch a Python implementation, and analyze the time and space complexity.

1. Brute Force (Naive Approach)

Logic:
The most straightforward way to find a duplicate is to compare every element with every other element and check for equality. If any pair of elements are equal, we've found a duplicate. Since the problem guarantees there is exactly one repeated number, we can stop when we find it.

Step-by-Step Explanation:
Imagine checking each number against all others:

  • For each index i from 0 to n, loop through indices j from i+1 to n and compare nums[i] with nums[j].
  • If nums[i] == nums[j] at any point, then nums[i] (or nums[j]) is the duplicate.

Visual Example (Brute Force on [1, 3, 4, 2, 2]):

  • Compare index 0 (value 1) with indices 1 (3), 2 (4), 3 (2), and 4 (2): the value 1 matches none of them, so no duplicate is found involving element 1.

  • Compare index 1 (value 3) with 2(4), 3(2), 4(2): no match until end (3 vs 4, 3 vs 2, 3 vs 2 all different).

  • Compare index 2 (value 4) with 3(2), 4(2): no match (4 vs 2, 4 vs 2).

  • Compare index 3 (value 2) with 4 (value 2): this time we find a match (2 == 2). Found the duplicate: 2.

This method will eventually find the duplicate (in this case, when comparing the two 2's). However, it is very slow for large arrays because it potentially checks every possible pair.

Complexity: This approach runs in O(n²) time in the worst case (double loop over the array), which is not feasible when n is large (e.g., 100k would mean ~10^10 comparisons). The space complexity is O(1) extra space (just a couple of loop counters).

Python Implementation (Brute Force):

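A minimal Python sketch of this approach (the function name find_duplicate_brute_force is our own choice):

def find_duplicate_brute_force(nums):
    # Compare every pair of elements; O(n^2) time, O(1) extra space.
    n = len(nums)
    for i in range(n):
        for j in range(i + 1, n):
            if nums[i] == nums[j]:
                return nums[i]
    return -1  # unreachable: the problem guarantees a duplicate exists

print(find_duplicate_brute_force([1, 3, 4, 2, 2]))  # prints 2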

Analysis: Brute force is easy to understand but impractical for large inputs. It does not meet the efficiency requirements for big n. We include it here as a starting point and for completeness, but we need better solutions for real use.

2. Sorting the Array

Logic:
If we sort the array, any duplicates will become adjacent to each other. Then, we can simply scan through the sorted array and look for consecutive elements that are the same.

Steps:

  1. Create a sorted copy of the array (to avoid modifying the original array, since that's not allowed by the problem). Sorting takes O(n log n) time.

  2. Traverse the sorted array and check for any consecutive equal elements.

  3. The first time you find sorted[i] == sorted[i+1], that's the duplicate number.

Visual Example (Sorting [3, 1, 3, 4, 2]):

Original array: [3, 1, 3, 4, 2]
Sorted array: [1, 2, 3, 3, 4]
Now scan sorted array:

  • 1 vs 2 (different)
  • 2 vs 3 (different)
  • 3 vs 3 (same!) → Found duplicate 3.
  • (No need to continue after finding the duplicate, but if we did, 3 vs 4 would be different)

Complexity: Sorting takes O(n log n) time. Scanning for duplicates after sorting is O(n).

Overall time complexity: O(n log n). Space complexity depends on how sorting is implemented:

  • If we sort in place, we use O(1) extra space (ignoring the sort's internal stack), but we are modifying the array, which the problem disallows.
  • If we create a copy to sort, it uses O(n) extra space for the copy.

In an interview, you'd mention that sorting breaks the "no modifying array" rule (if done in place) and uses more space if not in place. But it's a valid approach to understand.

Python Implementation (Sorting):

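A minimal Python sketch of this approach (the function name is our own; sorted() returns a new list, so the input array is left unmodified):

def find_duplicate_sorted(nums):
    # Sort a copy, then scan adjacent pairs; O(n log n) time, O(n) extra space.
    sorted_nums = sorted(nums)
    for i in range(len(sorted_nums) - 1):
        if sorted_nums[i] == sorted_nums[i + 1]:
            return sorted_nums[i]
    return -1  # unreachable: the problem guarantees a duplicate exists

print(find_duplicate_sorted([3, 1, 3, 4, 2]))  # prints 3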

Explanation: For the array [3, 1, 3, 4, 2] (n=4, so 5 elements), the sorted copy is [1, 2, 3, 3, 4], and the function returns 3 as the duplicate.

Analysis: Sorting is more efficient than brute force (O(n log n) vs O(n²)) and is straightforward. However, it may violate problem constraints (by modifying the array or using extra space) and is not the optimal solution. We can do better in terms of time (there is an O(n) solution) and space.

3. Using a HashSet (or extra memory)

Logic:
We can leverage a data structure to track seen numbers as we iterate. A HashSet (or any set) is ideal because checking for membership and adding elements are on average O(1) operations. As we go through the array:

  • For each number, check if it's already in the set.
  • If it is, we've found the duplicate.
  • If not, add it to the set and continue.

This approach does not modify the original array. The trade-off is it uses O(n) extra space for the set, which violates the strict constant space requirement but is acceptable if memory usage is not a big concern.

Step-by-Step Example (HashSet on [1, 3, 4, 2, 2]):

Initialize an empty set seen = {}.
Traverse the array:

  • Read 1: not in seen, add it. (seen becomes {1})
  • Read 3: not in seen, add it. (seen becomes {1, 3})
  • Read 4: not in seen, add it. (seen becomes {1, 3, 4})
  • Read 2: not in seen, add it. (seen becomes {1, 2, 3, 4})
  • Read the next 2: this time, 2 is already in seen! This means we've seen 2 before, so 2 is the duplicate. Return 2.

We found the duplicate when we tried to add a number that's already in our set.

Complexity: Checking and adding in a hash set is O(1) on average, so going through n elements is O(n) time.

Space complexity is O(n) for the hash set in the worst case (if the duplicate is at the very end, the set grows to size ~n).

Python Implementation (HashSet):

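A minimal Python sketch of this approach (the function name is our own):

def find_duplicate_hashset(nums):
    # Track seen values in a set; O(n) time, O(n) extra space in the worst case.
    seen = set()
    for num in nums:
        if num in seen:
            return num  # first value encountered a second time
        seen.add(num)
    return -1  # unreachable: the problem guarantees a duplicate exists

print(find_duplicate_hashset([1, 3, 4, 2, 2]))  # prints 2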

Example: For the array [1, 3, 4, 2, 2], the function returns 2: when it reaches the second 2, that value is already in the set, indicating 2 is the duplicate.

Analysis: The hash set approach is very straightforward and efficient in time (O(n)). The downside is using extra memory proportional to n, which might be an issue for extremely large arrays or if the problem explicitly forbids using O(n) space. Nonetheless, this method is often acceptable in practice when memory isn’t tight, and it’s a great initial solution to implement in an interview before discussing more optimal solutions.

4. Binary Search on the Value Range

Logic:
This approach uses a clever application of binary search on the range of possible values (1 to n), rather than on indices or sorting the array. The idea is based on the Pigeonhole Principle:

  • If there are more than mid numbers in the array that are <= mid, then the duplicate must be in the range 1...mid. Otherwise, the duplicate is in the range mid+1...n.

Here's how it works:

  1. Set low = 1 and high = n (the range of values).
  2. While low < high:
    • Compute mid = (low + high) // 2 (integer division).
    • Count how many numbers in nums are <= mid.
    • If the count is greater than mid (meaning there are too many numbers in the lower half range), then the duplicate is in [low, mid]. So, set high = mid.
    • Otherwise, the duplicate lies in [mid+1, high]. So, set low = mid + 1.
  3. When low == high, the search has converged on the duplicate number.

Why does this work?
In a perfect scenario with no duplicates, the count of numbers <= X should be at most X. For example, if n=8 and no duplicates, the count of numbers <=4 would be 4 (at most). If there's a duplicate, this count skews higher for the half containing the duplicate because that range ends up with an extra occurrence. Essentially, we are narrowing down the possible value of the duplicate by seeing which half of the range has an "excess" count.

Step-by-Step Example (Binary Search on [1, 3, 4, 2, 2]):

  • Initial range of values = [1, 4].
    • low = 1, high = 4.
  • First iteration:
    • mid = (1+4)//2 = 2.
    • Count numbers <= 2 in the array: The array is [1,3,4,2,2]. Numbers <= 2 are [1,2,2] (there are 3 of them).
    • Compare count (3) with mid (2): 3 > 2, which means there are more numbers in 1..2 than there should be if all were unique. This indicates the duplicate is in the range [1..2].
    • So, set high = 2.
  • Second iteration:
    • Now low = 1, high = 2.
    • mid = (1+2)//2 = 1.
    • Count numbers <= 1 in the array: [1,3,4,2,2] → only [1] (count is 1).
    • Compare count (1) with mid (1): 1 == 1, no excess in [1..1]. So the duplicate must be in the other half.
    • Set low = mid + 1 = 2.
  • Now low = 2, high = 2, so the loop ends. low (or high) is 2, which is our duplicate.

We found the duplicate value 2 without ever explicitly comparing pairwise or modifying the array – just by using counts. Notice that we never needed to know which index the duplicate is at, only that it exists in the value range.

Complexity: Each iteration of the binary search loop counts the elements <= mid which takes O(n) time. How many iterations at most? The value range is from 1 to n, so binary search will take O(log n) iterations. Hence, the overall time complexity is O(n log n). The space complexity is O(1) additional, since we only use a few counters and variables.

Even though O(n log n) is slower than O(n), it can handle n up to 100k easily (log2(100k) ≈ 17, so roughly 17 * 100k operations, which is fine). This method meets the requirements of not modifying the array and using constant extra space.

Python Implementation (Binary Search on range):

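A minimal Python sketch of binary search over the value range (the function name is our own):

def find_duplicate_binary_search(nums):
    # Values lie in 1..n, where n = len(nums) - 1.
    low, high = 1, len(nums) - 1
    while low < high:
        mid = (low + high) // 2
        count = sum(1 for num in nums if num <= mid)  # O(n) per iteration
        if count > mid:
            high = mid       # excess in [low, mid]: the duplicate is there
        else:
            low = mid + 1    # otherwise it lies in [mid + 1, high]
    return low  # O(n log n) time overall, O(1) extra space

print(find_duplicate_binary_search([1, 3, 4, 2, 2]))  # prints 2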

Explanation: For the array [1, 3, 4, 2, 2], the function counts numbers <= mid for mid=2 and then mid=1, as shown in the walkthrough above, narrowing the range down to the duplicate 2.

Analysis: The binary search approach is efficient and meets the problem's constraints (no modification, O(1) extra space). It's a bit less intuitive than using a set or sorting, but it's a great technique to know. It leverages the properties of a sorted range of values without actually sorting the array itself. This approach is also known as using a pigeonhole count or binary search on the value domain.

5. Floyd’s Tortoise and Hare (Cycle Detection)

Logic:
This is the most elegant and optimal solution, which uses the concept of cycle detection in linked lists. It might not be immediately obvious how a cycle is involved here, but let's break it down:

  • Because the array has n+1 elements and each element is in the range 1..n, if you interpret the array as a directed graph (or linked structure) where each index points to the index given by the value at that position, you must have a cycle.

  • Think of it this way: Take f(i) = nums[i] as a "next" pointer. Start at index 0 and keep following the pointers: 0 -> nums[0] -> nums[nums[0]] -> .... Since there are finite indices and we have an extra element, eventually two indices must point to the same index (pigeonhole principle again in graph form), causing a loop.

  • The "cycle" in this graph is caused by the duplicate number. Specifically, the duplicate number is the entry point of the cycle because it’s the value that is referenced by more than one index.

Floyd’s Tortoise and Hare algorithm finds a cycle's entry point in a linked list using two pointers:

  1. Phase 1 (Finding intersection): Use two pointers (tortoise and hare). Initialize both to the start (index 0). Move tortoise one step at a time, hare two steps at a time, until they meet inside the cycle.
  2. Phase 2 (Finding entrance to cycle): Once an intersection is found, reset one pointer to start (index 0). Then move both pointers one step at a time. They will meet at the entrance of the cycle, which corresponds to the duplicate number.

Why it works: In the context of a linked list, Floyd’s algorithm finds the start of a cycle (loop) without extra space. Here, treating nums as a function from index to next index, a cycle exists and the duplicate value is the one that creates the cycle. The math behind Floyd's algorithm guarantees that moving one pointer from start and one from the intersection point eventually converges at the cycle start.

Python Implementation (Floyd’s Tortoise and Hare):

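A minimal Python sketch of this approach (the function name is our own; both pointers start at nums[0], as discussed in Common Mistakes below):

def find_duplicate_floyd(nums):
    # Phase 1: advance tortoise one step and hare two steps until they meet.
    tortoise = hare = nums[0]
    while True:
        tortoise = nums[tortoise]
        hare = nums[nums[hare]]
        if tortoise == hare:
            break
    # Phase 2: reset one pointer to the start; both now move one step at a
    # time and meet at the cycle entrance, which is the duplicate value.
    tortoise = nums[0]
    while tortoise != hare:
        tortoise = nums[tortoise]
        hare = nums[hare]
    return hare  # O(n) time, O(1) extra space, array untouched

print(find_duplicate_floyd([3, 1, 3, 4, 2]))  # prints 3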

Example: For the array [3, 1, 3, 4, 2], the two pointers eventually meet at the value 3, which is the duplicate, so the function returns 3. Similarly, for [1, 3, 4, 2, 2], they meet at 2.

Analysis: This approach runs in O(n) time and uses only O(1) extra space, fulfilling the tough constraints. Floyd’s algorithm is initially non-intuitive for array problems, but it’s a brilliant trick for situations where a "cycle" is formed due to a pigeonhole condition (like this problem). Once recognized, it provides a clean and efficient solution.

Common Mistakes

Beginners often run into a few pitfalls with this problem:

  • Modifying the array when not allowed: For example, sorting the array in place or using methods that alter the array (like swapping elements) may violate the problem conditions if the interviewer expects you not to change the input. Always clarify whether modifying input is acceptable. In this LeetCode problem, it specifically is not.

  • Using too much extra memory: A solution that uses a data structure like a list or hash map to store frequencies will find the duplicate, but it uses O(n) extra space. If the problem expects O(1) space, that solution, while correct, is not optimal. It’s a common initial solution (and fine for a first attempt), but you should then discuss how to reduce space.

  • Brute force on large inputs: Writing a double loop to compare every pair will work logically, but it will time out or be extremely slow for large arrays (like n = 100k). This is a mistake in terms of efficiency.

  • Off-by-one errors in binary search: When implementing the binary search on the value range, it's easy to make mistakes with the low and high boundaries or the mid calculation. For instance, not updating low or high correctly can lead to an infinite loop. Always test the binary search logic on simple cases (like a small array) to ensure it narrows down properly.

  • Misunderstanding Floyd’s algorithm: A common mistake is to think that the meeting point of the tortoise and hare in phase 1 is the duplicate. It’s not — the meeting point is just an intersection inside the cycle. You must do the second phase to find the actual duplicate. Also, forgetting to reset one pointer to the start for phase 2, or starting them incorrectly, will lead to wrong results or infinite loops. Another subtle bug can be not initializing tortoise and hare to the same starting position. In our solution, we started both at nums[0]. If you start one at index 0 and the other at nums[0] or something different, it won't work correctly.

  • Using mathematical formulas incorrectly: One might think of using the formula for the sum of 1..n or the XOR of 1..n to find the duplicate. For example, sum(nums) - sum(1..n) equals the duplicate only if the duplicate appears exactly twice. If the duplicate appears more than twice, some other numbers must be missing from the array, and the difference mixes the extra copies of the duplicate with the missing values: for nums = [2, 2, 2, 1] (n=3), sum(nums) - sum(1..3) = 7 - 6 = 1, not 2. Relying on such a trick is risky. Similarly, XOR-based methods (XOR of all numbers vs. XOR of 1..n) fail when the duplicate appears more than twice.

  • Not considering all constraints: Not realizing that the numbers range from 1 to n (and not 0 to n or any other range) might lead to confusion. That range detail is crucial for the cycle detection method. Also, not noticing that there is exactly one duplicate (and not multiple distinct duplicates) simplifies the problem — some might try to handle finding multiple duplicates unnecessarily.

Always double-check the problem constraints and ensure your solution logic aligns with them.

Edge Cases

Consider these edge and special cases to ensure your solution handles them:

  • Smallest array (n=1): The array of length 2 (since n+1) would be [1, 1]. The duplicate is obviously 1. Make sure the code handles this trivial case (our approaches do handle this: Floyd’s algorithm, for example, would immediately find the cycle of 1 -> 1).

  • Duplicate is the smallest number: e.g. nums = [1, 2, 3, 1]. Here n=3, array length 4. The duplicate (1) is at the start and end. Any approach should find 1. This tests if the algorithm correctly identifies a duplicate that occurs at the beginning of the array.

  • Duplicate is the largest number: e.g. nums = [2, 4, 1, 3, 4] (n=4, length 5). The number 4 (the max value) is duplicated. Ensure the binary search approach correctly handles the upper half in such a case, and that others also catch it.

  • Duplicate in the middle: e.g. nums = [1, 4, 2, 4, 3]. The duplicate is 4 in the middle of the array. This is a normal case but good to test different positions.

  • All elements same (except one): This scenario can't fully happen under the given constraints, because if one number repeats many times, some other number would be missing (violating "all others appear once"). A near scenario: nums = [5, 1, 2, 4, 5, 5] for n=5, where 5 appears three times and 3 appears zero times, which technically breaks the rule of all other numbers appearing exactly once. The problem constraints rule this case out, but our algorithms (especially Floyd's and the binary search) would still handle it, because the logic only needs one duplicate to exist. It's worth knowing that Floyd's cycle detection still finds the duplicate entrance even if the duplicate appears more than twice.

  • Nearly sorted array vs jumbled array: Some solutions might fail if they assume something about input order. Our described approaches do not rely on any ordering. For instance, nums = [1, 2, 3, 4, 2] and nums = [2, 1, 4, 3, 2] should both yield 2.

  • Performance on large input: If n is at the upper limit (100,000), the solution needs to be efficient. Brute force would never finish in reasonable time. Floyd’s algorithm or the binary search method will handle it fine. It’s good to mentally consider that scale to be confident in your complexity analysis.

Always test edge cases in your mind or with actual code to be sure your solution doesn’t break on the extremes.

Alternative Variations

This problem has a specific scenario (one duplicate in an array of n+1 with range 1..n), but there are several related variations and extensions. Let's discuss a few and how you might approach them:

  • Allowing modifications to the array: If the problem did not forbid modifying the array, you could use some neat tricks:

    • In-place swapping (Cyclic sort approach): Since we know numbers are 1..n, one approach is to try to "place" each number at its correct index (i.e., number x should ideally be at index x). You can iterate through the array and swap elements to their correct position (for example, if nums[i] != i and nums[i] != nums[ nums[i] ], swap them into place). During this process, if you ever find that the target position already has the same number, you've found a duplicate. This is O(n) time and O(1) space, but it actively rearranges the array.
    • Marking visited indices: Another trick is to treat the array indices as a hash by marking visited spots. For example, for each number you encounter, go to the index corresponding to that number and negate the value there to mark it visited. If you come across a number whose corresponding index is already negative, that number is a duplicate. This also modifies the array (and you have to be careful to restore it if needed). These methods are efficient but mutate the input array, which was not allowed in the original problem. However, they're great if modifications are permitted or if it’s a different problem variant (like many "find duplicates" problems allow this).
  • Multiple duplicates (finding all duplicates): Suppose the array could have more than one distinct duplicate (e.g., two different numbers each appearing twice). An example: [1, 2, 2, 3, 4, 4, 5] has duplicates 2 and 4. The approach changes because Floyd’s algorithm only finds one cycle entrance and the problem guarantee of one duplicate is gone. For finding all duplicates:

    • You could use a hash set to collect all numbers that appear more than once (just check as you go, similar to our HashSet approach but don't break on first found).
    • If modifications are allowed, the negative marking technique (or cyclic sort) can be extended to find all duplicates by marking and checking. LeetCode has a problem "Find All Duplicates in an Array" which specifically asks for all duplicates; the typical solution is to use the index marking trick to get O(n) time and O(1) space (a sketch of this trick appears after this list).
  • Duplicate and missing number together: Another variation is when you have an array of length n (not n+1) containing numbers 1..n, where one number is missing and one is duplicated (often called the "Set Mismatch" problem). For example: [1,2,2,4] (here 3 is missing and 2 is duplicate). This requires finding two numbers: the duplicate and the missing. Floyd’s algorithm doesn’t directly apply because the length isn't n+1. Instead, you can use math: the sum or XOR method can be applied here because the presence of a missing number balances the equations. Or you can modify the array (using index marking or swapping) to detect both anomalies. This is a different problem but conceptually related.

  • Different ranges or constraints: If the array values weren’t guaranteed to be 1..n, Floyd’s cycle detection wouldn’t directly work. You might then rely on hashing or sorting. Or if the array is extremely large but numbers are in a smaller range, counting sort or frequency arrays could be considered. Always adapt to the given constraints.

  • Continuous input stream or read-only array: If you cannot even read all numbers at once (say it's a stream) or the array is read-only, you might adapt by using the binary search counting method because it only needs counts (which you can accumulate on the fly or via multiple passes). Floyd’s requires random access but not modification, so it still works on read-only arrays.

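As referenced above, here is a minimal Python sketch of the index-marking (negation) trick for the "find all duplicates" variant; note that it temporarily modifies the array, which the original problem forbids (the function name is our own):

def find_all_duplicates(nums):
    # Values are in 1..n, so value v maps to index v - 1.
    duplicates = []
    for num in nums:
        index = abs(num) - 1  # num may already have been negated, so use abs()
        if nums[index] < 0:
            duplicates.append(abs(num))  # seen before: it's a duplicate
        else:
            nums[index] = -nums[index]   # mark value abs(num) as seen
    for i in range(len(nums)):           # restore the input array
        nums[i] = abs(nums[i])
    return duplicates

print(find_all_duplicates([1, 2, 2, 3, 4, 4, 5]))  # prints [2, 4]
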
In summary, the approach can change based on what you're allowed to do:

  • If you can sort or modify, simpler solutions become viable.
  • If you expect multiple duplicates, you have to collect more than one result.
  • If a missing element is involved, combine strategies (like sum/XOR equations for one missing and one duplicate); see the sketch below.
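
For the "duplicate and missing number together" (Set Mismatch) variation mentioned above, here is a minimal sketch that balances the equations using sums and sums of squares (the function name is our own):

def set_mismatch(nums):
    # nums has length n with values from 1..n; one value is duplicated
    # and one is missing. Let d = duplicate, m = missing.
    n = len(nums)
    diff = sum(nums) - n * (n + 1) // 2                                  # d - m
    sq_diff = sum(x * x for x in nums) - n * (n + 1) * (2 * n + 1) // 6  # d^2 - m^2
    total = sq_diff // diff                                              # d + m
    d = (diff + total) // 2
    return d, total - d  # (duplicate, missing)

print(set_mismatch([1, 2, 2, 4]))  # prints (2, 3)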

Understanding the core techniques (sorting, hashing, binary search counting, cycle detection, index marking) gives you a toolkit to handle any of these variations.

Related Problems

Learning to solve this problem opens the door to several other algorithmically similar problems. Here are some related LeetCode problems and concepts:

  • LeetCode 268. Missing Number: You are given an array of size n containing numbers from 0 to n (one number missing). This is a complement to the duplicate problem. Techniques like using sum or XOR to find the missing number are common. (In a sense, our problem is a "duplicate number" instead of a missing number, but both use the idea of an extra or missing count.)

  • LeetCode 41. First Missing Positive: While not about duplicates, it's another problem that uses index marking (cyclic sort idea) on an array with a constrained range (1..n) to find the smallest missing number. It’s good practice for using array index manipulation tricks, similar in spirit to some duplicate finding techniques.

  • LeetCode 219. Contains Duplicate II: Given an array and an integer k, determine whether two equal elements appear within k indices of each other. It tests duplicate detection within a sliding window, typically solved with a hash set or hash map of recent indices.
