-
Notifications
You must be signed in to change notification settings - Fork 1
KMP
Pattern matching algorithm - find a needle(pattern) in haystack(text). Runs in O(n) avg/worst case time complexity.
- Reduce the number of comparisons made when checking for a match for the pattern in the text.
- Implemented by preprocessing the pattern and constructing a longest prefix which is also a suffix (LPS).
Example s: "acccbaaacccbaac" lps for s: [0,0,0,0,0,1,1,1,2,3,4,5,6,7,2]
lps[i] returns the longest prefix which is also the suffix for s(0..i)
-
Starting from the 1st index(end), we slide and check if there is a match from the 0th index(begin) of s.
-
In case of match, we increment both the indices, and make lps for the endIndex as the beginIndex ==> at this position, we have found a beginIndex length where the s(0..beginIndex) is the same as s(endIndex-beginIndex..endIndex)
In the above example, consider endIndex = 9, beginIndex= 2, lps[9] = 3. s(0..2) = acc is the same as s(7..9) = acc
- If it's not a match, if beginIndex > 0, we move the beginIndex to the lps[beginIndex-1] ==> find the largest prefix which is also a prefix at position beginIndex-1. This way we don't have to compare the lps[beginIndex-1] chars as they have already been compared before.
Again in the above example, consider endIndex = 15, beginIndex = 7. s[beginIndex] = a, s[endIndex] = c. Since they don't match, you step back beginIndex = lps[beginIndex-1] = lps[6] = 1. This is because all the chars before beginIndex which is s[0..beginIndex-1] = a, already match in s[endIndex-beginIndex..endIndex-1] = a. Hence we only compare from the beginIndex which is s[1] = c with endIndex which is s[15] = c and since they match we have lps[15] = newBeginIndex (which is beginIndex+1) = 2
- And finally, if its not a match and if beginIndex is 0, we cannot go back any more and set lps for that endIndex as 0 and move forward on the endIndex.
private void PreprocessLPS(string s, int[] lps)
{
int i=0,j=1;
while (j < s.Length)
{
if (s[i] == s[j])
{
lps[j] = ++i;
j++;
}
else if (lps[i] == 0) j++;
else i = lps[i-1];
}
}
Example text: "mississippi", pattern: "issip", lps on pattern [0,0,0,1,0]
-
On similar lines to LPS, you have a textIndex and patternIndex, and you perform comparisons in the sliding window.
-
In case of a match, you increment both the indices.
-
If it's not a match, and if the patternIndex > 0, you move the patternIndex to lps[patternIndex-1].
Ex: when textIndex = 5, patternIndex = 4, the chars don't match and you move the patternIndex to lps[4-1] = 1. This is because the substring pattern(0..patternIndex-1) which is "i" is already checked for and is present at text(textIndex-patternIndex..textIndex) = "i". So, you check from patternIndex = 1, and textIndex = 5
- And finally, if it's not a match and if the patternIndex is 0, we cannot go back any more in the pattern, hence increment the textIndex and start searching afresh
// returns the first index of the needle position in the haystack
public int StrStr(string haystack, string needle)
{
if (needle.Length == 0) return 0;
int[] lps = new int[needle.Length];
PreprocessLPS(needle, lps);
int i=0,j=0;
while (i < haystack.Length)
{
if (haystack[i] == needle[j])
{
i++; j++;
if (j == needle.Length) return i-needle.Length;
}
else if (j == 0) i++;
else j = lps[j-1];
}
return -1;
}
https://leetcode.com/problems/implement-strstr/
https://leetcode.com/problems/longest-happy-prefix/
https://leetcode.com/problems/shortest-palindrome/