字符串匹配算法

Posted on 五月 3, 2017 本文总阅读量次

字符串基本操作

字符串时字符的有序集合。

举例：给定N长度的字符串S，所有字符向右移动K个位置，K<=N。

void shiftByK(char S[], char shiftedS[], int N, int K) {
    // Iterate through the length of given string
    for(int i=0; i<N; i++) {
        // Find the index for this current character in shiftedS[]
        int idx = (i+K) % N;
        // Copy that character at the found index idx
        shiftedS[idx] = S[i];
    }
    // Add a NULL character to mark the end of string
    shiftedS[N] = '\0';
}

字符串搜索

举例：从字符从T中找到字符P出现的次数。

暴力方法：判断Ti…Tk是等于P。时间复杂度 $ O(P*T) $

伪代码

for i = 0 to length(T)-length(P):
    Found = true
    for j = i to i + length(P) - 1:
        if P[j-i] not equal to T[j]:
            Found = False
    if Found = True 
        answer = answer + 1

KMP算法(Knuth Morris Pratt algori算法)
此算法通过运用对这个词在不匹配时本身就包含足够的信息来确定下一个匹配将在哪里开始的发现，从而避免重新检查先前匹配的字符。
时间复杂度$ O(P+T) $

举例：
>
S：abcxabcdabxabcdabcdabcy
P: abcdabcy

首先匹配到abc，x与d不匹配，这时不需要从头开始匹配，因为d之前的abc是唯一的，所以只写从x位置开始。
>
S：abcxabcdabxabcdabcdabcy
P: abcdabcy

因为x与a不匹配，继续后移。
>
S：abcxabcdabxabcdabcdabcy
P: abcdabcy

因为x与c不匹配，再看c之前的字符串abcdab，前缀ab和后缀ab相同，所以直接从ab之后的字符c开始匹配。
>
S：abcxabcdabxabcdabcdabcy
P: abcdabcy

不匹配，继续后移。
>
S：abcxabcdabxabcdabcdabcy
P: abcdabcy

d与y不匹配，再看y之前的字符从abcdabc，前缀abc与后缀abc相同，直接从abc之后的d开始匹配。
>
S：abcxabcdabxabcdabcdabcy
P: abcdabcy
完全匹配，找到了。

现在的问题是怎么实现高效的next方法

举例：字符从aabaabaaa，找next数组
>
P: aabaabaaa
i: 012345678
n:

首先j在对一个字符，i在第二个字符，第一个字符的next总是0

ji
P: aabaabaaa
i: 012345678
n: 0

j=a,i=a,相同,所有i的next是j的下标+1

ji
P: aabaabaaa
i: 012345678
n: 01

因为上一次是相同的，i，j同时向后移一位。j=a,i=b,不相同，j退到j之前字符的next值de位置，即0。

j i
P: aabaabaaa
i: 012345678
n: 01
因为j=a,i=b,不同，所以i向后移一位。

j i
P: aabaabaaa
i: 012345678
n: 010
j=a,i=a,相同，所有i的next是j的下标+1。重复以上的规则，最后得到：
j      i
P: aabaabaaa
i: 012345678
n: 010123452

利用next数组，我们可以知道当字符不匹配是，因为冲那个位置开始匹配。

举例：
>
S：abxabcabcaby
P: abcaby
n: 000120

当x与c不匹配是，我们察看c之前的字符的next是0,所有需要从第一个字符开始与x匹配。

实现：

/**
 * Date 09/22/2014
 * @author tusroy
 * 
 * Do pattern matching using KMP algorithm
 * 
 * Runtime complexity - O(m + n) where m is length of text and n is length of pattern
 * Space complexity - O(n)
 */
public class SubstringSearch {
    /**
     * Slow method of pattern matching
     */
    public boolean hasSubstring(char[] text, char[] pattern){
        int i=0;
        int j=0;
        int k = 0;
        while(i < text.length && j < pattern.length){
            if(text[i] == pattern[j]){
                i++;
                j++;
            }else{
                j=0;
                k++;
                i = k;
            }
        }
        if(j == pattern.length){
            return true;
        }
        return false;
    }
    
    /**
     * Compute temporary array to maintain size of suffix which is same as prefix
     * Time/space complexity is O(size of pattern)
     */
    private int[] computeTemporaryArray(char pattern[]){
        int [] lps = new int[pattern.length];
        int index =0;
        for(int i=1; i < pattern.length;){
            if(pattern[i] == pattern[index]){
                lps[i] = index + 1;
                index++;
                i++;
            }else{
                if(index != 0){
                    index = lps[index-1];
                }else{
                    lps[i] =0;
                    i++;
                }
            }
        }
        return lps;
    }
    
    /**
     * KMP algorithm of pattern matching.
     */
    public boolean KMP(char []text, char []pattern){
        
        int lps[] = computeTemporaryArray(pattern);
        int i=0;
        int j=0;
        while(i < text.length && j < pattern.length){
            if(text[i] == pattern[j]){
                i++;
                j++;
            }else{
                if(j!=0){
                    j = lps[j-1];
                }else{
                    i++;
                }
            }
        }
        if(j == pattern.length){
            return true;
        }
        return false;
    }
        
    public static void main(String args[]){
        
        String str = "abcxabcdabcdabcy";
        String subString = "abcdabcy";
        SubstringSearch ss = new SubstringSearch();
        boolean result = ss.KMP(str.toCharArray(), subString.toCharArray());
        System.out.print(result);
        
    }
}