I’ve been revisiting Algorithms on Strings, Trees and Sequences by Dan Gusfield and while reading about the bad character rule I thought I should drop a few of my notes and code here.

The source code can be found here https://github.com/dberg/strings.

We’ll start with a naive implementation of issubstring:

public static boolean issubstringv1(String pattern, String text) {
    int n = text.length() - pattern.length();
    for (int i = 0; i <= n; i++) {
        int j = i, k = 0;
        while (k < pattern.length() && text.charAt(j++) == pattern.charAt(k++));
        if (k == pattern.length()) return true;
    }
    return false;
}

In the naive implementation when a mismatch occurs the starting point of comparison in text (i) is increased by one while the comparison with the pattern string happens in the inner while loop.

The bad character rule

The bad character rules is an improvement over the naive implementation. In some situations it can reduce the number of character comparisons by shifting the pattern to the right by more than one character when a mismatch occurs.

The bad character rule requires the following changes to the naive implementation:

  1. pre-process the pattern string registering the last index occurrence of each character.

  2. compare characters right to left.

  3. when a mismatch occurs, if the character from text exists in the pre-processed pattern and its position in the pre-processed pattern is to the left of the current pattern position, the pattern can be shifted to the right so that the characters are aligned. Otherwise the pattern is shifted to the right by one position.

An example should make this easier to understand. Suppose we have the text AACCCBAAAAD and the pattern BAAAAD.

First, we pre-process the pattern generating the following mapping between characters and their highest index positions in the string:

B -> 0
A -> 4
D -> 5

Most textbooks recommend using an array for storing this mapping. This works fine if we’re dealing with a small alphabet such as ascii. Most likely though, we’re dealing with unicode and the memory usage becomes prohibitive. In our implementation we use a hash map.

After the mapping is built, we start the comparison at position 5. There we have a mismatch since text(5) is B and pattern(5) is D. We check the mapping for the character B and it points to the position 0 in the pattern. Then we shift the pattern to the right skipping the text prefix AACCC.

# beginning of first comparison iteration
01234567890
AACCCBAAAAD
BAAAAD

# beginning of second comparison iteration
01234567890
AACCCBAAAAD
     BAAAAD

Finally, the algorithm compares text at indexes from 10 to 5 against the pattern from indexes 5 to 0. In this case there’s a match and the algorithm returns true.

Here’s one possible implementation in java

public static boolean issubstringv2(String pattern, String text) {
    Map<Character, Integer> r = new HashMap<>();
    for (int i = 0; i < pattern.length(); i++)
        r.put(pattern.charAt(i), i);
    int n = text.length() - pattern.length();
    for (int i = 0; i <= n;) {
        int j = i + pattern.length() - 1;
        int k = pattern.length() - 1;
        while (k >= 0 && text.charAt(j--) == pattern.charAt(k--));
        if (k < 0) return true;
        char c = text.charAt(++j);
        if (r.containsKey(c)) {
            i += Math.max(k - r.get(c) + 1, 1);
        } else {
            i++;
        }
    }
    return false;
}

The extended bad character rule

The extended bad character rule deals with the case where on a mismatch the character from t is found to the right of the current index of the pattern string.

Suppose we have the pattern BABCCCAAB and the text CCCCCCBABCCAAB. The algorithm first compares the caracters at position 8 followed by position

  1. Both characters match the pattern.
01234567890123
CCCCCCBABCCAAB
BABCCCAAB

Next, at position 6 there’s a mismatch. If we follow the bad character rule we find that B is to the right of the current index of pattern. Of course, the algorithm wouldn’t shift the pattern to the left and instead it would move the pattern to the right by only one position.

The extended bad character rule works around this problem by keep tracking of each character in the pattern and all of their positions. In our implementation we use a linked list to store the mapping. Another alternative is to use binary search tree.

In this example, the pre-processing of the pattern yields the following map:

B -> (0, 2, 8)
A -> (1, 6, 7)
C -> (3, 4, 5)

Finally, when a mismatch occurs and the character in text is found in the mapping, we walk the linked list trying to find the first index that is to the left of the current pattern index. If one doesn’t exist we shift the pattern to the right by one position.

# beginning of first comparison iteration
01234567890123
CCCCCCBABCCAAB
BABCCCAAB

# beginning of second comparison iteration
01234567890123
CCCCCCBABCCAAB
    BABCCCAAB

Notice that because of the extended rule the algorithm shifted the pattern by 4 characters to the right while the standard bad character rule would only shift the pattern by one position.

Here’s the implementation in java:

public static boolean issubstringv3(String pattern, String text) {
    Map<Character, LinkedList<Integer>> r = new HashMap<>();
    for (int i = 0; i < pattern.length(); i++) {
        char c = pattern.charAt(i);
        if (!r.containsKey(c))
            r.put(c, new LinkedList<Integer>());
        LinkedList<Integer> l = r.get(c);
        l.addFirst(i);
    }
    int n = text.length() - pattern.length();
    for (int i = 0; i <= n;) {
        int j = i + pattern.length() - 1;
        int k = pattern.length() - 1;
        while (k >= 0 && text.charAt(j--) == pattern.charAt(k--));
        if (k < 0) return true;
        char c = text.charAt(++j);
        if (r.containsKey(c)) {
            k++;
            int step = -1;
            LinkedList<Integer> l = r.get(c);
            for (int p : l) {
                if (p < k) {
                    step = k - p;
                    break;
                }
            }
            i += Math.max(step, 1);
        } else {
            i++;
        }
    }
    return false;
}

A few notes about unicode

I dodged the bullet by returning not indexes but a boolean value. The type char in java is 16 bit which is not enough to store a code point. I’ll talk about this in a future post.