daniberg.com

Bad Character Rule

This post contains my notes and implementation of the bad character rule algorithm. It's inspired by Algorithms on Strings, Trees and Sequences by Dan Gusfield.

The source code can be found on github.

Let's start with a naive O(nm) implementation of substring

public static List substringV1(String pattern, String text) {
    List indices = new ArrayList<>();
    int n = text.length() - pattern.length();
    for (int i = 0; i <= n; i++) {
        int j = i, k = 0;
        while (k < pattern.length() && text.charAt(j++) == pattern.charAt(k++));
        if (k == pattern.length()) indices.add(i);
    }
    return indices;
}

In the naive implementation when a mismatch occurs the starting point of comparison in text (i) is increased by one while the comparison with the pattern string happens in the inner while loop.

The bad character rule

The bad character rules is an improvement over the naive implementation. In some situations it can reduce the number of character comparisons by shifting the pattern to the right by more than one character when a mismatch occurs.

The bad character rule requires the following changes to the naive implementation

1. pre-process the pattern string registering the last index occurrence of each
   character.

1. compare characters right to left.

3. when a mismatch occurs, if the character from text exists in the
   pre-processed pattern and its position in the pre-processed pattern is to the
   left of the current pattern position, the pattern can be shifted to the right
   so that the characters are aligned. Otherwise the pattern is shifted to the
   right by one position.

An example should make this easier to understand. Suppose we have the text AACCCBAAAAD and the pattern BAAAAD.

First, we pre-process the pattern generating the following mapping between characters and their highest index positions in the string:

B -> 0
A -> 4
D -> 5

Most textbooks recommend using an array for storing this mapping. This works fine if we're dealing with a small alphabet such as ascii. Most likely though, we're dealing with unicode and the memory usage becomes prohibitive. In our implementation we use a hash map.

After the mapping is built, we start the comparison at position 5. There we have a mismatch since text(5) is B and pattern(5) is D. We check the mapping for the character B and it points to the position 0 in the pattern. Then we shift the pattern to the right skipping the text prefix AACCC.

# beginning of first comparison iteration
01234567890
AACCCBAAAAD
BAAAAD

# beginning of second comparison iteration
01234567890
AACCCBAAAAD
     BAAAAD

Finally, the algorithm compares text at indexes from 10 to 5 against the pattern from indexes 5 to 0. In this case there's a match and the algorithm returns true.

Here's one possible implementation in java

public static List substringV2(String pattern, String text) {
    List indices = new ArrayList<>();
    if (pattern.length() > text.length()) return indices;

    Map r = new HashMap<>();
    for (int i = 0; i < pattern.length(); i++)
        r.put(pattern.charAt(i), i);

    for (int n = pattern.length() - 1; n < text.length();) {
        int k = n; // index on text
        int i = pattern.length() - 1; // index on pattern
        while (i >= 0 && text.charAt(k) == pattern.charAt(i)) { i--; k--; }
        if (i < 0) {
            indices.add(k + 1);
            n = n + 1;
        } else {
            n = n + Math.max(1, i - r.getOrDefault(text.charAt(k), 0));
        }
    }

    return indices;
}

The extended bad character rule

The extended bad character rule deals with the case where on a mismatch the character from t is found to the right of the current index of the pattern string.

Suppose we have the pattern BABCCCAAB and the text CCCCCCBABCCAAB. The algorithm first compares the caracters at position 8 followed by position 7. Both characters match the pattern.

01234567890123
CCCCCCBABCCAAB
BABCCCAAB

Next, at position 6 there's a mismatch. If we follow the bad character rule we find that B is to the right of the current index of pattern. Of course, the algorithm wouldn't shift the pattern to the left and instead it would move the pattern to the right by only one position.

The extended bad character rule works around this problem by keep tracking of each character in the pattern and all of their positions. In our implementation we use a linked list to store the mapping. Another alternative is to use binary search tree.

In this example, the pre-processing of the pattern yields the following map

B -> (0, 2, 8)
A -> (1, 6, 7)
C -> (3, 4, 5)

Finally, when a mismatch occurs and the character in text is found in the mapping, we walk the linked list trying to find the first index that is to the left of the current pattern index. If one doesn't exist we shift the pattern to the right by one position.

# beginning of first comparison iteration
01234567890123
CCCCCCBABCCAAB
BABCCCAAB

# beginning of second comparison iteration
01234567890123
CCCCCCBABCCAAB
    BABCCCAAB

Notice that because of the extended rule the algorithm shifted the pattern by 4 characters to the right while the standard bad character rule would only shift the pattern by one position.

Here's the implementation in java

public static List substringV3(String pattern, String text) {
    List indices = new ArrayList<>();
    if (pattern.length() > text.length()) return indices;

    Map> r = new HashMap<>();
    for (int i = 0; i < pattern.length(); i++) {
        char c = pattern.charAt(i);
        if (!r.containsKey(c))
            r.put(c, new LinkedList<>());
        LinkedList l = r.get(c);
        l.addFirst(i);
    }

    for (int n = pattern.length() - 1; n < text.length();) {
        int k = n; // index on text
        int i = pattern.length() - 1; // index on pattern
        while (i >= 0 && text.charAt(k) == pattern.charAt(i)) { i--; k--; }
        if (i < 0) {
            indices.add(k + 1);
            n = n + 1;
        } else {
            int ii = i;
            int l = r.getOrDefault(text.charAt(k), new LinkedList<>())
                    .stream()
                    .filter(j -> j < ii)
                    .findFirst()
                    .orElse(0);

            n = n + Math.max(1, i - l);
        }
    }

    return indices;
}

EDIT 2021-01-23:

I've updated the code to return the indices of instances of pattern found in text.