This post contains my notes and implementation of the bad character rule algorithm. It's inspired by Algorithms on Strings, Trees and Sequences by Dan Gusfield.

The source code can be found on github.

Let's start with a naive O(nm) implementation of `substring`

public static ListsubstringV1(String pattern, String text) { List indices = new ArrayList<>(); int n = text.length() - pattern.length(); for (int i = 0; i <= n; i++) { int j = i, k = 0; while (k < pattern.length() && text.charAt(j++) == pattern.charAt(k++)); if (k == pattern.length()) indices.add(i); } return indices; }

In the naive implementation when a mismatch occurs the starting point of
comparison in text (`i`

) is increased by one while the comparison with the
pattern string happens in the inner while loop.

The bad character rules is an improvement over the naive implementation. In some situations it can reduce the number of character comparisons by shifting the pattern to the right by more than one character when a mismatch occurs.

The bad character rule requires the following changes to the naive implementation

1. pre-process the pattern string registering the last index occurrence of each character. 1. compare characters right to left. 3. when a mismatch occurs, if the character from text exists in the pre-processed pattern and its position in the pre-processed pattern is to the left of the current pattern position, the pattern can be shifted to the right so that the characters are aligned. Otherwise the pattern is shifted to the right by one position.

An example should make this easier to understand. Suppose we have the text
`AACCCBAAAAD`

and the pattern `BAAAAD`

.

First, we pre-process the pattern generating the following mapping between characters and their highest index positions in the string:

B -> 0 A -> 4 D -> 5

Most textbooks recommend using an array for storing this mapping. This works fine if we're dealing with a small alphabet such as ascii. Most likely though, we're dealing with unicode and the memory usage becomes prohibitive. In our implementation we use a hash map.

After the mapping is built, we start the comparison at position 5. There we have
a mismatch since text(5) is `B`

and pattern(5) is `D`

.
We check the mapping for the character `B`

and it points to the
position `0`

in the pattern. Then we shift the pattern to the right
skipping the text prefix `AACCC`

.

# beginning of first comparison iteration 01234567890 AACCCBAAAAD BAAAAD # beginning of second comparison iteration 01234567890 AACCCBAAAAD BAAAAD

Finally, the algorithm compares text at indexes from 10 to 5 against the pattern from indexes 5 to 0. In this case there's a match and the algorithm returns true.

Here's one possible implementation in java

public static ListsubstringV2(String pattern, String text) { List indices = new ArrayList<>(); if (pattern.length() > text.length()) return indices; Map r = new HashMap<>(); for (int i = 0; i < pattern.length(); i++) r.put(pattern.charAt(i), i); for (int n = pattern.length() - 1; n < text.length();) { int k = n; // index on text int i = pattern.length() - 1; // index on pattern while (i >= 0 && text.charAt(k) == pattern.charAt(i)) { i--; k--; } if (i < 0) { indices.add(k + 1); n = n + 1; } else { n = n + Math.max(1, i - r.getOrDefault(text.charAt(k), 0)); } } return indices; }

The extended bad character rule deals with the case where on a mismatch the character from t is found to the right of the current index of the pattern string.

Suppose we have the pattern `BABCCCAAB`

and the text `CCCCCCBABCCAAB`

. The
algorithm first compares the caracters at position 8 followed by position 7. Both characters match the pattern.

01234567890123 CCCCCCBABCCAAB BABCCCAAB

Next, at position 6 there's a mismatch. If we follow the bad character rule we
find that `B`

is to the right of the current index of pattern. Of course, the
algorithm wouldn't shift the pattern to the left and instead it would move the
pattern to the right by only one position.

The extended bad character rule works around this problem by keep tracking of each character in the pattern and all of their positions. In our implementation we use a linked list to store the mapping. Another alternative is to use binary search tree.

In this example, the pre-processing of the pattern yields the following map

B -> (0, 2, 8) A -> (1, 6, 7) C -> (3, 4, 5)

Finally, when a mismatch occurs and the character in text is found in the mapping, we walk the linked list trying to find the first index that is to the left of the current pattern index. If one doesn't exist we shift the pattern to the right by one position.

# beginning of first comparison iteration 01234567890123 CCCCCCBABCCAAB BABCCCAAB # beginning of second comparison iteration 01234567890123 CCCCCCBABCCAAB BABCCCAAB

Notice that because of the extended rule the algorithm shifted the pattern by 4 characters to the right while the standard bad character rule would only shift the pattern by one position.

Here's the implementation in java

public static ListsubstringV3(String pattern, String text) { List indices = new ArrayList<>(); if (pattern.length() > text.length()) return indices; Map > r = new HashMap<>(); for (int i = 0; i < pattern.length(); i++) { char c = pattern.charAt(i); if (!r.containsKey(c)) r.put(c, new LinkedList<>()); LinkedList l = r.get(c); l.addFirst(i); } for (int n = pattern.length() - 1; n < text.length();) { int k = n; // index on text int i = pattern.length() - 1; // index on pattern while (i >= 0 && text.charAt(k) == pattern.charAt(i)) { i--; k--; } if (i < 0) { indices.add(k + 1); n = n + 1; } else { int ii = i; int l = r.getOrDefault(text.charAt(k), new LinkedList<>()) .stream() .filter(j -> j < ii) .findFirst() .orElse(0); n = n + Math.max(1, i - l); } } return indices; }

*EDIT 2021-01-23*:

I've updated the code to return the indices of instances of `pattern`

found in text.

©2023 daniberg.com