This post contains my notes and implementation of the z-algorithm.

Similar to the Bad Character Rule post this post is inspired by Gusfield's book.

The code can be found on github.

The z-algorithm finds in linear time if a `pattern`

is a substring of `text`

.

It starts by creating the string `s`

following the recipe
`pattern$text`

. The character `$`

cannot exist in
`pattern`

or `text`

. The choice of `$`

has a few interesting implications as we'll see later.

After `s`

is built, the algorithm creates an array `z`

with the same length of `s`

. The index `r[i]`

,
for `i > 0`

, is the length of the longest substring
starting at `i`

that is also a prefix of `s`

.

For example

i 0123456789 aabaaxaaba z 0102104101

After `z`

is built, the algorithm iterates the array and if the expression
`r[i] == pattern.length`

yields true an instance of `pattern`

has be found as a substring of `text`

.

The array `z`

can be built in linear time using the concept of a z-box. A z-box
is a substring that is a prefix of the string. For example

01234567 abcxabcy | | --- z-box

The substring `abc`

defines a z-box starting at `l = 4`

and ending at `z = 6`

.

The trick to calculate the r-array in linear time is to rely in the z-box
calculated during the previous iteration, `i - 1`

, when calculating
`r[i]`

. We use the variables `l`

and `r`

to
define the boundaries of the z-box.

There are two base cases the z-algorithm has to handle when calculating `r[i]`

.

1. If `i > r`

, that means that `i`

is outside the previous z-box and the
algorithm compares the characters against the prefix of `s`

. Then, if
`z[i] > 0`

a new z-box is found and `l`

and `r`

are updated
accordingly.

2. If `i <= r`

, that means that `i`

is inside of the previous
z-box and there are two cases to be handled based on the value of `z[i - l]`

and `beta = r - i + 1`

. If `z[i - l] < beta`

we've found a new z-box.
Otherwise, we have a new z-box that starts at `i`

but that might be larger than
the current value of `r`

.

On a first read that sounds complicated but the idea is simple. The expression
`i - l`

denotes the position of the character `s(i)`

in the
prefix of the string. The expression that defines beta is `r - i + 1`

.
The beta variable is the length defined from the index `i`

up to the end of the z-box.

So, `z[i - l] < beta`

indicates that no comparisons are necessary to calculate
`z[i]`

and we can define `z[i] = z[i - l]`

. On the other hand,
if `z[i - l] >= beta`

, `z[i]`

is at least the same length of beta.

Before we implement the function that calculates the z-array there's one more
problem to tackle. The algorithm requires a character that does not appear in
`text`

or `pattern`

. Which character should we use?

Java strings are represented in UTF-16 which guarantees that the values between
`U+D800`

and `U+DFFF`

are reserved code points and will never
be assigned a character. A value in this range is a good candidate to be used as the separator
`$`

in `S`

.
The problem is that this is not future proof. For example, the
string representation could be updated to
use UTF-8 instead of UTF-16 in a future JDK release where the range
`U+D800-U+DFFF`

is not reserved.

One alternative is to use an int instead of chars and pick a separator that is greater than 16 bits. This approach works albeit with greater memory usage.

If instead of working with chars we're interested in code points and returning code point indexes using ints is the alternative to follow.

The range of legal code points is `U+0000`

to `U+10FFFF`

and
according to the Character java class the lower 21 bits of an int are used to represent
Unicode code points and the most significant 11 bits must be zero. In this case, the
value `FFE0 << 16`

is a good candidate as a separator.

In our implementation we avoid having to pick an unique character identifier by
calculating the `z-array`

in two steps. We calculate the `z-array`

of `s`

in two steps. We first calculate the `z-array`

slice that
corresponds to the `pattern`

followed by the remaining values of the
`z-array`

that represents the `text`

. The position of the
separator character is left untouched and carries the value zero.

The implementation follows

public class ZAlgorithm { private static void calculateZ(int[] z, char[] s, int ini, int end) { int l = 0; // z-box left int r = 0; // z-box right for (int i = ini; i <= end; i++) { if (i > r) { // try to find a new box z-box l = r = i; while (r <= end && s[r] == s[r - l]) r++; z[i] = r - l; r--; } else { // inside a z-box int k = i - l; if (z[k] < (r - i + 1)) { z[i] = z[k]; } else { l = i; // calculate new z-box while (r <= end && s[r] == s[r - l]) r++; z[i] = r - l; r--; } } } } private static int[] zarray(String pattern, String text) { int plen = pattern.length(); int tlen = text.length(); int slen = plen + tlen + 1; char[] s = new char[slen]; pattern.getChars(0, plen, s, 0); text.getChars(0, tlen, s, plen + 1); int[] r = new int[slen]; calculateZ(r, s, 1, plen - 1); // pattern calculateZ(r, s, plen + 1, slen - 1); // text return r; } public static Listissubstring(String pattern, String text) { List indices = new ArrayList<>(); if (pattern.length() > text.length()) return indices; int[] z = zarray(pattern, text); for (int i = pattern.length() + 1; i < z.length; i++) if (z[i] == pattern.length()) indices.add(i - pattern.length() - 1); return indices; } }

As a curiosity, according to Gusfield the z-algorithm was first introduced in An O(n log n) Algorithm for Finding All Repetitions in a String.

*EDIT: 2021-01-09*

I've added an optimized version of the z-algorithm
ZAlgorithmV2. It
calculates the `z-array`

only for the pattern while also avoiding unnecessary
memory allocations.

*EDIT: 2021-01-22*

I've updated the code to return the indices of instances of `pattern`

found in
text and fixed a few broken links.

©2022 daniberg.com