zalgorithm
This post contains my notes and implementation of the zalgorithm.
Similar to the Bad Character Rule post this post is inspired by Gusfield’s book.
The code can be found on github.
The zalgorithm finds in linear time if a pattern
is a substring of text
.
It starts by creating the string s
following the recipe pattern$text
. The
character $
cannot exist in pattern
or text
. The choice of $
has a few
interesting implications as we’ll see later.
After s
is built, the algorithm creates an array z
with the same length of
s
. The index r[i]
, for i > 0
, is the length of the longest substring
starting at i
that is also a prefix of s
.
For example:
i 0123456789
aabaaxaaba
z 0102104101
After z
is built, the algorithm iterates the array and if the expression
r[i] == pattern.length
yields true an instance of pattern
has be found as a
substring of text
.
The array z
can be built in linear time using the concept of a zbox. A zbox
is a substring that is a prefix of the string. For example:
01234567
abcxabcy
 

zbox
The substring abc
defines a zbox starting at l = 4
and ending at z = 6
.
The trick to calculate the rarray in linear time is to rely in the zbox
calculated during the previous iteration, i  1
, when calculating r[i]
. We
use the variables l
and r
to define the boundaries of the zbox.
There are two base cases the zalgorithm has to handle when calculating r[i]
.

If
i > r
, that means thati
is outside the previous zbox and the algorithm compares the characters against the prefix ofs
. Then, ifz[i] > 0
a new zbox is found andl
andr
are updated accordingly. 
If
i <= r
, that means thati
is inside of the previous zbox and there are two cases to be handled based on the value ofz[i  l]
andbeta = r  i + 1
. Ifz[i  l] < beta
we’ve found a new zbox. Otherwise, we have a new zbox that starts ati
but that might be larger than the current value ofr
.
On a first read that sounds complicated but the idea is simple. The expression
i  l
denotes the position of the character s(i)
in the prefix of the
string. The expression that defines beta is r  i + 1
. The beta variable is
the length defined from the index i
up to the end of the zbox.
So, z[i  l] < beta
indicates that no comparisons are necessary to calculate
z[i]
and we can define z[i] = z[i  l]
. On the other hand, if z[i  l] >= beta
, z[i]
is at least the same length of beta.
Before we implement the function that calculates the zarray there’s one more
problem to tackle. The algorithm requires a character that does not appear in
text
or pattern
. Which character should we use?
Java strings are represented in UTF16 which guarantees that the values between
U+D800
and U+DFFF
are reserved code points and will never be assigned a
character. A value in this range is a good candidate to be used as the separator
$
in S
. The problem is that this is not future proof. For example, the
string representation could be updated to
use UTF8 instead of UTF16 in a future JDK release where the range
U+D800U+DFFF
is not reserved.
One alternative is to use an int instead of chars and pick a separator that is greater than 16 bits. This approach works albeit with greater memory usage.
If instead of working with chars we’re interested in code points and returning code point indexes using ints is the alternative to follow.
The range of legal code points is U+0000
to U+10FFFF
and according to the
Character java class the lower 21 bits of an int are used to represent Unicode
code points and the most significant 11 bits must be zero. In this case, the
value FFE0 << 16
is a good candidate as a separator.
In our implementation we avoid having to pick an unique character identifier by
calculating the zarray
in two steps. We calculate the zarray
of s
in
two steps. We first calculate the zarray
slice that corresponds to the
pattern
followed by the remaining values of the zarray
that represents the
text
. The position of the separator character is left untouched and carries
the value zero.
The implementation follows
public class ZAlgorithm {
private static void calculateZ(int[] z, char[] s, int ini, int end) {
int l = 0; // zbox left
int r = 0; // zbox right
for (int i = ini; i <= end; i++) {
if (i > r) { // try to find a new box zbox
l = r = i;
while (r <= end && s[r] == s[r  l]) r++;
z[i] = r  l;
r;
} else { // inside a zbox
int k = i  l;
if (z[k] < (r  i + 1)) {
z[i] = z[k];
} else {
l = i; // calculate new zbox
while (r <= end && s[r] == s[r  l]) r++;
z[i] = r  l;
r;
}
}
}
}
private static int[] zarray(String pattern, String text) {
int plen = pattern.length();
int tlen = text.length();
int slen = plen + tlen + 1;
char[] s = new char[slen];
pattern.getChars(0, plen, s, 0);
text.getChars(0, tlen, s, plen + 1);
int[] r = new int[slen];
calculateZ(r, s, 1, plen  1); // pattern
calculateZ(r, s, plen + 1, slen  1); // text
return r;
}
public static List<Integer> issubstring(String pattern, String text) {
List<Integer> indices = new ArrayList<>();
if (pattern.length() > text.length()) return indices;
int[] z = zarray(pattern, text);
for (int i = pattern.length() + 1; i < z.length; i++)
if (z[i] == pattern.length())
indices.add(i  pattern.length()  1);
return indices;
}
}
As a curiosity, according to Gusfield the zalgorithm was first introduced in An O(n log n) Algorithm for Finding All Repetitions in a String.
EDIT: 20210109
I’ve added an optimized version of the zalgorithm
ZAlgorithmV2. It
calculates the zarray
only for the pattern while also avoiding unnecessary
memory allocations.
EDIT: 20210122
I’ve updated the code to return the indices of instances of pattern
found in
text and fixed a few broken links.