diff --git a/posts/suffix-array-searching-lemma/lcp-cases.png b/posts/suffix-array-searching-lemma/lcp-cases.png
new file mode 100644
index 0000000..dca4d74
Binary files /dev/null and b/posts/suffix-array-searching-lemma/lcp-cases.png differ
diff --git a/posts/suffix-array-searching-lemma/suffix-array-searching-lemma.org b/posts/suffix-array-searching-lemma/suffix-array-searching-lemma.org
new file mode 100644
index 0000000..2002042
--- /dev/null
+++ b/posts/suffix-array-searching-lemma/suffix-array-searching-lemma.org
@@ -0,0 +1,199 @@
+#+title: A lemma on suffix array searching
+#+filetags: @results suffix-array
+#+OPTIONS: ^:{} num: num:t
+#+hugo_front_matter_key_replace: author>authors
+#+toc: headlines 3
+#+date: <2024-10-05 Sat>
+
+
+We'll prove that the "faster" binary search algorithm (see [[#faster-search]]), which tracks the LCP
+of the pattern with the left and right boundaries of the remaining search interval, has amortized
+runtime
+
+$$
+O\Big(\lg_2(n) + |P| + \lg_2(Occ(P))\cdot |P|\Big),
+$$
+when $P$ is a randomly sampled fixed-length pattern from the text and $Occ(P)$ counts the number of occurrences of $P$ in the text.
+
+Thus, when searching for patterns that follow the same distribution
+as the input text, the performance of the faster search method is nearly as good as the
+LCP-based $O(|P| + \lg_2 n)$ method when patterns have only a few matches.
+
+First some background.
+
+* Suffix arrays
+
+Suffix arrays were introduced by [cite/t:@suffix-arrays-manber-myers-90].
+A suffix array $S$ of a text $T$ of length $n = |T|$ is a permutation of
+$[n]=\{0, \dots, n-1\}$ such that $T[S[i]..] < T[S[i+1]..]$, that is, $S$ lists
+the starting positions of the suffixes of $T$ in lexicographic order.
+
+Given the suffix array, searching for a pattern $P$ of length $|P|$
+returns the position $p$ of the /first/ suffix with $T[S[p]..]\geq P$.
+
+In general, we are usually interested in the entire /interval/ of suffixes
+$S[l..r]$ that start with the given pattern.
+This can be done with
+two independent searches, and so does not affect the theoretical complexity. But
+of course, in practice it's more efficient to find both boundaries in one go.
+
+* Searching methods
+
+[cite/author/bc:@suffix-arrays-manber-myers-90] introduce three methods to
+search the suffix array, all based on binary search.
+
+** Naive $O(|P|\cdot \lg_2 n)$ search
+
+The naive method works using a simple binary search on the suffix array.
+
+#+caption: Python code for searching a suffix array the naive way.
+#+begin_src py
+def search(T, S, P):
+    n = len(T)
+    # Invariant: suffixes S[..l] are < P, suffixes S[r..] are >= P.
+    l, r = 0, n
+    while l < r:
+        m = (l+r)//2
+        M = T[S[m]:]
+        if M < P:
+            l = m + 1
+        else:
+            r = m
+    # Position p of the first suffix >= P; p == n if every suffix is smaller.
+    return l
+#+end_src
+
+The binary search needs at most $\lceil \lg_2(n+1)\rceil$ iterations, and in
+each iteration the slowest operation is the comparison of the suffix $M=T[S[m]..]$
+with the pattern =P=, which in the worst case compares $|P|$ characters in $O(|P|)$ time.
+Thus, the overall runtime is $O(|P| \cdot \lg_2 n)$.
+
+** Faster $O(|P|\cdot \lg_2 n)$ search
+:PROPERTIES:
+:CUSTOM_ID: faster-search
+:END:
+One weakness of the naive method above is that each iteration may compare the
+entire pattern $P$ from the start. As the remaining search interval $[l, r]$
+narrows, the remaining suffixes get more and more similar to $P$. In fact, once
+we know that the first $c$ characters of $L=T[S[l]..]$ match the first $c$
+characters of $R=T[S[r]..]$, every remaining string in the interval also starts
+with these same first $c$ characters, and so does $P$.
+Thus, in the comparison of $M=T[S[m]..]$ and $P$, we can skip the first $c$ characters.
+
+In practice, we implement this by keeping the lengths $c_l$ and $c_r$ of the longest common prefix of $P$ with
+$L$ and $R$, and skipping the first $\min(c_l, c_r)$ character comparisons.
+
+#+caption: Python code for searching a suffix array the faster way.
+#+begin_src py
+def lcp(a, b):
+    # Length of the longest common prefix of a and b.
+    k = min(len(a), len(b))
+    i = 0
+    while i < k and a[i] == b[i]:
+        i += 1
+    return i
+
+def search(T, S, P):
+    n = len(T)
+    l, r = 0, n
+    # LCP of P with the suffixes just outside the left and right boundary.
+    c_l, c_r = 0, 0
+    while l < r:
+        m = (l+r)//2
+        M = T[S[m]:]
+        c = min(c_l, c_r)
+        # The lcp and the comparison can be computed at the same time.
+        c_m = c + lcp(M[c:], P[c:])
+        if M[c:] < P[c:]:
+            l = m + 1
+            c_l = c_m
+        else:
+            r = m
+            c_r = c_m
+    # Position p of the first suffix >= P; p == n if every suffix is smaller.
+    return l
+#+end_src
+
+In the worst case, this method is still $O(|P| \cdot \lg_2 n)$, but it's widely
+known that this method performs very well in practice, and can even be faster than
+the $O(|P|+\lg_2 n)$ method below.
+
+** LCP-based $O(|P| + \lg_2 n)$ search
+
+Now suppose that for every step in the binary search, we have already precomputed values
+$x_l=LCP(L, M)$ and $x_r=LCP(M, R)$.
+(This is possible in $O(n)$ time and space using a bottom-up approach on the
+implicit binary search tree that contains around $n$ nodes.)
+
+Following the original paper, each step of the binary search falls into one of
+three cases, which they nicely illustrate.
+
+#+caption: Three cases of the binary search, taken from [cite/t:@suffix-arrays-manber-myers-90].
+#+attr_html: :class inset large
+[[file:lcp-cases.png]]
+
+The black bars indicate the length of the LCP of each suffix with $P$. Overloading
+notation, $l$ and $r$ here denote the lengths of the LCP of $P$ with the suffixes $L$
+and $R$ at the left and right of the interval. Assume that $l\geq r$ (the symmetric case is equivalent).
+The grey area is the length of the LCP of $L=T[S[l]..]$ and $M=T[S[m]..]$.
+
+Let $x = LCP(L, M)$. The three cases are, in order:
+- $x > l$ :: In this case, we know that $P$ is larger than $L$ in the
+  $l+1$'st character, and since $x>l$, $L$ and $M$ are equal in their $l+1$'st
+  character, so $P$ is also larger than $M$ in its $l+1$'st character and
+  $P>M$, so we branch right.
+- $x=l$ :: In this case, we know that $P$ shares the first $x$ characters with
+  $L$ and hence also with $M$. We now compare $P$ with $M$ starting at the
+  $l+1$'st character.
+  Let's say that $h$ equal characters are compared.
+  If we branch left, the new value of $r$ is $l+h$, and if we branch right, the
+  new value of $l$ is $l+h$. Thus, $\max(l, r)$ always increases from $l$ to $l+h$.
+- $x<l$ :: In this case, $L$ and $M$ first differ in their $x+1$'st character,
+  where $M$ is larger because the suffixes are sorted. Since $x<l$, $P$ agrees
+  with $L$ in that character, so $P<M$ and we branch left without comparing
+  any characters. The new value of $r$ is $x$.
+
+[...]
+
+When $i>n-m_P$, the pattern is simply a bit shorter than $m_P$. In this case,
+we assume that $P$ includes a sentinel character at the end, and thus will have
+exactly $1$ occurrence in the text.
+
+Consider a step in the binary search, and assume again that $l \geq r$. We start
+comparing $P$ and $M$ at their $\min(l,r)+1=r+1$'st character, and let $y=LCP(P,
+M)$ be the computed length of the LCP of $M$ and $P$, which requires $h=y-r$
+comparisons of equal characters. We also still define $x=LCP(L, M)$,
+although we do not explicitly compute it. There are two cases:
+
+- $y < m_P$ ::
+  When $M$ does not start with $P$, we know that the pattern is larger or
+  smaller than $M$ with equal probability, since the pattern was randomly
+  sampled from the suffix array and thus each of the $m-l$ positions left of $M$
+  and each of the $r-m$ positions right of $M$ has an equal probability of corresponding
+  to the chosen pattern. (In the case where $r-l$ is odd, we can randomly choose
+  between $m=\lfloor \frac{l+r}2\rfloor$ and $m=\lceil \frac{l+r}2\rceil$ to
+  truly equalize the probabilities.)
+
+  Thus, with probability $1/2$, the minimum of $l$ and $r$ increases to $y$, and
+  so in expectation, the sum of $l$ and $r$ increases by at least $(y-r)/2 =
+  h/2$. Since the sum of $l$ and $r$ is at most $2|P|$, the expected total number of
+  comparisons is at most $4|P|$.
+
+- $y = m_P$ ::
+  When $M$ starts with $P$, we trivially do at most $|P|$ comparisons.
+  When there are $Occ(P)$ occurrences of the pattern $P$, this situation
+  can happen at most $\lg_2 Occ(P)$ times, and so this incurs a total cost
+  bounded by $O(|P| \cdot \lg_2 Occ(P))$.
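
The case analysis above can also be checked empirically. Below is a minimal sketch, not taken from the original post, that instruments the faster search to count comparisons of equal characters (the quantity $h$ summed over all steps). The names =lcp=, =build_suffix_array=, =search_counting= and =average_matched= are hypothetical helpers. The suffix array is built by plain sorting, which is only suitable for small texts, and patterns are taken as substrings at uniformly random start positions, which is an assumption about the sampling model.

#+begin_src py
import random

def lcp(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = min(len(a), len(b))
    i = 0
    while i < k and a[i] == b[i]:
        i += 1
    return i

def build_suffix_array(T):
    """Suffix array by direct sorting (assumption: T is small)."""
    return sorted(range(len(T)), key=lambda i: T[i:])

def search_counting(T, S, P):
    """Faster search. Returns (p, matched): the position p of the first
    suffix >= P and the total number of equal-character comparisons."""
    l, r = 0, len(S)
    c_l, c_r = 0, 0
    matched = 0
    while l < r:
        m = (l + r) // 2
        M = T[S[m]:]
        c = min(c_l, c_r)       # characters known to match, skipped
        h = lcp(M[c:], P[c:])   # equal characters compared in this step
        matched += h
        if M[c:] < P[c:]:
            l, c_l = m + 1, c + h
        else:
            r, c_r = m, c + h
    return l, matched

def average_matched(T, S, k, trials=1000, seed=0):
    """Average matched-character count over random length-k patterns of T.
    Assumes 1 <= k <= len(T)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        i = rng.randrange(len(T) - k + 1)
        total += search_counting(T, S, T[i:i + k])[1]
    return total / trials
#+end_src

One possible use is to generate a random text over a small alphabet and compare =average_matched(T, S, k)= with $k$ for a few pattern lengths. If the analysis holds, the ratio should stay bounded by a small constant as long as the sampled patterns have few occurrences.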
+
+We conclude that when we fix a length $m_P$ and choose a pattern $P$ of length $m_P$ uniformly at random from the input
+text, the amortized cost of a search is
+$$
+O\Big(\lg_2(n) + m_P + \lg_2(Occ(P))\cdot m_P\Big).
+$$
+
+
+#+print_bibliography:
diff --git a/references.bib b/references.bib
index 5528e73..3fdfe61 100644
--- a/references.bib
+++ b/references.bib
@@ -3856,3 +3856,18 @@ @Article{suffix-arrays-with-a-twist
   url = {http://dx.doi.org/10.31577/cai_2019_3_555},
   publisher = {Central Library of the Slovak Academy of Sciences}
 }
+
+@Article{suffix-arrays-manber-myers-90,
+  author = {Manber, Udi and Myers, Gene},
+  title = {Suffix Arrays: A New Method for On-Line String Searches},
+  journal = {SIAM Journal on Computing},
+  year = 1993,
+  volume = 22,
+  number = 5,
+  month = oct,
+  pages = {935–948},
+  issn = {1095-7111},
+  doi = {10.1137/0222058},
+  url = {http://dx.doi.org/10.1137/0222058},
+  publisher = {Society for Industrial & Applied Mathematics (SIAM)}
+}