-
Notifications
You must be signed in to change notification settings - Fork 0
/
Indexing Problems.tex
118 lines (103 loc) · 10.7 KB
/
Indexing Problems.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
\documentclass{article}
\usepackage{fullpage}
\usepackage{amsthm, amsmath}
\usepackage{url}
\newtheorem{claim}{Claim}
\newtheorem{definition}{Definition}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{observation}{Observation}
\newtheorem{question}{Problem}
\title{Indexing Problems via Function Inversion}
\author{Sam McCauley and Peter Zhao}
\date{}
\begin{document}
\maketitle
\section{Indexing Problems}
\label{sec:indexing}
Indexing problems are online variants of certain problems that generally modify the problems so that we are given a set of data, and we must query certain results from the data. These problems are framed so that there are two phases: a preprocessing phase and an online phase. In the preprocessing phase, we are allowed to perform arbitrary computation on the given data and construct a data structure of space $S$. Then, in the online phase, we want to be able to answer a query in $T$ time by using this data structure.
These problems are worth studying because often, we are faced with querying large sets of data for a specific result, as if they were stored in a database, and we would like to optimize on how much space the preprocessed data structure requires as well as how long a query takes in the online phase. This time space tradeoff becomes the cornerstone of these problems, and improvements include reducing both space and time further than what the current tradeoff curve allows.
\section{Time Space Tradeoffs}
\label{sec:tradeoffs}
There are trivially two main ways to solve these problems, at either end of the tradeoff curve. We can either:
\begin{enumerate}
\item Preprocess all the data and store into a table, so that query times are constant (but takes the maximum amount of space)
\item Perform no preprocessing and use no space, but require the maximum time required to check each element.
\end{enumerate}
There is some vagueness in the above descriptions, so we shall define some terms before we clarify. We have a set of data $D$ in which we want to check for membership for a specific query.
For example, we might have a set of students, and we want to check if given a birthday query, whether or not anyone was born on that day. Assume that we have an oracle function \texttt{calculateBirthday()} that takes in a student, and returns a birthday. An easy way to solve this problem is to preprocess by taking all the students, and store all their birthdays in a dictionary, so that we can query birthdays in constant time. Another method is that we can iterate through all the students, and check if their birthday is equal to our query by continuously applying \texttt{calculateBirthday()}. These correspond to methods 1 and 2 above at opposite sides of the tradeoff curve. Since we are not given the set necessary to answer queries, we must apply some function to transform it into the necessary query type. Often, the size of the set after naive preprocessing into our query type is larger than the input, as the image of the problem is larger than the original input data. Thus, we likely need to apply a procedure to the original data so that it can be appropriately converted to our query type via an oracle function. \footnote{In our birthday example, this is not true, as there are a constant number of birthdays, hence why we do not need a procedure}
This gives us the notion of original, implicit, and explicit forms. Again, assume we have an oracle function that converts our into the form we need for answering our query. We also presume the potential need for a procedure, which we will heavily depend on the problem data, as well as the type of query required. It may be the case that for the same problem data, we seek a different query, and so the procedure might be different.
\begin{enumerate}
\item Original form $D_o$ represent the raw input data.
\item Implicit form $D_i$ represents the given preimage of the problem, where we convert from the original form via some procedure. We can map this preimage to the image (the explicit form) via an oracle function.
\item Explicit form $D_e$ represents the form of the queries that we expect to receive, and the size of its domain is equivalent to the size needed in a lookup table.
\end{enumerate}
Diagrammatically:
$$\text{Original data} \rightarrow_{procedure} \text{Implicit Form} \rightarrow_{oracle} \text{Explicit Form}$$
All the procedure does is convert the input data into the necessary implicit form so that once we apply the oracle function to the implicit form, we immediately receive the explicit form that matches our query type so that if we store the explicit form into a table, we achieve constant query time and space usage proportional to the size of the explicit form's domain. We look to achieve less space than the oracle function's domain and range (the explicit and implicit forms respectively), and/or less time linear to the size of the original input data.
Thus, all indexing problem will have this structure, where we look to preprocess our input data into a dictionary data structure via an oracle function and a procedure. We will apply this concept to a few concrete indexing problems, making clear the necessary pieces that make it an indexing problem.
%%%%%%%%%%%%%TODO%%%%%%%%%%%%%%%%%%
\section{Examples}
\label{sec:examples}
% 3SUM Indexing
\begin{definition}[3SUM Indexing]
Given two lists of integers $A,B$ of length $n$, determine for a query $q$ if there exists $(a,b) \in A,B$ such that $a+b=q$.
\end{definition}
The indexing version of this classic problem has the following features:
\begin{itemize}
\item Original data: two lists $A,B$
\item Implicit form: A list of pairs of elements from $A,B$
\item Explicit form: the sumset $A+B$ which contains all the sums of each pair of elements in $A,B$
\item Procedure: Construct a list of all pairs of elements from $A,B$
\item Oracle function: the addition function $f(a,b) = a+b$
\end{itemize}
It can be solved in linear space and time by sorting and storing the lists, and using two pointers to answer the query. It can also be solved in quadratic space but constant time by storing the sumset.
% OV Indexing
\begin{definition}[Orthogonal Vectors Indexing]
Given a set $S$ that contains $n$ $d$-length boolean vectors, determine if there exists a vector in $S$ which is orthogonal to a query that is a $d$-length boolean vector.
\end{definition}
Orthogonal vectors are vectors such that their dot product is 0, or that they differ in every location with a 1 (so that when u perform a dot product, the result is 0).
\begin{itemize}
\item Original data: a list of $d$-length boolean vectors
\item Implicit form: The list of $d$-length boolean vectors with each element repeated $2^k$ times, where $k \leq d$ is the number of zeros in the specified vector
\item Explicit form: the list of all vectors orthogonal to the input list. This can be as large as all possible boolean vectors of length $d$, which gives us $2^d$ possible vectors.
\item Procedure: For each vector, repeat the vector $2^k$ times where $k \leq d$ is the number of zeros in the vector. Label each of these vectors so that they are indexed, and the index specifies the vector it is orthogonal to in lexicographic order.
\item Oracle function: the function mapping vectors to their orthogonal vectors $f(v_i) = w_i$ such that $w \cdot v_i = 0$, where $i$ is the index of the vector in lexicographic order.
\end{itemize}
OV-Indexing can be solved in linear time and space by storing the input vectors and performing the dot product between the query and input vectors. It can also be solved in $2^d$ space and constant time by storing all the orthogonal vectors.
% Substring Indexing
\begin{definition}[Substring Indexing]
Given a bitstring of length $n$ which represents the text, determine for a pattern bitstring of length $p \ll n$, if the pattern appears at an index in the text.
\end{definition}
This problem is also the problem of pattern matching or substring searching, but an online variant.
\begin{itemize}
\item Original data: the given text
\item Implicit form: a set of pairs of indicies $(i,j)$ representing the beginning and end of a substring
\item Explicit form: the list of all substrings
\item Procedure: For each $i,j \in n$ such that $i<j$, store $(i,j)$ in a list. This list of indices corresponds to a substring.
\item Oracle function: the function converting a substring index to a substring of length $p$
\end{itemize}
Substring Indexing can be solved in linear time proportional to the size of the text and pattern. We can also solve it in linear extra space and time proportional to the size of the text by storing the suffix tree of the text, which is an efficient substring index method.
% Jumbled Indexing
\begin{definition}[Jumbled Indexing]
Given a text $S$, determine if given a string $v$, whether or not there exists a substring in $S$ such that is jumble matches with $v$.
\end{definition}
Two strings jumble-match if one can permute the characters of one string so that it is equal to the other. For example $abc$ and $bca$ jumble match because we can rewrite $bca$ as $abc$. These vectors can also be called commutatively equivalent.
\begin{itemize}
\item Original data: the text $S$
\item Implicit form: a set of pairs of indicies $(i,j)$ representing the beginning and end of a substring
\item Explicit form: the list of \textit{Parikh vectors}, or vectors that maintain frequency count of all characters
\item Procedure: For each $i,j \in n$ such that $i<j$, store $(i,j)$ in a list. This list of indices corresponds to a substring.
\item Oracle function: the function converting two indices to a \textit{Parikh vector}.
\end{itemize}
Jumbled Indexing can be solved in similar manners to Substring Indexing, except now we are not looking for string matches, but rather character frequency matches.
\section{Function Inversion}
\label{sec:solution}
If a problem has the following structure above (and thus has "indexing" appended to its name), then we can potentially solve it with a better tradeoff by applying an algorithm developed by Fiat Naor for inverting functions. In order to do so, we attempt to reduce the problem to the problem of inverting functions, and when we have this format, we can immediately apply the Fiat Naor algorithm, and receive a tradeoff of $TS^3 = N^3$.
We state Fiat-Naor's theorem again here for clarity.
\begin{theorem}
For any function $F:X \rightarrow X$, and for any choice of values $S,T$ usch that $S^3T \geq |X|^3$, there is a deterministic data structure with space $O(S)$ which allows inverting $F$ at every point making $O(T)$ queries to the memory cells and evaluations of $F$.
\end{theorem}
Want to be able to immediately plug in sizes of input and oracle functions to get tradeoff.
\end{document}