By the end of this unit, you should be able to
- assess the estimated cost of database operations, namely select, sort and join
- describe the external sort algorithm
- describe nested loop join, block nested loop join, index nested loop join, sort merge join and hash join
Recall that in week 2, we investigated the relational model and relational algebra. We argued that relational algebra is a good abstraction for describing data processing operations over the relational model.
Now that we know a bit more about the internal implementation of a physical database (the data structure aspect), we start looking into possible ways of implementing the relational algebra operations in a physical database (the algorithm aspect).
Let
There are several ways to implement this operation.
Knowing that the data in
The cost (in terms of number of Disk I/Os) of using the sequential scanning approach is
A smarter (though not always better) way is to make use of the B+ tree index created over the attributes mentioned in
In this case, assume the B+ Tree index is clustered, i.e. the data in the pages are stored according to the order of the index attribute.
We have to traverse down the B+ Tree to find the boundaries of
Assuming each node in the B+ Tree occupies a page, the cost of this approach is
Where
The first term
When the B+ Tree is not clustered, the estimated cost becomes
A page read is needed for every tuple (in the worst case).
As we can see, when
One possible improvement when using an unclustered index is to
- traverse down the B+ Tree to find the correct starting and ending tuple locations (i.e. page id and slot id),
- sort all the tuple locations by page id, and then
- read the pages in sorted order and filter.
This brings us back to the same estimated cost as the clustered index (modulo the cost of sorting).
Sorting is essential.
From Data Driven World and Algorithms, we learned various sorting algorithms. All of these algorithms assume that the data can fit in RAM. In the setting of database operations, this assumption is no longer valid.
The external sorting algorithm is an instance of the merge sort.
The merge sort algorithm is as follows
- def merge_sort(input)
- if input size < 2
- return input, it is already sorted
- otherwise
- divide input into halves, say, i1 and i2
- call merge_sort(i1) to get s1
- call merge_sort(i2) to get s2
- combine(s1,s2) to get sorted
- return sorted
- def combine(s1, s2)
- create an empty list out
- while s1 is not empty and s2 is not empty
- let e1 be the first element of s1
- let e2 be the first element of s2
- if e1 < e2
- add e1 to out
- remove e1 from s1
- otherwise
- add e2 to out
- remove e2 from s2
- if s1 is not empty
- concatenate s1 to out
- if s2 is not empty
- concatenate s2 to out
- return out
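For concreteness, below is a minimal Python sketch of the same algorithm, mirroring the merge_sort and combine pseudo functions above; it is an in-memory illustration only, and the parameter names are illustrative.

```python
# A minimal in-memory sketch of the merge sort pseudo-code above.
def merge_sort(items):
    if len(items) < 2:
        return items                       # already sorted
    mid = len(items) // 2
    s1 = merge_sort(items[:mid])           # sort the first half
    s2 = merge_sort(items[mid:])           # sort the second half
    return combine(s1, s2)

def combine(s1, s2):
    out, i, j = [], 0, 0
    while i < len(s1) and j < len(s2):
        if s1[i] < s2[j]:                  # move the smaller leading element to out
            out.append(s1[i]); i += 1
        else:
            out.append(s2[j]); j += 1
    out.extend(s1[i:])                     # at most one of these is non-empty
    out.extend(s2[j:])
    return out

print(merge_sort([9, 2, 11, 3, 6, 0]))     # [0, 2, 3, 6, 9, 11]
```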
The external sort algorithm is a generalized version of merge sort, which operates without loading the full set of data into RAM.
The algorithm makes use of the buffer pool.
Let's consider an example. Suppose we have a buffer pool of 3 frames and would like to sort the following 8 pages of data. Each page contains two tuples.
In this phase, we divide the input pages into $\lceil 8/3 \rceil = 3$ runs and sort each run.
From the next phase onwards, we merge the sorted runs into larger runs, until all are merged into a single run.
We merge 3 runs into two runs.
Firstly we merge the two runs on the left into one.
When merging two runs into one, we divide the buffer pool's frames into the input frames and output frame. There is only one output frame, the rest are inputs.
In this running example, we have two input frames and one output frame.
We merge the run [(2,3), (4,6), (9,9)] with the run [(3,5), (6,7), (8,11)]. We use list notation to denote a run, and a tuple to denote a page.
- We load the page (2,3) into an input frame and (3,5) into another input frame.
- We find the smallest leading tuple from both frames, i.e. 2, and move it to the output frame.
- We repeat the same operation and find 3, which is moved to the output frame.
- The output frame is full, so we write it to disk.
- The first input frame is now empty (because, we assume, both 2 and 3 were moved to the output frame). We load the second page (4,6) into this frame.
- The process is repeated until both runs are processed.
The 3rd run [(0,1), (2,)] is alone, hence no merging is required.
At the end of phase 1, we have a new set of runs
[ [(2,3), (3,4), (5,6), (6,7), (8,9), (9,11)], [(0,1), (2,)] ]
In this phase, we merge the output runs from the previous phase into a single run.
We consider the pseudo-code of the external sort algorithm. The algorithm is defined by a pseudo function ext_sort(in_pages, bpool), which has the following input and output.
- input
- bpool - the buffer pool (in RAM)
- in_pages - the pages to be sorted (on disk)
- output
- sorted results (on disk)
We find the pseudo code as follows
- def ext_sort(in_pages, bpool)
- let runs = divide_n_sort(in_pages, bpool)
- while size_of(runs) > 1
- let runs = merge(runs, bpool)
- return runs
At step 1.1, we call a helper function divide_n_sort(in_pages, bpool) to generate the initial runs (phase 0).
Steps 1.2 to 1.3 define the merging phases (phase 1, phase 2, ..., etc). We repeatedly call the helper function merge(runs, bpool)
to merge the current set of runs until all runs are merged into a single run.
Next we consider the helper function divide_n_sort(in_pages, bpool), which has the following input and output
- input
- bpool - the buffer pool (in RAM)
- in_pages - the pages to be sorted (on disk)
- output
- runs - a list of lists (on disk). Each inner list (a run) consists of a set of sorted data.
- def divide_n_sort(in_pages, bpool)
- let count = 0
- let m = size_of(bpool)
- initialize runs to be an empty list (on disk)
- while (m * count) < size_of(in_pages)
- load the m pages from in_pages at offset (m * count)
- sort data in bpool
- group these m sorted pages into one run (on disk)
- append run to runs (on disk)
- increase count by 1
- return runs
We consider the pseudo code of the function merge, merge(runs, bpool)
- input
- bpool - the buffer pool (in RAM)
- runs - a list of lists (on disk). Each inner list (a run) consists of a set of sorted data.
- output
- next_runs - the runs in next phase (on disk)
- def merge(runs, bpool)
- initialize next_runs to be an empty list (on disk)
- let m = size_of(bpool)
- let l = size_of(runs)
- divide bpool into m-1 and 1 frames
- let in_frames = m-1 frames of bpool for input
- let out_frame = 1 frame of bpool for output
- let count = 0
- while (m-1)*count < l
- let batch = extract m-1 runs from runs at offset (m-1) * count
- let out_run = merge_one(batch, in_frames, out_frame)
- append out_run to next_runs
- increment count by 1
- return next_runs
The merge function processes the runs by merging every batch (consisting of m-1 runs) into a larger run for the next phase, where m is the size of the buffer pool. It uses another helper function merge_one(batch, in_frames, out_frame) to merge a batch, which has the following signature and definition
- input
- batch - a segment of the global runs. Invariant: size_of(batch) <= size_of(in_frames)
- in_frames - frames set aside for input pages
- out_frame - one frame set aside for output
- output
- output_run (on disk)
- def merge_one(batch, in_frames, out_frame)
- initialize output_run to be an empty collection
- let m = size_of(in_frames)
- while (batch is not empty) or not (all in_frames are empty)
  - for i in (0 to m-1)
    - if in_frames[i] is empty and batch[i] is not empty
      - load the next page of batch[i] into in_frames[i]
  - find the smallest record among the leading tuples of in_frames[0] to in_frames[m-1], and move it to out_frame
  - if out_frame is full, write it to output_run
- return output_run
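Below is a minimal Python sketch that simulates ext_sort, divide_n_sort and merge in memory. It only illustrates the control flow, not the actual disk I/O: pages are plain Python lists, the buffer pool is represented only by its size m, and heapq.merge stands in for the frame-by-frame merging done by merge_one. The constant PAGE_SIZE and the sample pages below are illustrative (chosen to match the running example).

```python
# An in-memory simulation of the external sort; no real disk I/O happens here.
from heapq import merge as k_way_merge

PAGE_SIZE = 2  # records per page, as in the running example

def paginate(records):
    """Group a sorted list of records back into pages of PAGE_SIZE records."""
    return [records[j:j + PAGE_SIZE] for j in range(0, len(records), PAGE_SIZE)]

def divide_n_sort(in_pages, m):
    """Phase 0: sort every group of m pages into one run (a list of pages)."""
    runs = []
    for i in range(0, len(in_pages), m):
        records = sorted(r for page in in_pages[i:i + m] for r in page)
        runs.append(paginate(records))
    return runs

def merge(runs, m):
    """One merge pass: merge every batch of m-1 runs into a single larger run."""
    next_runs = []
    for i in range(0, len(runs), m - 1):
        batch = runs[i:i + m - 1]
        streams = [[r for page in run for r in page] for run in batch]
        next_runs.append(paginate(list(k_way_merge(*streams))))
    return next_runs

def ext_sort(in_pages, m):
    runs = divide_n_sort(in_pages, m)
    while len(runs) > 1:
        runs = merge(runs, m)
    return runs[0]

# The running example: 8 pages, buffer pool of 3 frames.
pages = [[2, 3], [9, 9], [4, 6], [3, 5], [8, 11], [6, 7], [0, 1], [2]]
print(ext_sort(pages, m=3))
# [[0, 1], [2, 2], [3, 3], [4, 5], [6, 6], [7, 8], [9, 9], [11]]
```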
Assuming the size of in_pages is $n$ and the size of bpool is $m$, we can estimate the I/O cost of ext_sort as follows.
The ext_sort function has two main parts: phase 0, i.e. the divide_n_sort function call, and the while loop, i.e. the phases $i$ where $i > 0$.
- In divide_n_sort, we read each page to be sorted once and write it back to disk once. Hence the cost is $2 \cdot n$.
- In each iteration of the while loop in the ext_sort function, we merge every $m-1$ runs into a single run, until all runs are merged. There are $\lceil \log_{m-1}(\lceil n/m \rceil) \rceil$ iterations. In each iteration, we read and write each page once. Hence the cost is $2 \cdot n \cdot \lceil \log_{m-1}(\lceil n/m \rceil) \rceil$.
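As a quick check against the running example: with $n = 8$ pages and $m = 3$ frames, phase 0 costs $2 \cdot 8 = 16$ I/Os and produces $\lceil 8/3 \rceil = 3$ runs; there are then $\lceil \log_{2}(3) \rceil = 2$ merge phases, each costing $2 \cdot 8 = 16$ I/Os, so the total estimated cost is $16 + 32 = 48$ I/Os.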
Lastly, we consider the implementation of the join operation.
Let
- The most naive implementation is to compute $\sigma_{c}(R \times S)$. It is costly because of the computation of the Cartesian product.
- Alternatively, we may use a nested loop.

One possible approach is to use a nested loop:
- for each tuple $t$ in $R$
  - for each tuple $u$ in $S$
    - if $t$ and $u$ satisfy $c$, output $t$ and $u$.
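A minimal Python sketch of this tuple-based nested loop join, assuming the relations are lists of tuples and the predicate $c$ is given as a function (the names are illustrative):

```python
# A sketch of the tuple-based nested loop join; R and S are lists of tuples
# and c is the join predicate, passed in as a Python function.
def nested_loop_join(R, S, c):
    out = []
    for t in R:                  # outer relation
        for u in S:              # inner relation, scanned once per tuple of R
            if c(t, u):
                out.append(t + u)
    return out

# Example: an equi-join on the first attribute of R and the first attribute of S.
R = [(1, "a"), (2, "b")]
S = [(2, "x"), (3, "y")]
print(nested_loop_join(R, S, lambda t, u: t[0] == u[0]))
# [(2, 'b', 2, 'x')]
```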
The cost of this approach is
If we flip the outer/inner relation roles of
One obvious issue with the nested loop join is that it only utilizes three frames of the buffer pool, 1 frame for
We could speed up nested loop join by utilizing all frames of the buffer pool.
Assuming the buffer pool is of size $m$:
- for each $m-2$ pages in $R$, we extract each tuple $t$
  - for each page in $S$, we extract each tuple $u$
    - if $t$ and $u$ satisfy $c$, output $t$ and $u$.
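A sketch in the same style, now with the relations given as lists of pages (each page a list of tuples) so that the $m-2$ outer frames are explicit; the names are illustrative:

```python
# A sketch of the block nested loop join; R_pages and S_pages are lists of
# pages (each page a list of tuples), m is the buffer pool size and c is the
# join predicate.
def block_nested_loop_join(R_pages, S_pages, c, m):
    out = []
    for i in range(0, len(R_pages), m - 2):
        # m-2 frames hold a block of R; S is rescanned once per block, not per tuple
        block = [t for page in R_pages[i:i + m - 2] for t in page]
        for s_page in S_pages:           # 1 frame holds the current page of S
            for u in s_page:
                for t in block:
                    if c(t, u):
                        out.append(t + u)   # 1 frame buffers the output
    return out
```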
The cost of this approach is
If the join predicate
- for each tuple $t$ in $R$
  - find the tuple $u$ in $S$ with $u.b = t.c$ using the index
    - output $t$ and $u$.
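A sketch of the idea, where a pre-built Python dict stands in for the index on $S.b$ (in a real system the probe would be a B+ tree search); the names are illustrative:

```python
# A sketch of the index nested loop join for the predicate u.b = t.c.
# The dict index_on_b stands in for an existing index on S.b.
def index_nested_loop_join(R, index_on_b):
    out = []
    for t in R:
        for u in index_on_b.get(t["c"], []):   # one index probe per tuple of R
            out.append((t, u))
    return out

# Example usage: build the stand-in index over S once, then probe it with R.
S = [{"b": 2, "v": "x"}, {"b": 3, "v": "y"}]
index_on_b = {}
for u in S:
    index_on_b.setdefault(u["b"], []).append(u)
R = [{"c": 2}, {"c": 5}]
print(index_nested_loop_join(R, index_on_b))
# [({'c': 2}, {'b': 2, 'v': 'x'})]
```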
The cost of this approach is
An extreme case where
The fourth alternative is the sort merge join.
1. Sort $R$ by the attribute used in $c$
2. Sort $S$ by the attribute used in $c$
3. Merge the sorted $R$ and the sorted $S$, as in the merge phase of the external sort
The cost of step 1 is
The cost of step 2 is
The cost of step 3 is
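A minimal in-memory sketch of the sort merge join for an equi-join $R.a = S.b$; a real implementation would use the external sort above for steps 1 and 2, and the attribute names are illustrative:

```python
# A sketch of the sort merge join for R.a = S.b, with both inputs in memory.
def sort_merge_join(R, S):
    R = sorted(R, key=lambda t: t["a"])
    S = sorted(S, key=lambda u: u["b"])
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i]["a"] < S[j]["b"]:
            i += 1
        elif R[i]["a"] > S[j]["b"]:
            j += 1
        else:
            # collect the group of matching S tuples, then pair it with every
            # R tuple carrying the same key (handles duplicates on both sides)
            key, j_start = R[i]["a"], j
            while j < len(S) and S[j]["b"] == key:
                j += 1
            while i < len(R) and R[i]["a"] == key:
                out.extend((R[i], u) for u in S[j_start:j])
                i += 1
    return out
```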
The fifth alternative is to make use of an in-memory hash table.
Assuming we hash the join attribute used in
- for each tuple $u$ in $S$
  - add $u$ (projected attributes) to HT
- for each tuple $t$ in $R$
  - look up $t$'s attribute in HT
    - output $t$ and the value in HT
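A minimal sketch of this in-memory hash join for an equi-join $R.a = S.b$, where a Python dict plays the role of HT (names are illustrative):

```python
# A sketch of the in-memory hash join for R.a = S.b; the dict HT is the
# in-memory hash table built over S and probed with tuples of R.
def hash_join(R, S):
    HT = {}
    for u in S:                                   # build phase over S
        HT.setdefault(u["b"], []).append(u)
    out = []
    for t in R:                                   # probe phase over R
        for u in HT.get(t["a"], []):
            out.append((t, u))
    return out
```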
The cost of this approach is
The Hash Join is perhaps impractical, as in most cases the hash table cannot fit in RAM. The Grace Hash Join addresses this by first partitioning both relations on disk.
The assumption is that there exists a hash function $h$ such that either
- $R$ can be partitioned into $n$ partitions by $h$ and $n \leq m - 1$, and none of the partitions has more than $m-1$ pages, or
- $S$ can be partitioned into $n$ partitions by $h$ and $n \leq m - 1$, and none of the partitions has more than $m-1$ pages.
The algorithm works as follows
- partition $R$ and $S$ by $h$
- if $S$ is the relation that satisfies the assumption
  - for each partition $p_i$ of $S$
    - load the pages in $p_i$ into the buffer pool, and build an in-memory hash table using a different hash function $h_2$
    - load data (page by page) from the partition $q_i$ of $R$ (values in $q_i$ and $p_i$ share the same hash value under $h$)
    - look up the data in the hash table with $h_2$, and output if a match is found.
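A minimal in-memory sketch of the Grace Hash Join for an equi-join $R.a = S.b$, assuming integer join keys so that a simple modulo can serve as the partitioning hash function $h$; Python's built-in dict hashing stands in for $h_2$, and the names and number of partitions are illustrative:

```python
# A sketch of the Grace Hash Join for R.a = S.b with n partitions.
def grace_hash_join(R, S, n):
    h = lambda v: v % n                           # partitioning hash function h
    R_parts = [[t for t in R if h(t["a"]) == i] for i in range(n)]
    S_parts = [[u for u in S if h(u["b"]) == i] for i in range(n)]
    out = []
    for q_i, p_i in zip(R_parts, S_parts):        # matching partitions of R and S
        HT = {}                                   # in-memory table over p_i (role of h_2)
        for u in p_i:
            HT.setdefault(u["b"], []).append(u)
        for t in q_i:                             # in reality, read q_i page by page
            out.extend((t, u) for u in HT.get(t["a"], []))
    return out
```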
The total cost of the Grace Hash Join is