By the end of this unit, you should be able to
- Describe the different layout options of a Heap Page
- Describe the different layout options of a Heap File
- Explain the purposes of a database index.
- Explain the different collision handling mechanisms of a Hash Table
- Explain the structure of a B+ Tree
- Explain the lookup / insert / delete operations over a B+ Tree.
- Describe the difference between a clustered B+ Tree index and an unclustered B+ Tree index.
As discussed in the earlier classes, through SQL (and relational algebra) we scan data in a table, search for data based on filtering conditions, update data, and insert and delete data.
We consider the actual implementation of these access methods.
Like any other software implementation, we need to consider the data structure together with the algorithm.
Recall that inside a data page, we find
- a set of tuples/records
- a header
- the index
- the log
Let's consider the first two components.
There are a few options for how tuples are organized in a page.
Tuples are stored inside a page, one after another, like a fixed-size array. Elements inside the array, i.e. the tuples, are unordered. We often call this page structure a Heap Page.
One advantage is that the sequential layout is easy to implement, especially for sequential scanning and insertion.
The clear disadvantage is that after some tuples are deleted,
- we see holes in between tuples
- the space utilization is unclear
In this version we improve over the previous version of the Heap Page by including a header. The header is an array of bits whose size equals the maximum number of tuples in a page (with the metadata discounted). If the i-th bit is 0, the i-th tuple slot is unused; otherwise, the slot is used.
With this approach we address the deletion issue. To insert a tuple, we need to scan through the header array to look for a free slot.
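As a concrete illustration, here is a minimal Python sketch of such a page; the class and method names are made up for illustration, and a real page stores raw bytes rather than Python lists.

```python
class HeapPage:
    """A heap page with fixed-size slots and a bitmap header.
    Bit i is 1 iff slot i currently holds a tuple."""

    def __init__(self, num_slots):
        self.used = [0] * num_slots    # the header bitmap
        self.slots = [None] * num_slots

    def insert(self, tuple_):
        # scan the header for the first free slot
        for i, bit in enumerate(self.used):
            if bit == 0:
                self.slots[i] = tuple_
                self.used[i] = 1
                return i               # slot number, part of the record id
        return None                    # page is full

    def delete(self, slot_no):
        # no hole management needed: just clear the bit
        self.used[slot_no] = 0
        self.slots[slot_no] = None

page = HeapPage(4)
rid = page.insert(("b1", "Database Systems", 2019))
page.delete(rid)
```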
There are still some issues to be addressed.
- we have to assume that all tuples are of a single fixed size
- it is hard to search for a particular tuple except by sequential scan.
The first issue can be addressed by allowing a long tuple to span multiple slots; we need to store extra header information.
The second issue will be addressed altogether when we consider the internal layout of a database file.
From the earlier section, we note that a database file stores data belonging to a particular table. In the database file, we find a collection of pages.
The issues related to layout of database files are similar to those with pages. We have a few options.
A simple approach: we store one page after another. The good news is that all pages are of the same size.
A similar issue with this approach is that it is not easy to find pages with free space.
In this approach, we maintain two linked lists in the file as metadata. One of the lists stores all pages which are full. The other stores all pages which are not yet full.
An alternative to the linked lists approach is to use a set of directories. Each directory contains a fixed set of slots. Each slot contains a pointer to (i.e. the location of) a page and a free-space bit. A directory slot may also point to another directory if expansion is required. Directories are small in size so that they can fit in memory. As the file grows with more pages, a new directory can be created just like a data page.
In the event that we would like to keep track of how much free space each page has, we could change the free-space bit into a free-space integer field. This change supports the need for variable-length tuples.
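Below is a minimal sketch of the directory idea with the free-space integer field; the names and the in-memory representation are illustrative assumptions, not a real on-disk format.

```python
class DirectorySlot:
    def __init__(self, page_id, free_space):
        self.page_id = page_id        # pointer to (location of) a page
        self.free_space = free_space  # bytes (or slots) still free

class Directory:
    def __init__(self, capacity):
        self.slots = []               # a fixed set of slots
        self.capacity = capacity      # maximum slots per directory
        self.next = None              # pointer to an overflow directory

    def add_page(self, page_id, free_space):
        # chain a new directory once this one is full
        if len(self.slots) >= self.capacity:
            if self.next is None:
                self.next = Directory(self.capacity)
            self.next.add_page(page_id, free_space)
        else:
            self.slots.append(DirectorySlot(page_id, free_space))

    def find_page_with_space(self, needed):
        # directories are small, so this scan stays in memory
        for slot in self.slots:
            if slot.free_space >= needed:
                return slot.page_id
        if self.next is not None:
            return self.next.find_page_with_space(needed)
        return None

root = Directory(capacity=64)
root.add_page(page_id=0, free_space=120)
print(root.find_page_with_space(100))   # -> 0
```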
Another alternative we could consider is that instead of storing data as tuples, we could store them by column, which is known as columnar storage. We omit the details here and defer the discussion to the cohort lesson.
We have considered the internal layout of a page and the internal layout of a file. To maintain the free/occupied slot information, we maintain metadata in pages as well as in files.
It is clear that storing pages sequentially in a file has certain advantages, in particular for sequential scans. The following table summarizes the average cost of data operations performed on a data file with sequential layout and one with random layout.

| | sequential | random |
|---|---|---|
| Insert a record | | |
| Lookup a record | | |
| Delete a record | | |
One issue yet to be addressed is to enable searching for a tuple without resorting to sequential scanning, which is expensive.
A way to resolve the search issue with a Heap File is to incorporate indices. For instance, we may want to search for books that were published in the year 2019.
An index maps values of the searching criteria into a set of record ids. Each record id stores the exact location of a tuple, i.e. page id and tuple slot number.
A table may have multiple indices.
The idea of indices is similar to adding labels to a thick physical textbook. The labels enable us to quickly "jump" to a section of interest.
There are many different options for an index.
A hash table maps values of searching criteria into the target set via a hash function. The input to the hash function is the value, and the output is the set of locations of the records.
The perfect hash function should be

- efficient: it takes little time to compute the hash;
- collision-free (or as close to it as possible): for all $x_1 \neq x_2$, $hash(x_1) \neq hash(x_2)$.
In reality, collisions can't be eliminated.
To deal with collisions, there are two approaches on the storage level:

- Open Hashing (a.k.a. separate chaining): we store collided objects in a linked list hanging off the same cell (see the sketch below). In the above diagram, the values 1000 and 9530 are hashed into the same cell of the hash table, hence 9530 is stored in a chain.
- Closed Hashing (a.k.a. open addressing): we try to store collided objects in the same hash table.
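As a concrete illustration of open hashing, here is a minimal separate-chaining sketch; the table size and the use of Python's built-in `hash` are arbitrary choices.

```python
class ChainedHashTable:
    """Open hashing: each cell holds a chain of collided entries."""

    def __init__(self, size=8):
        self.cells = [[] for _ in range(size)]

    def insert(self, key, value):
        self.cells[hash(key) % len(self.cells)].append((key, value))

    def lookup(self, key):
        # all entries hashed to this cell live in one chain; scan it
        for k, v in self.cells[hash(key) % len(self.cells)]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.insert(1000, "record for 1000")
t.insert(9530, "record for 9530")   # may land in the same chain as 1000
print(t.lookup(9530))
```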
We discuss a few alternatives of closed hashing.
Linear Probing is a closed hashing collision resolution scheme. The idea is that if two values are hashed to the same cell of the hash table, one of them will be stored in the next available cell.
In the diagram above, both values A and C are hashed into the same cell of the hash table. When we try to insert C, we have to place C in the first available cell after the cell where A resides.
First of all, for this to work, we need to know the maximum size of the table, so that we can wrap around to the beginning when the collided cell is the last element of the table. Secondly, the search time for a value given a key is linear in the size of the hash table in the worst case.
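A minimal linear probing sketch illustrating both observations, the modulo wrap-around and the worst-case linear scan; again, the table size and the use of `hash` are arbitrary choices.

```python
class LinearProbingTable:
    def __init__(self, size=8):
        self.keys = [None] * size
        self.vals = [None] * size

    def insert(self, key, value):
        i = hash(key) % len(self.keys)
        for _ in range(len(self.keys)):      # at most one full pass
            if self.keys[i] is None or self.keys[i] == key:
                self.keys[i], self.vals[i] = key, value
                return
            i = (i + 1) % len(self.keys)     # wrap around at the end
        raise RuntimeError("table is full")

    def lookup(self, key):
        i = hash(key) % len(self.keys)
        for _ in range(len(self.keys)):      # worst case: scan whole table
            if self.keys[i] is None:
                return None
            if self.keys[i] == key:
                return self.vals[i]
            i = (i + 1) % len(self.keys)
        return None

t = LinearProbingTable()
t.insert("A", 1)
t.insert("C", 2)   # if "C" collides with "A", it goes to the next free cell
print(t.lookup("C"))
```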
This hashing technique is named after the cuckoo bird. A cuckoo chick hatched in another bird's nest pushes the host's eggs and chicks out of the nest to claim it for itself; similarly, an insertion may evict a value that currently occupies a cell and force it to find a new home.
Cuckoo hashing operates with multiple hash functions $H_1, \ldots, H_n$, one for each of $n$ tables.

- Given a value $v$ to be stored, we hash it with all the hash functions.
- Given the hashed values $cell_1, \ldots, cell_n$, we search for the first $cell_i$ such that $cell_i$ in $table_i$ is free.
  - If found, we put $v$ at $cell_i$ in $table_i$.
  - Otherwise, we pick one $cell_j$ which is currently occupied by some value $u$:
    - we replace $u$ by $v$;
    - we go to look for a new cell for $u$ (recursive call).
We illustrate the idea using the following example. Let $H_1$ and $H_2$ be the hash functions of table 1 and table 2 respectively.

- Step one: we would like to insert $A$. We put $A$ in table 1 at cell $H_1(A)$.
- Step two: we want to insert $B$. We put $B$ in table 2 at cell $H_2(B)$, because $H_1(B)$ collides with $H_1(A)$.
- Step three: we want to insert $C$. We realize that both $H_1(C)$ and $H_2(C)$ are occupied due to collisions. We vacate $B$ and insert $C$. We need to re-insert $B$; we put it at cell $H_1(B)$ and vacate $A$. We re-insert $A$ at $H_2(A)$. Now all values are in the hash tables.
The advantage of this approach is that the lookup cost is always $O(1)$: we probe exactly one cell in each of the $n$ tables.
One issue with Cuckoo Hashing is that an insertion may run into a cycle of evictions; when that happens, we need to rehash everything with new hash functions.
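A minimal sketch with $n = 2$ tables; the hash functions, the table size, and the eviction cap used to detect the rehash case are all illustrative assumptions.

```python
# Cuckoo hashing sketch with two tables. We cap the number of
# evictions so that a cycle is detected and reported.
SIZE = 11

def h1(v): return hash(v) % SIZE
def h2(v): return (hash(v) // SIZE) % SIZE

tables = [[None] * SIZE, [None] * SIZE]
hashes = [h1, h2]

def insert(v, max_evictions=32):
    for _ in range(max_evictions):
        for t, h in zip(tables, hashes):
            if t[h(v)] is None:          # first free cell wins
                t[h(v)] = v
                return True
        # no free cell: evict the occupant of table 0 and retry with it
        cell = hashes[0](v)
        tables[0][cell], v = v, tables[0][cell]
    return False                          # cycle detected: rehash needed

def lookup(v):
    # always O(1): probe exactly one cell per table
    return any(t[h(v)] == v for t, h in zip(tables, hashes))

for v in (1000, 9530, 417):
    insert(v)
print(lookup(9530))   # True
```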
The idea behind Bucket Hashing (a.k.a. extensible hashing) is to

- store hashed values in buckets (relatively small fixed-size arrays, so that sequential search within a bucket is not expensive). All buckets have the same size.
- use the $n$ least significant bits (LSBs) of the hashed value to decide in which bucket it should be placed.
- increase $n$ and add new buckets gradually as buckets become full.
The Bucket Hashing algorithm starts with a global slot array and a single bucket. It maintains a global integer $G$, the number of LSBs used by the slot array, and one integer $L$ per bucket, the number of LSBs shared by all values stored in that bucket. For simplicity, we treat a value $X$ and its hashed value interchangeably. To insert a value $X$:

- look up the bucket for $X$ based on the last $G$ bits of $X$.
  - if the bucket $i$ is not full, insert $X$ there.
  - otherwise
    - if the bucket $i$ has $L < G$
      - add a new bucket $j$ with the same $L$ as $i$,
      - redistribute data from $i$ between $i$ and $j$,
      - increase $L$ for both buckets $i$ and $j$,
      - update the slot array.
    - otherwise
      - double the slot array,
      - add a new bucket $j$ with the same $L$ as $i$,
      - redistribute data from $i$ between $i$ and $j$,
      - increase $L$ for both buckets $i$ and $j$,
      - increase $G$,
      - update the slot array.
For example, we start with the following empty table.

First, we insert the hashed values `00000`, `01100` and `11010`.

Next, we insert the hashed value `10001`, but the bucket is full. We need to create a new bucket with $L = 1$ and redistribute the values according to their last bit.

In the third step, we insert the hashed values `01011` and `01111`. Both values go to the new bucket.

In the fourth step, we insert another value `01110`. The bucket storing values whose 1-bit LSB is `0` is full. We double the slot array, add a new bucket with $L = 2$, redistribute the values based on their last two bits, increase $G$, and update the slot array.
To look up a value $X$:

- look up the slot array using the $G$ LSBs of (the hash of) $X$.
  - if the bucket is found, scan through the bucket for $X$ sequentially;
  - otherwise, report not found.
The insertion time and lookup time are both constant on average: the slot array lookup is $O(1)$ and each bucket has a fixed size, although an insertion occasionally has to split a bucket or double the slot array.
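The following sketch puts the insert and lookup algorithms together; the bucket capacity, the use of plain integers as already-hashed values, and the recursive retry after a split are simplifying assumptions.

```python
BUCKET_SIZE = 3

class Bucket:
    def __init__(self, local_bits):
        self.L = local_bits           # values here share their last L bits
        self.items = []

class ExtensibleTable:
    def __init__(self):
        self.G = 0                    # slot array uses the last G bits
        self.slots = [Bucket(0)]      # slot array of size 2 ** G

    def insert(self, x):
        b = self.slots[x % (2 ** self.G)]   # last G bits of x
        if len(b.items) < BUCKET_SIZE:
            b.items.append(x)
            return
        if b.L == self.G:             # no spare bit: double the slot array
            self.slots = self.slots + self.slots
            self.G += 1
        b.L += 1                      # split b on one more LSB
        j = Bucket(b.L)
        old, b.items = b.items, []
        for v in old:                 # redistribute by the new bit
            (b if (v >> (b.L - 1)) & 1 == 0 else j).items.append(v)
        for i in range(len(self.slots)):    # repoint slots whose bit is 1
            if self.slots[i] is b and (i >> (b.L - 1)) & 1:
                self.slots[i] = j
        self.insert(x)                # retry; may trigger another split

    def lookup(self, x):
        return x in self.slots[x % (2 ** self.G)].items

t = ExtensibleTable()
for v in (0b00000, 0b01100, 0b11010, 0b10001, 0b01011, 0b01111, 0b01110):
    t.insert(v)                       # replays the worked example above
print(t.lookup(0b11010))              # True
```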
One disadvantage of a hash table is that it only supports point queries. For example, it works well with the query
select * from book where book_id = 'b1';
because we could hash `book_id`.
Hashing does not help to speed up range search queries such as
select * from book where date >= '2019-01-01' and date <= '2019-12-31';
Even if we hash all `date` values, there is no guarantee that values in the same range are stored in the same bucket/page/block.
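A tiny illustration of the problem; the bucket count is arbitrary, and Python's string `hash` varies across runs, which only reinforces the point.

```python
dates = ["2019-01-01", "2019-01-02", "2019-01-03"]
buckets = [hash(d) % 8 for d in dates]   # 8 buckets, arbitrary
print(buckets)  # e.g. [5, 0, 3]: adjacent dates scatter across buckets,
                # so a range scan degenerates into touching many pages
```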
A binary search tree is a binary tree in which, for any node, all values in its left sub-tree are smaller than the node's value and all values in its right sub-tree are larger.
```mermaid
graph TD
N0("10")-->N1("7")
N0("10")-->N2("20")
N1("7")-->N3("5")
N1("7")-->N4("9")
N2("20")-->N5("15")
N2("20")-->N6("26")
N3("5")-->N7("1")
N3("5")-->N8("6")
```
A binary tree is balanced iff for any non-leaf node $n$, the height difference between the left and the right sub-trees does not exceed 1.
A binary tree is perfectly balanced iff for any non-leaf node $n$, the height difference between the left and the right sub-trees is 0.
A B+ Tree is a generalization of a perfectly balanced binary search tree, with the following adjustments:

- Let $d$ denote the order.
- The root node may have $1$ to $2*d$ entries.
- Each non-root node may have $d$ to $2*d$ entries (at least half-full).
- Each entry in a node consists of a key and a value, except for the first entry, i.e. $v_1, (k_2, v_2), (k_3, v_3), \ldots, (k_n, v_n)$.
- The values in non-leaf nodes are references to the children.
- The values in the leaf nodes are references to records in the data file/page/block.
- Given two consecutive entries $(k_i, v_i), (k_{i+1}, v_{i+1})$, where $v_i$ is a reference to a subtree, the index values of all records in that subtree must be between $k_i$ and $k_{i+1}$. For the first entry, the lower bound is defined by the key coming from the parent. (A similar observation applies to the last entry.)
- All the leaf nodes form a doubly linked list, which enables the range search operation.
For example, the following figure shows a B+ Tree with $d = 2$.
To search for a value with key $k$, we start with the root node as the current node:

- find the value $v$ between keys $k_1$ and $k_2$ such that $k_1 \leq k < k_2$. Note that $k_1$ might not exist if $k_2$ is the first entry's key, and $k_2$ might not exist if $k_1$ is the last entry's key.
  - if the current node is a non-leaf node, we move the current node to the node pointed to by $v$ and repeat the process recursively;
  - if the current node is a leaf node, we return the disk data pointed to by $v$.
It is clear that the time complexity is $O(\log(N))$, where $N$ is the number of records, since each step descends one level of a balanced tree.
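To make the descent concrete, here is a minimal in-memory sketch; the `Inner`/`Leaf` classes are illustrative rather than an on-disk format, and the leaves here store explicit (key, record id) pairs.

```python
class Inner:
    def __init__(self, keys, children):
        self.keys = keys            # [k2, ..., kn]
        self.children = children    # [v1, ..., vn]; len = len(keys) + 1

class Leaf:
    def __init__(self, entries):
        self.entries = entries      # [(key, record_id), ...], sorted

def search(node, k):
    while isinstance(node, Inner):
        i = 0
        while i < len(node.keys) and k >= node.keys[i]:
            i += 1                  # k_i <= k < k_{i+1} picks child v_i
        node = node.children[i]
    for key, rid in node.entries:   # at a leaf: exact-match scan
        if key == k:
            return rid              # e.g. (page id, slot number)
    return None

# tiny hand-built example tree
root = Inner([10], [Leaf([(5, "rid5"), (9, "rid9")]),
                    Leaf([(10, "rid10"), (15, "rid15")])])
print(search(root, 15))             # -> "rid15"
```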
To insert a value with key $k$:

- find the leaf node $L$ with the current interval where $k$ should fit (same as the look-up operation);
- insert_to_leaf($L$, $k$)
- def insert_to_leaf($L$, $k$)
  - if $L$ is not full, just insert; we are done!
  - otherwise
    - create a new node $L'$;
    - (update the linked list, only for leaf nodes) let $L'' = succ(L)$, then $succ(L) = L'$ and $succ(L') = L''$;
    - redistribute entries evenly between $L$ and $L'$;
    - copy up the middle key, with the value as a pointer to $L'$, that is, insert a new data entry in the parent node: insert_to_nonleaf(parent($L$), middle_key)
- def insert_to_nonleaf($N$, $k$)
  - if $N$ is not full, just insert; we are done!
  - otherwise
    - create a new node $N'$;
    - redistribute entries evenly between $N$ and $N'$;
    - push up (note, not copy) the middle key, with the value as a pointer to $N'$:
      - if $N$ is not a root node, insert_to_nonleaf(parent($N$), middle_key)
      - otherwise, create an empty node as the new root node, then insert_to_nonleaf(parent($N$), middle_key)
For example, we would like to insert the entry `8` into the following B+ Tree with $d = 2$.
We locate the leaf node. However, it is full: it already has 4 entries, the maximum for $d = 2$. We create a new leaf node and redistribute the entries evenly.
We copy up the middle key to the parent. Now we need to insert `5` into the root node.
The root node is full, so we must split it: create a new node, redistribute the entries, create a new root node, and push the middle key `17` up to the new root node.
The time complexity of the insertion algorithm is $O(\log(N))$, since in the worst case the splitting cascades from the leaf back up to the root.
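Here is a sketch of just the leaf-split step: redistribute the entries evenly and copy up the middle key. The leaf contents below are assumed for illustration; with these contents, the copied-up key is 5, in line with the worked example above.

```python
def split_leaf(entries, d):
    """entries: the 2*d old entries plus the new one, as sorted
    (key, rid) pairs. Returns (left, right, middle_key)."""
    mid = len(entries) // 2
    left, right = entries[:mid], entries[mid:]
    # the middle key is *copied* up: it remains in the right leaf
    # (contrast with non-leaf splits, where the middle key is pushed
    # up and removed from both halves)
    return left, right, right[0][0]

# assumed leaf contents; inserting key 8 into a full leaf of order d = 2
overfull = [(1, "r1"), (3, "r3"), (5, "r5"), (7, "r7"), (8, "r8")]
left, right, mid_key = split_leaf(overfull, d=2)
print(left, right, mid_key)   # the middle key 5 is copied up to the parent
```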
Given a node $N$, we say $N$ is at least half-full iff it contains at least $d$ entries.
To delete a value with key $k$:

- find the leaf node $L$ with the current interval where $k$ should fit (same as the look-up operation);
- delete_from_leaf($L$, $k$)
- def delete_from_leaf($L$, $k$)
  - remove the entry with $k$;
  - if $L$ is at least half-full (i.e. $|L - \{k\}| \geq d$), we are done!
  - otherwise
    - if $L$ has a sibling $L'$, and $k'$ is the key from the parent that divides $L$ and $L'$, such that $|L \cup \{k'\} \cup L' - \{k\}| \geq 2*d$
      - find the new middle key in $L \cup \{k'\} \cup L' - \{k\}$, say $k''$, and replace $k'$ by $k''$ in $parent(L)$;
      - if $|L \cup \{k'\} \cup L' - \{k\}| - 1 \geq 2*d$, redistribute $L \cup \{k'\} \cup L' - \{k, k''\}$ among the two leaf nodes;
      - otherwise, redistribute $L \cup \{k'\} \cup L' - \{k\}$ among the two leaf nodes, i.e. $k''$ is copied up.
    - otherwise
      - merge $L$ with its sibling $L'$ into a new node as $L \cup \{k'\} \cup L' - \{k\}$, where $k'$ is the key from the parent dividing $L$ and $L'$;
      - delete_from_nonleaf($parent(L)$, $k'$)
- def delete_from_nonleaf($N$, $k$)
  - remove the entry with $k$;
  - if $N$ is a root node and $|N - \{k\}| > 0$, we are done!
  - if $N$ is a root node and $|N - \{k\}| == 0$, we remove $N$ entirely;
  - if $N$ is not a root node and $N$ is at least half-full, we are done;
  - otherwise
    - if $N$ has a sibling $N'$, and $k'$ is the key from the parent that divides $N$ and $N'$, such that $|N \cup N' - \{k\}| \geq 2*d$
      - find the new middle key in $N \cup \{k'\} \cup N'$, say $k''$, replace $k'$ by $k''$ in $parent(N)$, and redistribute $N \cup N' - \{k\}$ among the two nodes.
    - otherwise
      - merge $N$ with its sibling $N'$ into a new node as $N \cup \{k'\} \cup N' - \{k\}$, where $k'$ is the key from the parent dividing $N$ and $N'$;
      - delete_from_nonleaf($parent(N)$, $k'$)
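Here is a sketch of the two leaf-level cases, borrow versus merge. It simplifies the algorithm above by leaving out the parent separator $k'$ from the combined set, since for leaves $k'$ duplicates a leaf key; all names are illustrative, and parent-key maintenance is reflected only in the returned values.

```python
def delete_from_leaf(left, right, k, d):
    """left, right: sorted (key, rid) lists of two sibling leaves.
    Removes key k, then either redistributes (borrow) or merges.
    Returns (left, right, new_parent_key); right is None on a merge."""
    combined = [(key, rid) for key, rid in left + right if key != k]
    if len(combined) >= 2 * d:            # enough to fill two legal leaves
        mid = len(combined) // 2
        left, right = combined[:mid], combined[mid:]
        return left, right, right[0][0]   # new middle key replaces k' in parent
    # otherwise merge into one leaf; caller must delete k' from the parent
    return combined, None, None

# borrowing 24 after deleting 20, as in the worked example below (d = 2)
print(delete_from_leaf([(19, "r19"), (20, "r20")],
                       [(24, "r24"), (27, "r27"), (29, "r29")], 20, 2))
# -> ([(19, 'r19'), (24, 'r24')], [(27, 'r27'), (29, 'r29')], 27)
```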
For example, continuing from the result of the previous example:
We delete the key `22`. Since the node is still half-full, no merging or redistribution occurs.
Next, we delete key `20`, which makes the leaf node no longer half-full. We borrow the entry `24` from the sibling on the right. The key `24` in the parent node is replaced by the new middle key `27`.
In the third step, we delete key `24`. We merge the remaining entry `19` with `27` and `29` into a single leaf node.
As a consequence, we need to delete `27` from the parent node.
The merging effect cascades up, since the non-leaf node containing `30` is no longer half-full.
The time complexity of deletion is also $O(\log(N))$.
A B+ Tree index is clustered iff the data in the heap file are sorted according to the index attribute (the keys in the B+ Tree).
Due to its physical storage nature, there is only one clustered index per table. We may have many unclustered indices per table.
Clustered indices in general out-perform unclustered indices in range queries, because records with adjacent keys reside in adjacent pages, so far fewer pages need to be fetched.