By the end of this unit, you should be able to
- apply lossless decomposition to a relation
- verify a relation is in 1NF/2NF/BCNF/3NF
- decompose a relation into 2NF/BCNF/3NF
By now, we know that an FD whose LHS is not the primary key of the relation causes data anomalies.
Recall the relation Publish(article_id, book_id, publisher_id, date)
with the following data
article_id | book_id | publisher_id | date |
---|---|---|---|
a1 | b1 | p1 | 11/5/2019 |
a2 | b1 | p1 | 11/5/2019 |
a1 | b2 | p2 | 21/3/2020 |
There exists an FD book_id $\rightarrow$ date, in which the LHS book_id is not the primary key (though it is part of the primary key).
To fix the issue, we need to decompose Publish into two smaller relations. But how?
The idea is to decompose it based on the FD. From the FD, we find that book_id determines date. We should move book_id and date into another relation. At the same time, we should keep book_id in Publish so that the relationship between article_id, book_id and publisher_id is not lost.
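Given a relation $R$, a decomposition of $R$ into $R_1$ and $R_2$ is lossless iff $R_1 \bowtie R_2 = R$.
To ensure a decomposition is lossless, we pick an FD constraint $X \rightarrow Y$ of $R$, then let $R_1 = R - Y$ and $R_2 = X \cup Y$.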
For instance, in the Publish relation above, we may decompose it by the FD book_id $\rightarrow$ date:
- Publish1(article_id, book_id, publisher_id)
- Publish2(book_id, date)
article_id | book_id | publisher_id |
---|---|---|
a1 | b1 | p1 |
a2 | b1 | p1 |
a1 | b2 | p2 |
book_id | date |
---|---|
b1 | 11/5/2019 |
b2 | 21/3/2020 |
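The following is a minimal sketch, not part of the original notes, that checks on the sample data that joining Publish1 and Publish2 on book_id gives back exactly the rows of Publish. The variable names and the `natural_join` helper are assumptions made for illustration.

```python
# Rows of Publish as (article_id, book_id, publisher_id, date) tuples.
publish = {
    ("a1", "b1", "p1", "11/5/2019"),
    ("a2", "b1", "p1", "11/5/2019"),
    ("a1", "b2", "p2", "21/3/2020"),
}

# Project onto the two smaller schemas; using sets drops duplicate rows.
publish1 = {(a, b, p) for (a, b, p, d) in publish}
publish2 = {(b, d) for (a, b, p, d) in publish}


def natural_join(r1, r2):
    """Join Publish1 and Publish2 on their shared attribute book_id."""
    return {
        (a, b, p, d)
        for (a, b, p) in r1
        for (b2, d) in r2
        if b == b2
    }


rejoined = natural_join(publish1, publish2)
print(rejoined == publish)  # True: no row is lost and no spurious row appears
```

This only confirms the property for one instance. The general guarantee comes from decomposing along an FD, which makes the shared attribute a key of one of the two relations.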
Note that this eliminates the data anomalies. (Eventually, we might merge Publish2 with the Book relation, which is a separate topic.)
The next question we need to consider is how far we should decompose a relation.
Normal forms define sets of criteria that allow us to check whether the result of a decomposition is good enough.
A relation is in 1NF iff its schema is flat (i.e. contains no sub-structure) and there is no repeating group (i.e. there is no repeating column).
For example, the following relations are not in 1NF.
student_id | name | phones |
---|---|---|
1234 | Peter Parker | [95598221, 82335354] |
This relation's schema is not flat.
student_id | name | phone1 | phone2 |
---|---|---|---|
1234 | Peter Parker | 95598221 | 82335354 |
This relation has a set of repeating columns, phone1 and phone2. (In practice we could be lenient here; for example, we could rename them to primary contact and secondary contact.)
A relation is in 2NF iff
- it is in 1NF and
- all non-key attributes are fully dependent on every candidate key.
In other words, the relation is at least in 1NF and there should be no partial dependency.
For example, in the running example, Publish(article_id, book_id, publisher_id, date) is in 1NF but not in 2NF, because the attribute date is not fully dependent on the primary key article_id,book_id. It depends on book_id alone, which is only part of the key.
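As a rough illustration, the sketch below tests whether an FD holds in the sample Publish rows. The helper `fd_holds` and the constant names are hypothetical. An FD holding in one instance is only evidence, since an FD is a constraint on every valid instance.

```python
COLUMNS = ("article_id", "book_id", "publisher_id", "date")
ROWS = [
    ("a1", "b1", "p1", "11/5/2019"),
    ("a2", "b1", "p1", "11/5/2019"),
    ("a1", "b2", "p2", "21/3/2020"),
]


def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    idx = {name: i for i, name in enumerate(COLUMNS)}
    seen = {}
    for row in rows:
        key = tuple(row[idx[c]] for c in lhs)
        val = tuple(row[idx[c]] for c in rhs)
        # setdefault stores val the first time key is seen, then returns it.
        if seen.setdefault(key, val) != val:
            return False
    return True


primary_key = ("article_id", "book_id")
print(fd_holds(ROWS, primary_key, ("date",)))       # True
print(fd_holds(ROWS, ("book_id",), ("date",)))      # True: part of the key suffices
print(fd_holds(ROWS, ("article_id",), ("date",)))   # False
```

The second result is the partial dependency: date is already determined by book_id, a proper subset of the primary key.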
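Given a relation $R$ and a set of FDs $F$, $R$ is in BCNF iff for every non-trivial FD $X \rightarrow Y \in F^+$, $X$ is a super key of $R$.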
An FD is trivial iff its lhs is a superset of the rhs.
For example,
- Publish1(article_id, book_id, publisher_id)
- Publish2(book_id, date)
are in BCNF, because the only non-trivial FDs are
1. article_id,book_id $\rightarrow$ publisher_id
2. article_id,publisher_id $\rightarrow$ book_id (recall the ER diagram)
3. book_id $\rightarrow$ date
Note that FD #2 does not violate the BCNF requirement, because article_id,publisher_id is a candidate key of Publish1, hence also a super key.
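The check above can be sketched in Python with an attribute-closure helper. This is an illustrative sketch: the functions `fd`, `closure` and `bcnf_violations` are assumed names, and the check only inspects the FDs it is given rather than the full closure of the FD set.

```python
def fd(lhs, rhs):
    """Build an FD from space-separated attribute names (hypothetical helper)."""
    return (frozenset(lhs.split()), frozenset(rhs.split()))


def closure(attrs, fds):
    """Attribute closure: everything determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Return the non-trivial FDs in fds whose LHS is not a super key of schema."""
    schema = set(schema.split())
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and closure(lhs, fds) != schema
    ]


fd1 = fd("article_id book_id", "publisher_id")
fd2 = fd("article_id publisher_id", "book_id")
fd3 = fd("book_id", "date")

print(bcnf_violations("article_id book_id publisher_id", [fd1, fd2]))  # []
print(bcnf_violations("book_id date", [fd3]))                          # []
# The original Publish relation is not in BCNF: book_id -> date violates it.
print(bcnf_violations("article_id book_id publisher_id date", [fd1, fd2, fd3]) == [fd3])  # True
```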
The proof is omitted. You are encouraged to try proving it.
Given a relation $R$ and a set of FDs $F$, the following algorithm decomposes $R$ into BCNF.
1. Compute $F^+$.
2. Let $Result = \{R\}$.
3. While some $R_i \in Result$ is not in BCNF, do
    1. Choose $X \rightarrow Y \in F^+$ such that $X$ and $Y$ are attributes in $R_i$ but $X$ is not a super key of $R_i$.
    2. Decompose $R_i$ into $R_{i1}$ and $R_{i2}$ with $X \rightarrow Y$.
    3. Update $Result = (Result - \{R_i\}) \cup \{R_{i1}, R_{i2}\}$.
1. def $normalize(R)$
    1. Let $C = attr(R)$
    2. find an attribute set $X$ such that $X^+ \neq X$ and $X^+ \neq C$.
        1. if $X$ is not found, then $R$ is in BCNF
        2. else
            1. decompose $R$ into $R_1(X^+)$ and $R_2((C - X^+) \cup X)$
            2. $normalize(R_1)$
            3. $normalize(R_2)$
2. $normalize(R)$
Consider
First we find all attribute closures.
- $A^+ = AD$
- $B^+ = B$
- $C^+ = CB$
- $D^+ = D$
- $AB^+ = ABCD$
- $ABC^+ = ABCD$
- ...
We find that $AB$ is a candidate key of $R$.
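At step 1.2, we found $A^+ = AD$, which is neither $A$ itself nor the full attribute set, so we decompose into $R_1(A,D)$ and $R_2(A,B,C)$.
$R_1$ is in BCNF. Within $R_2$, $C^+ = CB$, so we decompose again into $R_{21}(B,C)$ and $R_{22}(A,C)$.
Then we are done.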
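The recursive procedure can be sketched in Python as follows. The FD set used here ($A \rightarrow D$, $C \rightarrow B$, $AB \rightarrow C$) is an assumption chosen to be consistent with the closures listed above, and the function names are illustrative only.

```python
from itertools import combinations


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def normalize(attrs, fds):
    """Recursively split attrs until no X with X+ != X and X+ != attrs exists."""
    attrs = frozenset(attrs)
    for size in range(1, len(attrs)):
        for combo in combinations(sorted(attrs), size):
            x = frozenset(combo)
            # Restrict the closure to the attributes of the current relation.
            x_plus = closure(x, fds) & attrs
            if x_plus != x and x_plus != attrs:
                r1 = x_plus
                r2 = (attrs - x_plus) | x
                return normalize(r1, fds) + normalize(r2, fds)
    return [attrs]  # no violating X found: this relation is in BCNF


# Assumed FDs, consistent with A+ = AD, C+ = CB and AB+ = ABCD.
FDS = [
    (frozenset("A"), frozenset("D")),
    (frozenset("C"), frozenset("B")),
    (frozenset("AB"), frozenset("C")),
]

result = ["".join(sorted(r)) for r in normalize("ABCD", FDS)]
print(result)  # ['AD', 'BC', 'AC']
```

The search enumerates all attribute subsets, so it is exponential in the number of attributes. That is acceptable for classroom-sized schemas but not for wide tables.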
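Given a relation $R$ and a set of FDs $F$, $R$ is in 3NF iff for every non-trivial FD $X \rightarrow Y \in F^+$, either
- $X$ is a super key or
- $Y$ is part of a candidate key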
The following diagram shows some examples.
In the first diagram,
In the second diagram,
It can be proven from the definitions.
BCNF is easier to compute: we just keep finding an FD that violates the definition and keep decomposing until none is found.
Though BCNF decomposition is lossless, it is not always dependency preserving.
A FD set
Recall the previous example
Applying BCNF-decomposition will yield
With that difference in mind, we present the algorithm to compute 3NF as follows.
1. Apply the BCNF algorithm to decompose $R$; let's say the result is a set of relations $R_1, ..., R_n$.
2. Let $F_1,...,F_n$ be the list of FDs preserved by $R_1, ..., R_n$.
3. Compute $(F_1 \cup ... \cup F_n)^{+}$. Let $\bar{F} = F - (F_1 \cup ... \cup F_n)^{+}$.
4. For each $X_1,...,X_n \rightarrow Y \in \bar{F}$, create a new relation $R'(X_1,...,X_n,Y)$.
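The last three steps can be sketched as below. Instead of materializing each $F_i$, the sketch uses a standard membership test: an FD $X \rightarrow Y$ is preserved if $Y$ can be reached from $X$ by repeatedly taking closures restricted to one sub-relation at a time. The FD set and the BCNF result `["AD", "BC", "AC"]` are assumptions for illustration, and all function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def preserved(fd, relations, fds):
    """Check whether fd follows from the FDs projected onto the given relations."""
    lhs, rhs = fd
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for r in relations:
            # Attributes of r derivable from what we already have inside r.
            gained = (closure(result & r, fds) & r) - result
            if gained:
                result |= gained
                changed = True
    return rhs <= result


def add_lost_fds(relations, fds):
    """For each FD that is not preserved, add a relation over its attributes."""
    extra = [lhs | rhs for (lhs, rhs) in fds if not preserved((lhs, rhs), relations, fds)]
    return list(relations) + extra


# Assumed example: FDs A -> D, C -> B, AB -> C and a BCNF result AD, BC, AC.
FDS = [
    (frozenset("A"), frozenset("D")),
    (frozenset("C"), frozenset("B")),
    (frozenset("AB"), frozenset("C")),
]
bcnf_result = [frozenset("AD"), frozenset("BC"), frozenset("AC")]

final = ["".join(sorted(r)) for r in add_lost_fds(bcnf_result, FDS)]
print(final)  # ['AD', 'BC', 'AC', 'ABC']
```

In this assumed example only $AB \rightarrow C$ is lost by the BCNF decomposition, so a single extra relation over $A, B, C$ is added.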
For example, recall the previous example
After the BCNF decomposition, we realize
Alternatively, we could have used the BCNF algorithm but do not decompose