By the end of this unit, you should be able to
- explain functional dependencies found in data
- compute the closure of a set of functional dependencies
- compute the canonical cover of a set of functional dependencies
- identify candidate keys of a relation
One of the challenges in designing a database (not a DBMS) is to eliminate data anomalies.
The following are the most common kinds of data anomalies:
- update anomalies
- insert anomalies
- delete anomalies
From this point onwards, we treat the terms "relation" and "table" as interchangeable.
Let's recall the article-book-publisher example.
We reuse the running example from the earlier class, with one adjustment. Your client wants to record the date of publication whenever an article is published. Hence you naturally add a date attribute to the publish relationship.
Applying the same steps of ER-to-Relational translation, we have the following relations:
- Article(id, name)
- Book(id, name)
- Publisher(id, name)
- Publish(article_id, book_id, publisher_id, date)
We create the tables based on the above relational model and load the data. Specifically, the Publish table looks like the following:
article_id | book_id | publisher_id | date |
---|---|---|---|
a1 | b1 | p1 | 11/5/2019 |
a2 | b1 | p1 | 11/5/2019 |
a1 | b2 | p2 | 21/3/2020 |
Then we realized something awkward and verified its validity with the client: the date of publication is tied to the book being published. That means we will have duplicate dates in the Publish table.
This sounds a bit unfortunate, and we may shrug it off. However, something far more serious immediately follows.
- Suppose we would like to update the publish date of a book, say `b1`. We need to update all entries in the `Publish` table with `book_id` as `b1`. This is known as the update anomaly.
- Suppose we would like to insert a new book into the `Book` table. However, we have nowhere to record its date until we find an `article_id` and a `publisher_id` to insert that info into the `Publish` table. This is known as the insert anomaly.
- Suppose we would like to delete all entries from the `Publish` table with `book_id` as `b1`. After that, we lose the publishing date of `b1` entirely. This is known as the delete anomaly.
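The update anomaly can be reproduced concretely. Below is a minimal sketch using Python's built-in sqlite3 module and the Publish rows above; the new date value 12/5/2019 is made up for illustration.

```python
import sqlite3

# In-memory database holding the Publish table from the running example.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Publish (
        article_id TEXT, book_id TEXT, publisher_id TEXT, date TEXT
    )
""")
conn.executemany(
    "INSERT INTO Publish VALUES (?, ?, ?, ?)",
    [("a1", "b1", "p1", "11/5/2019"),
     ("a2", "b1", "p1", "11/5/2019"),
     ("a1", "b2", "p2", "21/3/2020")],
)

# Update anomaly: changing the publish date of book b1 must touch every
# row that mentions b1, not just one row.
cur = conn.execute(
    "UPDATE Publish SET date = '12/5/2019' WHERE book_id = 'b1'"
)
print(cur.rowcount)  # 2 rows were updated for a single logical change
```

If any of those rows were missed, the table would contradict itself about when b1 was published.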
In summary, improper relational model design results in data anomalies. To avoid them, we need a proper method to capture this kind of business requirement, e.g. that the publish date is tied to a book.
Functional dependencies define relationships among data in a relation. They give us a representation of business requirements like the one we encountered in the earlier example.
Formally speaking, a functional dependency is a constraint between two sets of attributes.
Let $R$ be a relation and let $X$ and $Y$ be two sets of attributes of $R$. A functional dependency (FD) is written as $X \rightarrow Y$.
We refer to $X$ as the left-hand side (LHS) and $Y$ as the right-hand side (RHS) of the FD.
Given a FD $X \rightarrow Y$, we say that $X$ functionally determines $Y$.
Let $t[X]$ denote the values of a tuple $t$ at the attributes $X$.
Formally speaking, an FD $X \rightarrow Y$ holds in $R$ iff for any tuples $t_1$ and $t_2$ in $R$, $t_1[X] = t_2[X]$ implies $t_1[Y] = t_2[Y]$.
For example, in the article-book-publisher example, for any tuples `t1` and `t2` in the `Publish` table, `t1[book_id] = t2[book_id]` implies `t1[date] = t2[date]`.
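This condition is easy to check mechanically on a given table instance. The following is a small sketch (the function name `fd_holds` and the dict-based row representation are our own choices); note that such a check can only refute an FD on that instance, never establish it as a business rule.

```python
def fd_holds(rows, lhs, rhs):
    """Check whether the FD lhs -> rhs holds on this table instance.

    rows: list of dicts mapping attribute name -> value.
    lhs, rhs: lists of attribute names.
    """
    seen = {}
    for t in rows:
        x = tuple(t[a] for a in lhs)
        y = tuple(t[a] for a in rhs)
        if x in seen and seen[x] != y:
            return False  # two tuples agree on lhs but differ on rhs
        seen[x] = y
    return True

publish = [
    {"article_id": "a1", "book_id": "b1", "publisher_id": "p1", "date": "11/5/2019"},
    {"article_id": "a2", "book_id": "b1", "publisher_id": "p1", "date": "11/5/2019"},
    {"article_id": "a1", "book_id": "b2", "publisher_id": "p2", "date": "21/3/2020"},
]
print(fd_holds(publish, ["book_id"], ["date"]))     # True on this instance
print(fd_holds(publish, ["date"], ["article_id"]))  # False: a1, a2 share a date
```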
Since we are applying the relational model to business problems, it is insufficient to observe and validate whether an FD holds in an instance of a relation (i.e. by observing the values in a table!). To verify the validity of FDs, we need to check with the domain experts, project stakeholders and end users. In other words, FDs must come from the business requirements, not be inferred from the data.
Given a relation $R$, we write $F$ for a set of FDs defined over $R$.
For instance, let $R(X,Y,Z)$ be a relation; $F = \{X \rightarrow Y, X \rightarrow Z\}$ and $F' = \{X \rightarrow YZ\}$ are two sets of FDs over $R$.
Some points to take note of:
- For brevity, we omit the commas when there is no confusion, i.e. $YZ$ is a shorthand for $Y,Z$.
- Informally, we say $F \equiv F'$ if $F$ and $F'$ impose the same set of constraints on the values in $R$.
Similarly, if we let $F'$ be a superset of $F$ such that $F' \equiv F$, then the extra FDs in $F'$ impose no new constraints.
Now we consider finding the greatest such superset of $F$, written $F^+$ and called the closure of $F$.
To compute $F^+$, we repeatedly apply the following three inference rules, known as Armstrong's Axioms.
Reflexivity: let $X$ and $Y$ be sets of attributes with $Y \subseteq X$; then $X \rightarrow Y$ holds.
For instance, {date} is a subset of {book_id, date}, thus we have book_id, date → date.
Augmentation: let $X \rightarrow Y$ be an FD and $Z$ be a set of attributes; then $XZ \rightarrow YZ$ holds.
For instance, given book_id → date, we have book_id, publisher_id → date, publisher_id.
Transitivity: let $X \rightarrow Y$ and $Y \rightarrow Z$ be FDs; then $X \rightarrow Z$ holds.
Given the above three rules, can you prove that $X \rightarrow YZ$ implies $X \rightarrow Y$ and $X \rightarrow Z$?
This derived rule is also known as the split rule.
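One possible derivation of $X \rightarrow Y$ takes only two steps with the axioms above:

```latex
\begin{align*}
X  &\rightarrow YZ && \text{(given)}\\
YZ &\rightarrow Y  && \text{(Reflexivity, since } Y \subseteq YZ\text{)}\\
X  &\rightarrow Y  && \text{(Transitivity on the two FDs above)}
\end{align*}
```

The same argument with $YZ \rightarrow Z$ gives $X \rightarrow Z$.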
- At the start, let $F^+ = F$.
- Pick one of the three Axioms to apply, to generate a FD, let's say $X\rightarrow Y$. Let $F^+ = F^+ \cup \{X\rightarrow Y\}$.
- Repeat step 2 until $F^+$ does not change any more.
For example, given a set of FDs $F$, we can tabulate the computation, recording at each step the newly generated FD and the rule applied:

step | new FD | rule |
---|---|---|
1 | … | Ref |
⋮ | ⋮ | ⋮ |
25 | … | Ref |

When no other rules are applicable, we are done.
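The fixed-point procedure above can be sketched in Python. This is a naive, exponential enumeration for illustration only; the pair-of-frozensets representation of FDs is our own choice.

```python
from itertools import combinations

def powerset(attrs):
    """All subsets of attrs, as frozensets."""
    s = sorted(attrs)
    return [frozenset(c) for r in range(len(s) + 1)
            for c in combinations(s, r)]

def closure(attrs, fds):
    """Compute F+ by applying Armstrong's Axioms until fixpoint.

    fds: set of (X, Y) pairs of frozensets, meaning X -> Y.
    """
    subsets = powerset(attrs)
    fplus = set(fds)
    # Reflexivity: Y subset of X implies X -> Y.
    for x in subsets:
        for y in subsets:
            if y <= x:
                fplus.add((x, y))
    changed = True
    while changed:
        changed = False
        new = set()
        # Augmentation: X -> Y implies XZ -> YZ.
        for (x, y) in fplus:
            for z in subsets:
                fd = (x | z, y | z)
                if fd not in fplus:
                    new.add(fd)
        # Transitivity: X -> Y and Y -> Z imply X -> Z.
        for (x, y) in fplus:
            for (y2, z) in fplus:
                if y == y2 and (x, z) not in fplus:
                    new.add((x, z))
        if new:
            fplus |= new
            changed = True
    return fplus

fds = {(frozenset("A"), frozenset("B")), (frozenset("B"), frozenset("C"))}
fplus = closure({"A", "B", "C"}, fds)
print((frozenset("A"), frozenset("C")) in fplus)  # True: derived by Transitivity
```

The loop terminates because both sides of every FD are subsets of a finite attribute set, so there are only finitely many FDs to generate.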
A canonical cover of $F$ is a minimal set of FDs that is equivalent to $F$.
The above statement is intuitive but not precise. To be precise, we need to define the standard form of FDs.
An FD is in standard form iff its RHS is a single attribute.
It follows that for any set of FDs, we can convert it into an equivalent set of standard form FDs. (Hint: we know $X \rightarrow YZ$ implies $X \rightarrow Y$ and $X \rightarrow Z$ by the split rule.)
Let $F_c$ be a canonical cover of $F$. Formally, $F_c$ must satisfy the following:
- All FDs in $F_c$ are in standard form; and
- $F_c^+ \subseteq F^+ \wedge F^+ \subseteq F_c^+$; and
- $\neg \exists G \subset F_c$ such that $G^+ \subseteq F^+ \wedge F^+ \subseteq G^+$.
To compute a canonical cover:
- Convert $F$ to standard form.
- Minimize the LHS of each FD, by applying Reflexivity, Augmentation and Transitivity.
- Remove redundant FDs, by applying Reflexivity, Augmentation and Transitivity.
For example, consider a set of FDs numbered (1), (2), (3) and so on.
First, we apply the split rule to (2).
Then we apply the Augmentation and Transitivity rules to (1) and (2) to minimize the LHS of rule (1).
Now we find that (2') is a duplicate of (1) and (4) is a duplicate of (2).
Finally, we find that (1) is derivable by applying Transitivity to (2) and (3).
The resulting set is minimal.
Note that the algorithm described above is non-confluent, i.e. depending on the order in which FDs are picked, a different canonical cover might be generated.
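The procedure can be sketched in Python. This is one possible implementation, not the unique one: it assumes the input is already in standard form (step 1 done), tests derivability via attribute closures rather than by applying the axioms directly, and, being order-dependent, returns just one of the possible canonical covers.

```python
def attr_closure(attrs, fds):
    """Closure of an attribute set under FDs (lhs frozenset, rhs attribute)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

def canonical_cover(fds):
    """One canonical cover of fds, a list of standard form FDs
    given as (frozenset_of_lhs_attrs, rhs_attr) pairs."""
    fds = list(dict.fromkeys(fds))  # drop exact duplicates, keep order
    changed = True
    while changed:
        changed = False
        # Minimize each LHS: drop an attribute if the remaining LHS
        # still determines the RHS.
        for i, (lhs, rhs) in enumerate(fds):
            for a in sorted(lhs):
                smaller = lhs - {a}
                if smaller and rhs in attr_closure(smaller, fds):
                    fds[i] = (smaller, rhs)
                    lhs = smaller
                    changed = True
        fds = list(dict.fromkeys(fds))
        # Remove redundant FDs: drop any FD derivable from the others.
        for i, (lhs, rhs) in enumerate(fds):
            rest = fds[:i] + fds[i + 1:]
            if rhs in attr_closure(lhs, rest):
                fds = rest
                changed = True
                break
    return fds

# Hypothetical input: A -> B, A -> C, B -> C, AB -> C (standard form).
F = [(frozenset("A"), "B"), (frozenset("A"), "C"),
     (frozenset("B"), "C"), (frozenset({"A", "B"}), "C")]
print(canonical_cover(F))  # only A -> B and B -> C remain
```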
The canonical cover is very useful. We can use it to reduce the number of constraints (which are expensive to verify). We also leverage the canonical cover to identify candidate keys for a relation.
In databases, we call a set of attributes of a relation a
- super key if it functionally determines all other attributes;
- candidate key if it is a minimal set of attributes that functionally determines all other attributes;
- primary key if it is one of the candidate keys (we just fix one).
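These definitions suggest a brute-force search: try attribute sets from smallest to largest and keep those whose closure covers the whole relation. Below is a sketch using an attribute-closure helper (our own construction), applied to the Publish relation from the running example with the FD book_id → date.

```python
from itertools import combinations

def attr_closure(attrs, fds):
    """Closure of an attribute set under FDs (lhs frozenset, rhs attribute)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and rhs not in result:
                result.add(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            cand = frozenset(combo)
            if any(k <= cand for k in keys):
                continue  # a subset is already a key, so cand is not minimal
            if attr_closure(cand, fds) == set(schema):
                keys.append(cand)
    return keys

schema = {"article_id", "book_id", "publisher_id", "date"}
fds = [(frozenset({"book_id"}), "date")]
print(candidate_keys(schema, fds))
# the only candidate key is {article_id, book_id, publisher_id}
```

Any superset of this key (e.g. all four attributes) is a super key but not a candidate key, since it is not minimal.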