Skip to content

Latest commit

 

History

History
193 lines (140 loc) · 8.77 KB

README.md

File metadata and controls

193 lines (140 loc) · 8.77 KB

Graph k-colouring using MIMIC

This project is optimized for maximum readability and not performance. This is reinventing the wheel for educational purposes.

There are comments & docstrings throughout the code.

No external libraries are used, except the Fibonacci Heap for finding MST.

The following is an explanation of the algorithm. you can read it, jump straight into the code or read these

original paper

useful link

Wikipedia

Table of Contents
  1. ED Algorithms
  2. MIMIC's Modeling
  3. Eliciting Empirical Probabilities
  4. Sampling

ED Algorithms

Generally, given some cost function C, we want to find x with minimal C(x).

E.g. For the k-Colouring Problem, I have defined the cost function to be the number of bad edges, i.e. edges connecting same-colour vertices.

(Alternatively, we can maximize a utility function. without loss of generality, we will stick to the first problem description)

A model (density estimator) represents where we think the minimal C(x) is. At first, we don't know anything about C, so the model should represent a uniform distribution over all admissible solutions. We use the model to generate new candidate solutions (xs) by calling model.sample(). We evaluate C(x) to see if our guess was right or wrong. We then improve our guessing iteratively, learning about the distribution of C.

At each iteration:

  1. We use this model to generate more samples and grow our population (using β_μ).
  2. Candidate solutions with high (bad) C values are discarded and the fittest ones survive (using S).
  3. With this new population, our model could be updated to better estimate the distribution (using α_MIMIC).

This idea of guessing intelligently improves performance compared to Simulated Annealing and other alternatives which blindly walk on C.

Each EDA has a unique way of modeling with its own pros and cons.

MIMIC Modeling

Learning the structure of the function space helps us to better estimate, sample, and model C.

For example, in a 2-colouring problem, imagine a graph with only two neighbouring vertices. Possible valid solutions are 0-1 and 1-0, meaning that each vertex has an equal possibility to be coloured 0 and 1. A modeling strategy without learning the relationships would sample 0-0 and 0-1 with equal possibilities.

We want a good modeling strategy that learns about the structure of x's variables (x = [x_0, x_1]), e.g. knowing x_0 = 0, the possibility of sampling x_1 = 0 should decrease drastically.

helpful read: no free lunch theorem in search and optimization - an explanation of why optimizations can only improve on others if they build in certain assumptions about structure

But learning too much structure would cost us too much time, i.e. we will reach the goal in fewer iterations but each iteration would cost more time. It's all about balancing the two.

Learning the structure perfectly would generate the joint probability distribution like this:

p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_1, x_0)*p(x_3 | x_2, x_1, x_0) ...

Which means: the probability of x1 knowing that x_0 is whatever it is, the probability of x2 knowing that x_1, x_0 are whatever they are, the probability of x3 knowing that x_2, x_1, x_0 are whatever they are, and so on and so forth.

But this increases our computational complexity exponentially as len(x) grows

MIMIC solves this problem by limiting the learning, sacrificing perfect structure for performance.

MIMIC computes only pairwise conditional probabilities (using the current population), i.e. p(x_2 | x_1) instead of p(x_2 | x_1, x_0) and p(x_3 | x_2) instead of p(x_3 | x_2, x_1, x_0).

p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_1)*p(x_3 | x_2) ...

This decreases the accuracy of our estimation. the amount of decrease is rigorously measured by Kullback–Leibler divergence.

The divergence between the perfect joint probability distribution and MIMIC's estimation is inevitable, but we will try to minimize this divergence by finding the best permutation.

The Permutation

The permutation is the limited structure that our algorithm learns.

Some variables have more impact on some than on others.

In the k-Colouring Problem, neighbouring vertices are intuitively more impactful on each other. If one is blue, the other is definitely not blue. Mutually independent variables have zero impact on each other.

The idea is to compute p(x_i | x_j)s in an order in which x_j has the most impact on x_i.

so, instead of computing

p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_1)*p(x_3 | x_2) ...

we can compute (for example)

p(x) = p(x_0)*p(x_2 | x_0)*p(x_1 | x_2)*p(x_3 | x_2) ...

Given that x_0 and x_2 are more dependent on each other, this would increase the accuracy of our estimation.

The amount of impact two variables have on each other is rigorously measured by mutual information. It is a measure of the mutual dependence between the two variables. In other words, it quantifies the amount of information obtained about one random variable by observing the other random variable.

The Tree

What's better than a path/chain? A Tree!

What if x_0 is so impactful that we would be better with just

p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_0)*p(x_3 | x_0) ...

This would be a star in a Dependency Tree.

Dependency Trees are Bayesian Networks or Dependency Graphs where every node has exactly one parent (except the root, which has None).

This improves our estimation further without any loss in performance.

Instead of computing the best permutation, I have chosen to compute the best tree. It may be hard to believe, but it was easier for me to implement it.

Wrapping up

Mutual Information Maximizing (MIM):

To build a model from a sample population, we first create a Mutual Information Graph gmi: A complete graph with each edge weighting its ends' mutual information. We then compute the best tree by finding the Maximum Spanning Tree of gmi. Maximizing M.I. ≡ minimizing entropy ≡ minimizing KL divergence

Eliciting Empirical Probabilities

In the process of model-building, we must deduce the empirical probabilities for each variable x_i.

This can be easily computed as follows:

p(x_i = c) is the percentage of xs in our population with x_i = c

Sampling

Generating samples from our model (Dependency Tree) is as you would think.

We choose a colour for the root based on its empirical probability p(root).

For every other node v, we choose a colour based on its empirical conditional probability p(v | parent(v)).

note that parent(v)'s colour should be chosen before v itself. This constraint is satisfied with BFS, DFS, and a lot of other tree traversal algorithms. I chose DFS for the simplicity of implementation.