Final project for CS1710: Model Checking for the k-means clustering algorithm
This project models the functioning of the k-means clustering algorithm using z3
. Our work also includes the design of a custom visualization script that can be used to generate graphical plots that make it easy to interpret instances generated by the model. Not only does our code accurately model the functioning of the k-means clustering algorithm, but it also allows us to easily run property verifications to confirm the existence of situations in which the algorithm produces undesirable results.
The model can be run from the terminal using the run.py
file. The command line interface is defined as follows:
usage: SMT solver: Property Verifications of the k-Means Clustering Algorithm
[-h] [-i NUM_ITERS] [-p NUM_POINTS] [-c NUM_CENTERS] [-g GRID_LIMIT]
[--random_centers] [-prop PROPERTY]
optional arguments:
-h, --help show this help message and exit
-i NUM_ITERS, --num_iters NUM_ITERS
Number of iterations for which to run the algorithm
-p NUM_POINTS, --num_points NUM_POINTS
Number of datapoints to generate
-c NUM_CENTERS, --num_centers NUM_CENTERS
Number of clusters (and therefore, the number of cluster
centers)
-g GRID_LIMIT, --grid_limit GRID_LIMIT
board dimension (each axis goes from -grid_limit to
grid_limit)
--random_centers flag indicating whether or not to random initialize centers
-prop PROPERTY, --property PROPERTY
which property to verify (if any); must be one of
{'EMPTY_CENTER_EACH_ITERATION',
'OVERLAP_CENTER_EACH_ITERATION', 'OVERLAP_CENTER',
'EMPTY_CENTER'}
Therefore, for example, if we wanted to run the model with 15 datapoints, 5 centers, 4 iterations, a grid limit of 6, using random center assignment, and checking for the property where there is an empty center (empty cluster) at the end of the algorithm, we can run this:
python run.py -i 4 -p 15 -c 5 -g 6 --random_centers -prop EMPTY_CENTER
Beyond simply modeling the functioning of the k-means clustering algorithm, our model also allows users to run property verifications on the algorithm. Some of the properties we checked (and were able to find instances in which these conditions were met) were:
EMPTY_CENTER_EACH_ITERATION
: in each iteration of the algorithm, there is at least one center (at least one cluster) which is assigned no datapointsEMPTY_CENTER
: by the end of the algorithm (after the final iteration), there is at least one center (at least one cluster) which is assigned no datapointsOVERLAP_CENTER_EACH_ITERATION
: in each iteration of the algorithm, there are at least 2 centers that are at the same location (overlap)OVERLAP_CENTER
: by the end of the algorithm (after the final iteration), there are at least 2 centers that are at the same location (overlap)
Our model is a faithful representation of the k-means clustering algorithm, meaning we did not make any particular abstraction choices when choosing how to model it as we were able to fully implement it. Because of the nature of z3
, our model sacrificed efficiency and practicality (i.e. it is not compatible with extremely large datasets that one would want to use/train k-means clustering on) for the sake of having the ability to do property verification. Additionally, we also traded off completeness for soundness due to further limitations of z3
. The model is unable to truly search the entire space of instances (meaning we cannot verify completeness) due to issues caused by attempting to perform the algorithm with exclusively z3
variables; however, this was a necessary choice in order for our model to be a sound implementation of k-means, as we were able to resolve it by manually saving the initial z3
variable assigments and performing the calculations based on those for the rest of the algorithm (this was necessary due to the need to compare z3
's ArithRef
values with pythonic int
s). Therefore, if our model generates an instance, that instance will be correct, however, if it says that some set of constraints is unsatisfiable, that isn't necessarily true because the space of instances has not been exhaustively searched.
For the sake of documentation, it is also worth noting that we attempted to implement this algorithm in Forge; however, we quickly realized that the algorithm's reliance on calculations and arithmetic in general meant we would be far more successful (and able to produce larger, more meaningful instances in reasonable amounts of time) if we pivoted to z3
. We have saved our (very limited) Forge work, and it can be found in old_work/kmeans.frg
.
Once an instance is produced by the model, we have code that parses all the variables into usable datasturctures, which are fed to our custom visualization script (in visualizer.py
). This code generates a sequence of graphical plots (one for each iteration of the algorithm) showing the positions of the centers and datapoints. Centers are labeled on the graph as c_{center_num}
and are represented with a light-grey color. Datapoints are labeled with just a single number, which corresponds to the number of the center (cluster) that they are assigned to. Furthermore, each datapoint that is assigned to the same cluster is given the same color (but points assigned to different centers have different colors), which makes it easy to interpret how points are assigned to different clusters (centers) across iterations.
We were able to accomplish all of our Foundation and Target goals set out at the beginning of the project, as we were able to produce a working model of the algorithm that is verifiably correct in its execution, and we built a visualizer that aligns with what our ideal visualization had been. We achieved our target goal of being able to vary over choices of both n (number of data points) and k (number of centers) using a random assignment of centers.
Our reach goals involved trying different kinds of center assignments and to predict how these assignment heuristics impact the algorithm's output. While we didn't quite get to this, we did build an interface through which initial center values can be manually provided. This is currently being used to randomly assign initial center values (since this is how the traditional algorithm operates), however, it can be repurposed to allow any kind of manual center assignment with some trivial changes.