This repository has been archived by the owner on Sep 26, 2023. It is now read-only.

different runs of k-means clustering result in different outputs #9

Open
ghost opened this issue Feb 25, 2015 · 6 comments
Comments

@ghost

ghost commented Feb 25, 2015

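var clusterfck = require("clusterfck");

// One-dimensional points, each wrapped in its own vector.
var colors = [
  [97],
  [1],
  [53],
  [79],
  [3],
  [351],
  [16]
];

// Ask for 3 clusters; the initial centroids are chosen randomly.
var clusters = clusterfck.kmeans(colors, 3);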

Result A: [1, 3, 16], [53, 79, 97], [351]
Result B: [1, 3, 16, 53], [79, 97], [351]

@bbroeksema

That's normal: k-means places the initial seeds (cluster centers) randomly, so each run starts from a different set of seed locations and can converge to (slightly) different outcomes. For a nice introduction to k-means and clustering, see http://web.cs.sunyit.edu/~mike/cs542/Jain50YearsBeyondKmeans.pdf
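
To illustrate where the randomness comes from, here is a minimal sketch of random seeding. `pickRandomCentroids` and its `random` parameter are hypothetical names used for illustration and are not taken from clusterfck's source; the library's actual seeding code may differ.

```javascript
// Hypothetical helper (not clusterfck's API): pick k distinct input points
// to serve as the initial centroids.
// `random` defaults to Math.random, which cannot be seeded, so two calls
// may return different seeds and lead k-means to different clusterings.
function pickRandomCentroids(points, k, random = Math.random) {
  if (k > points.length) {
    throw new RangeError("k must not exceed the number of points");
  }
  const pool = points.slice(); // copy so the caller's array is untouched
  // Partial Fisher-Yates shuffle: only the first k slots are needed.
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}

const colors = [[97], [1], [53], [79], [3], [351], [16]];
const seedsFirstRun = pickRandomCentroids(colors, 3);
const seedsSecondRun = pickRandomCentroids(colors, 3); // may differ from the first run
```

Passing a deterministic function as `random` (for example, a seeded generator) would make the seeds, and therefore the clustering, repeatable.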

@ghost
Author

ghost commented Mar 2, 2015

Thanks for the literature. However, this behaviour should be explicitly mentioned somewhere, because in other tools (e.g., R, Weka) the default k-means implementation can handle such cases.

@Ouwen

Ouwen commented Mar 2, 2015

How do R and Weka handle it? Do they use the same random seed for each run?

@bbroeksema

In R you can pass "centers", which is either the number of clusters (which results in similar nondeterministic behavior) or a set of actual, distinct initial cluster centers (in which case, I believe but have not actually checked, it behaves deterministically). I don't know about Weka.

@user24

user24 commented Jun 10, 2015

You could modify the kmeans function so that, instead of setting this.centroids = this.randomCentroids(...), it takes the centroids as an argument. That should allow different runs to produce the same results.
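
A rough sketch of that idea follows, written as a standalone function rather than a patch to clusterfck. `kmeansWithCentroids` and its helpers are hypothetical names, and the loop is a plain Lloyd-style iteration that assumes Euclidean distance; the library's internals may differ.

```javascript
// Squared Euclidean distance between two vectors of equal length.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const d = a[i] - b[i];
    sum += d * d;
  }
  return sum;
}

// Index of the centroid closest to `point` (ties go to the lower index).
function nearestIndex(point, centroids) {
  let best = 0;
  for (let i = 1; i < centroids.length; i++) {
    if (squaredDistance(point, centroids[i]) < squaredDistance(point, centroids[best])) {
      best = i;
    }
  }
  return best;
}

// Component-wise mean of a non-empty list of vectors.
function meanOf(points) {
  const m = new Array(points[0].length).fill(0);
  for (const p of points) {
    for (let i = 0; i < m.length; i++) m[i] += p[i] / points.length;
  }
  return m;
}

// Hypothetical k-means variant: the caller supplies the initial centroids,
// so the same input always yields the same clusters.
function kmeansWithCentroids(points, initialCentroids, maxIterations = 100) {
  let centroids = initialCentroids.map((c) => c.slice());
  let assignment = points.map((p) => nearestIndex(p, centroids));
  for (let iter = 0; iter < maxIterations; iter++) {
    // Move each centroid to the mean of its members (unchanged if it has none).
    centroids = centroids.map((c, ci) => {
      const members = points.filter((_, pi) => assignment[pi] === ci);
      return members.length > 0 ? meanOf(members) : c;
    });
    const next = points.map((p) => nearestIndex(p, centroids));
    const changed = next.some((a, i) => a !== assignment[i]);
    assignment = next;
    if (!changed) break; // converged
  }
  return centroids.map((_, ci) => points.filter((_, pi) => assignment[pi] === ci));
}

const colors = [[97], [1], [53], [79], [3], [351], [16]];
// Seeds 1, 53, 351 group the points as {1, 3, 16}, {53, 79, 97}, {351} (Result A).
const runA = kmeansWithCentroids(colors, [[1], [53], [351]]);
// Seeds 16, 97, 351 group the points as {1, 3, 16, 53}, {79, 97}, {351} (Result B).
const runB = kmeansWithCentroids(colors, [[16], [97], [351]]);
```

With fixed seeds each call is repeatable, and the two seed choices above reproduce the two outputs reported at the top of this issue.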

@tayden

tayden commented Feb 11, 2016

Often k-means is run multiple times, and for each run an error is calculated as the mean squared distance from each point to the centroid of the cluster it belongs to. You can then keep the clustering result that minimizes this error.
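
As a sketch of that selection step, the snippet below scores the two results reported at the top of this issue and keeps the one with the lower error. `meanSquaredError` and `pickBestClustering` are hypothetical helper names, and Euclidean distance is assumed; in practice the candidates would come from repeated kmeans runs.

```javascript
// Component-wise mean of a non-empty cluster of vectors.
function centroidOf(cluster) {
  const c = new Array(cluster[0].length).fill(0);
  for (const p of cluster) {
    for (let i = 0; i < c.length; i++) c[i] += p[i] / cluster.length;
  }
  return c;
}

// Mean squared Euclidean distance from each point to its cluster's centroid.
function meanSquaredError(clusters) {
  let total = 0;
  let count = 0;
  for (const cluster of clusters) {
    if (cluster.length === 0) continue; // empty clusters contribute nothing
    const c = centroidOf(cluster);
    for (const p of cluster) {
      for (let i = 0; i < p.length; i++) total += (p[i] - c[i]) ** 2;
      count++;
    }
  }
  return total / count;
}

// Return the candidate clustering with the smallest error.
function pickBestClustering(candidates) {
  return candidates.reduce((best, current) =>
    meanSquaredError(current) < meanSquaredError(best) ? current : best
  );
}

// The two outcomes from the original report.
const resultA = [[[1], [3], [16]], [[53], [79], [97]], [[351]]];
const resultB = [[[1], [3], [16], [53]], [[79], [97]], [[351]]];

// Result A has the lower error for this data, so it is the one selected.
const best = pickBestClustering([resultA, resultB]);
```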
