We'll be using a power consumption dataset to explore some clustering techniques.
The dataset holds measurements of electric power consumption in one household, sampled once per minute over a period of almost 4 years. Several electrical quantities and some sub-metering values are available.
This archive contains 2,075,259 measurements gathered in a house located in Sceaux (7 km from Paris, France) between December 2006 and November 2010 (47 months). Notes:
- (global_active_power*1000/60 - sub_metering_1 - sub_metering_2 - sub_metering_3) represents the active energy consumed every minute (in watt-hours) in the household by electrical equipment not measured in sub-meterings 1, 2 and 3.
- The dataset contains some missing values in the measurements (nearly 1.25% of the rows). All calendar timestamps are present in the dataset, but for some timestamps the measurement values are missing: a missing value is represented by the absence of a value between two consecutive semicolon attribute separators. For instance, the dataset shows missing values on April 28, 2007.
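The two notes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows are made up, the helper names `to_float` and `unmetered_wh` are hypothetical, and the code also accepts `?` as a missing-value marker in case the file you download uses one instead of an empty field.

```python
import csv
import io

# Made-up sample rows in the semicolon-separated layout described above.
# The second row has empty fields, i.e. missing measurements.
SAMPLE = """date;time;global_active_power;sub_metering_1;sub_metering_2;sub_metering_3
16/12/2006;17:24:00;4.216;0.0;1.0;17.0
28/04/2007;00:21:00;;;;
16/12/2006;17:25:00;5.360;0.0;1.0;16.0
"""


def to_float(field):
    """Return the field as a float, or None if it is missing ('' or '?')."""
    field = field.strip()
    return None if field in ("", "?") else float(field)


def unmetered_wh(row):
    """Active energy (Wh) used in one minute outside sub-meterings 1, 2 and 3."""
    columns = ("global_active_power", "sub_metering_1",
               "sub_metering_2", "sub_metering_3")
    values = [to_float(row[name]) for name in columns]
    if any(v is None for v in values):
        return None  # propagate missing measurements instead of guessing
    power_kw, sm1, sm2, sm3 = values
    # kW averaged over one minute -> Wh: times 1000 (W per kW), times 1/60 (h).
    return power_kw * 1000 / 60 - sm1 - sm2 - sm3


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
remainders = [unmetered_wh(row) for row in rows]

# Share of rows with at least one missing measurement.
missing_share = sum(r is None for r in remainders) / len(remainders)
```

Returning `None` for incomplete rows keeps the decision about dropping or imputing them separate from the parsing step.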
- date: date in format dd/mm/yyyy
- time: time in format hh:mm:ss
- global_active_power: household global minute-averaged active power (in kilowatt)
- global_reactive_power: household global minute-averaged reactive power (in kilowatt)
- voltage: minute-averaged voltage (in volt)
- global_intensity: household global minute-averaged current intensity (in ampere)
- sub_metering_1: energy sub-metering No. 1 (in watt-hour of active energy). It corresponds to the kitchen, containing mainly a dishwasher, an oven and a microwave (hot plates are not electric but gas powered).
- sub_metering_2: energy sub-metering No. 2 (in watt-hour of active energy). It corresponds to the laundry room, containing a washing-machine, a tumble-drier, a refrigerator and a light.
- sub_metering_3: energy sub-metering No. 3 (in watt-hour of active energy). It corresponds to an electric water-heater and an air-conditioner.
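Because the date and time come as two separate text fields in day-first order, they usually need to be combined into one timestamp before any time-based feature can be derived. A minimal sketch using only the standard library follows; the helper name `parse_timestamp` and the choice of features (hour of day, day of week) are assumptions for illustration.

```python
from datetime import datetime


def parse_timestamp(date_str, time_str):
    """Combine a dd/mm/yyyy date and an hh:mm:ss time into one datetime."""
    # The explicit format avoids reading 01/02/2007 as January 2nd.
    return datetime.strptime(f"{date_str} {time_str}", "%d/%m/%Y %H:%M:%S")


ts = parse_timestamp("16/12/2006", "17:24:00")

# Simple calendar features that a clustering analysis might use.
hour_of_day = ts.hour
day_of_week = ts.weekday()  # Monday is 0, Sunday is 6
```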
There are many different clustering methods, and they can produce different outputs depending on the characteristics of the data distribution. They take different parameters, and some need to know the number of clusters beforehand while others don't.
Also, there are no wrong answers when we're talking about clustering. It really depends on the goal of the analysis and on the business side of it. Sometimes the business experts will tell you the number of clusters they're after, or even identify and explain the clusters that your analysis unveiled.
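To make the point about parameters concrete, here is a small sketch comparing two scikit-learn estimators on toy data: `KMeans`, which must be told the number of clusters, and `DBSCAN`, which infers it from a neighbourhood radius and a minimum point count. The feature values are made up, and the particular parameter values (`n_clusters=2`, `eps=1.5`, `min_samples=3`) are assumptions chosen to suit this toy data, not recommendations for the real dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Toy feature matrix (made-up values): two tight groups of points, e.g.
# (global_active_power in kW, sub_metering_3 in Wh) for low and high usage.
X = np.array([
    [0.3, 0.0], [0.4, 1.0], [0.2, 0.0], [0.5, 1.0], [0.3, 1.0],
    [4.2, 17.0], [4.0, 18.0], [4.4, 17.0], [4.1, 16.0], [4.3, 18.0],
])

# KMeans needs the number of clusters up front.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN takes a neighbourhood radius (eps) and a minimum neighbour count
# instead, and labels points that fit no cluster as -1 (noise).
dbscan_labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)

# Number of clusters DBSCAN discovered, ignoring the noise label.
n_found = len(set(dbscan_labels) - {-1})
```

On real data the features would normally be scaled first, since both methods rely on distances and the columns here use different units.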