| ML.NET version | API type | Status | App Type | Data type | Scenario | ML Task | Algorithms |
|---|---|---|---|---|---|---|---|
| v0.11 | Dynamic API | Up-to-date | WinForms app | .csv files | Spike and Change Point Detection of Shampoo Sales | Anomaly Detection | IID Spike Detection and IID Change Point Detection |
Shampoo Sales Anomaly Detection is a simple application which builds and consumes time series anomaly detection models to detect spikes and change points in shampoo sales.
This is an end-to-end sample which shows how you can use ML.NET and anomaly detection in a WinForms application.
Note: This app is written in .NET Framework, so you must manually restore the NuGet packages before running the app.
- WinForms App:
  - Prompts the user to input a dataset file for anomaly detection (in this case we have provided `shampoo-sales.csv` that you can use)
  - Prompts the user to indicate if the data in the file is separated by commas or tabs
  - Prompts the user to indicate if they want to see spikes or change points in the data
  - Displays the data in a table format so that the user can inspect the data columns
  - Displays the data as a time series line graph
  - Loads the trained spike detection and change point detection models
  - Uses the trained models to detect and display the anomalies in a textual format, as markers in the line graph, and as highlighted rows in the data table
- Time Series Anomaly Detection Console App:
  - Builds and trains a time series anomaly detection model using the Shampoo Sales dataset for both spike detection and change point detection.
  - Uses confidence level and p-value as algorithm hyperparameters.
  - Uses `IidSpikeDetector` and `IidChangePointDetector`.
The `shampoo-sales.csv` dataset is from DataMart.
This problem is focused on finding spikes and change points in shampoo sales over a 3-year period, which can then help in analyzing trends or abnormal behavior in sales.
To solve this problem, we will build an ML model that takes as inputs:
- Date (Year 1 - 3 and Month)
- Number of shampoo sales
and will generate an alert if/where a spike or change point in shampoo sales is detected.
Anomaly detection is the process of detecting outliers in the data. Anomaly detection in time series refers to detecting time stamps, or points on a given input time series, at which the time series behaves differently from what was expected. These deviations are typically indicative of some events of interest in the problem domain: a cyber-attack on user accounts, power outage, bursting RPS on a server, memory leak, etc.
Anomalous behavior can be either persistent over time or just a temporary burst. There are two types of anomalies in this context: spikes and change points.
Spikes are attributed to sudden yet temporary bursts in the values of the input time series. In practice, they can happen due to a variety of reasons depending on the application: outages, cyber-attacks, viral web content, etc.
Change points mark the beginning of more persistent deviations in the behavior of the time series from what was expected. In practice, these types of changes are usually triggered by some fundamental change in the dynamics of the system. For example, in system telemetry monitoring, the introduction of a memory leak can cause a (slow) trend in the time series of memory usage after a certain point in time.
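To make the notion of a spike concrete, here is a minimal sketch in plain C# that flags values lying far from the mean of a trailing window. It is a naive z-score check, not the algorithm behind `IidSpikeDetector`; the class name `NaiveSpikeSketch`, the window size, and the threshold are assumptions chosen purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a naive z-score check, NOT ML.NET's IidSpikeDetector.
public static class NaiveSpikeSketch
{
    // Returns the indices whose value deviates from the mean of the preceding
    // `window` values by more than `threshold` standard deviations.
    public static List<int> FindSpikes(IReadOnlyList<double> series, int window, double threshold)
    {
        var spikes = new List<int>();
        for (int i = window; i < series.Count; i++)
        {
            // Statistics of the trailing window (the current point is excluded).
            double[] history = Enumerable.Range(i - window, window)
                                         .Select(j => series[j])
                                         .ToArray();
            double mean = history.Average();
            double std = Math.Sqrt(history.Select(v => (v - mean) * (v - mean)).Average());

            // Skip perfectly flat windows to avoid dividing by zero.
            if (std == 0) continue;

            double z = Math.Abs(series[i] - mean) / std;
            if (z > threshold) spikes.Add(i);
        }
        return spikes;
    }
}
```

A temporary burst shows up as a single flagged index, after which the series returns to its earlier level. A change point, by contrast, would shift the level of every later value, which this simple check is not designed to capture.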
To solve this problem, in your console app you build and train two ML models on existing data (shampoo sales) to demonstrate time series anomaly detection. You then use the models in the WinForms app, where the Prediction output columns provide the alerts indicating where the models predicted anomalies (spikes or change points in shampoo sales) in the dataset.
The process of building and training models is the same for spike detection and change point detection; the main difference is the algorithm that you use (`IidSpikeDetector` vs. `IidChangePointDetector`).
Building a model in the console app includes:
- Preparing and loading the data from `shampoo-sales.csv` into an IDataView.
- Creating an Estimator by choosing a trainer/learning algorithm (e.g. `IidSpikeDetector` or `IidChangePointDetector`) and setting parameters (in this case confidence level and p-value).
The initial code for Spike Detection is similar to the following:
```csharp
// Create MLContext object
var mlcontext = new MLContext();

// STEP 1: Common data loading configuration
IDataView dataView = mlcontext.Data.LoadFromTextFile<AnomalyExample>(
    path: filePath,
    hasHeader: true,
    separatorChar: commaSeparatedRadio.Checked ? ',' : '\t');

// STEP 2: Set up the training algorithm
string outputColumnName = nameof(AnomalyPrediction.Prediction);
string inputColumnName = nameof(AnomalyExample.numReported);
var trainingPipeline = mlcontext.Transforms.IidSpikeEstimator(outputColumnName, inputColumnName, confidenceLevel, pValue);
```
Training the model is the process of running the chosen algorithm on training data (with known anomaly values) to tune the parameters of the model. It is implemented in the `Fit()` API.
To perform training in the console app, you just call the `Fit()` method while providing the training dataset (the `shampoo-sales.csv` file) as an IDataView object:
```csharp
// STEP 3: Train the model by fitting the dataview
ITransformer trainedModel = trainingPipeline.Fit(dataView);
```
In the WinForms app, you load and use the trained model to predict anomalies in the data and then view the detected anomalies from the model by accessing the output column:
```csharp
var mlcontext = new MLContext();
ITransformer trainedModel;

// Load model
using (FileStream stream = new FileStream(modelPath, FileMode.Open, FileAccess.Read, FileShare.Read))
{
    trainedModel = mlcontext.Model.Load(stream);
}

// Apply data transformation to create predictions
IDataView transformedData = trainedModel.Transform(dataView);
var predictions = mlcontext.Data.CreateEnumerable<ShampooSalesPrediction>(transformedData, reuseRowObject: false);
```
Each `Prediction` in `predictions` is a vector containing three values:
- 0 = Alert (0 for no alert, 1 for an alert)
- 1 = Score (value where the anomaly is detected, e.g. number of sales)
- 2 = P-value (value used to measure how likely an anomaly is to be true vs. background noise)
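As a rough illustration of how the three slots listed above might be read, the sketch below walks a sequence of prediction rows and reports those whose alert slot is set. `PredictionRow` and `PredictionReader` are hypothetical stand-ins written for this example, not classes from the sample; the only thing assumed about the real prediction type is that it exposes a three-element `Prediction` vector.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the sample's prediction class; only the
// three-element Prediction vector described in the text is assumed.
public class PredictionRow
{
    public double[] Prediction { get; set; }
}

public static class PredictionReader
{
    // Returns one line of text per row whose Alert slot (index 0) equals 1.
    public static List<string> DescribeAlerts(IEnumerable<PredictionRow> predictions)
    {
        var lines = new List<string>();
        int rowIndex = 0;
        foreach (PredictionRow p in predictions)
        {
            double alert = p.Prediction[0];   // 0 = no alert, 1 = alert
            double score = p.Prediction[1];   // value at which the anomaly was detected
            double pValue = p.Prediction[2];  // likelihood measure vs. background noise

            if (alert == 1)
            {
                lines.Add($"Row {rowIndex}: alert, score={score}, p-value={pValue}");
            }
            rowIndex++;
        }
        return lines;
    }
}
```

In the WinForms app, the row index of each alert is what would let the same information be shown as a marker on the line graph and as a highlighted row in the data table.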