Most data-science projects have the same set of tasks:
- ETL: extracting data from its source, transforming it, then loading it into a database.
- Pre-process data: This might include imputing missing values and choosing the training and testing sets.
- Train the model(s): You can try different algorithms, features, and so on.
- Assess performance on the test set: Using an appropriate performance metric (e.g. AUC), examine how your model performs "out of sample."
- Think of new things to try: Repeat the previous steps as appropriate.
Often, by the time you build a couple dozen models, you're struggling to remember the details of each. What features did you use for each? What training and testing split? What hyperparameters?
Your code might be getting messy too. Did you overwrite the code for the previous model? Maybe you copied, pasted, and edited code from an earlier model. Can you still read what's there? It can quickly become a hodgepodge that requires heroics to decipher.
In this session, we will introduce the data pipeline, an approach that helps you simplify the modeling process.
A data pipeline is the code that handles all the computational tasks your project needs from beginning to end. The typical data pipeline is a set of functions strung together. Here's a simple example using scikit-learn's Boston dataset:
This pipeline has two steps. The first, which I call "preprocessing," prepares the data for modeling by creating training and testing splits. The second, which I call "models, predictions, and metrics," uses the preprocessed data to train models, make predictions, and print R² on the test set. The pipeline takes inputs (e.g. data, training/testing proportions, and model types) at one end and produces outputs (performance metrics) at the other end.
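The two-step structure described above can be sketched in a few dozen lines. This is a dependency-free illustration under stated assumptions, not the original example: it uses synthetic one-feature data in place of the Boston dataset and two hand-written models in place of scikit-learn estimators. All function names (`preprocess`, `models_predictions_metrics`, `fit_mean`, `fit_line`, and so on) are hypothetical.

```python
import random
from statistics import mean


def make_data(n=200, seed=1):
    """Synthetic stand-in data: y = 3x + 5 plus Gaussian noise (an assumption)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 3.0 * x + 5.0 + rng.gauss(0, 1)))
    return rows


def preprocess(rows, test_fraction=0.25, seed=0):
    """Step 1: shuffle the rows and split them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_mean(train):
    """Baseline model: always predict the mean training target."""
    avg = mean(y for _, y in train)
    return lambda x: avg


def fit_line(train):
    """One-feature least-squares line: y = slope * x + intercept."""
    x_bar = mean(x for x, _ in train)
    y_bar = mean(y for _, y in train)
    sxx = sum((x - x_bar) ** 2 for x, _ in train)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in train)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - y_bar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def models_predictions_metrics(train, test, fitters):
    """Step 2: train each model, predict on the test set, and score it."""
    actual = [y for _, y in test]
    scores = {}
    for name, fit in fitters.items():
        model = fit(train)
        predicted = [model(x) for x, _ in test]
        scores[name] = r_squared(actual, predicted)
    return scores


def run_pipeline(test_fraction=0.25):
    """Run both steps end to end: inputs in, test-set metrics out."""
    train, test = preprocess(make_data(), test_fraction=test_fraction)
    fitters = {"mean": fit_mean, "line": fit_line}
    return models_predictions_metrics(train, test, fitters)


if __name__ == "__main__":
    for name, score in run_pipeline().items():
        print(f"{name}: R^2 = {score:.3f}")
```

Because every model goes through the same `fitters` loop, trying another method means adding one entry to the dictionary; the split and the metric stay the same. With scikit-learn, the entries would typically be estimator objects with `fit` and `predict` methods rather than the hand-written functions assumed here.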
Obviously, this analysis is incomplete, but the pipeline is a good start. Because we use the same code and data, we can run the pipeline from beginning to end and get the same results. And because we split the pipeline into functions, we can identify where the pipeline goes wrong and improve the pipeline one function at a time. (Each function just needs to use the same inputs and outputs as before.)
Also note the function and loops in the second part of the pipeline. We're somewhat agnostic about the methods we use. If it works, great! This structure lets us loop through many types of models using the same preprocessed data and the same predictions and metrics. It makes adding new methods and comparing the results easier, and it helps us focus on other parts of the pipeline, such as feature generation.
Aren't pipelines super duper?
Our projects are far more complex than this Boston example, and our pipelines reflect that. Here's what a typical DSSG pipeline looks like:
The police pipeline, started at DSSG 2015, is an example of a relatively well-developed pipeline. It lets us specify the pipeline options we want in a YAML file, from preprocessing on. (The code in this repository does not include ETL.) It gives us many modeling options, and it makes comparisons easy.
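To illustrate what driving a pipeline from a configuration file can look like, here is a minimal sketch. It is not the police pipeline's actual schema: the keys, feature names, and model names are all hypothetical, and a plain Python dictionary stands in for the parsed YAML file so the example needs no extra libraries.

```python
from itertools import product

# Hypothetical options; in practice these might be loaded from a YAML file.
CONFIG = {
    "test_fraction": 0.25,
    "features": ["age", "income"],
    "models": {
        "ridge": {"alpha": [0.1, 1.0]},
        "tree": {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5]},
    },
}


def expand_experiments(config):
    """Turn one config into a list of fully specified experiments.

    Each model's hyperparameter grid is expanded into every combination,
    and each combination is paired with the shared preprocessing options.
    """
    experiments = []
    for model_name, grid in config["models"].items():
        names = sorted(grid)
        for values in product(*(grid[name] for name in names)):
            experiments.append({
                "model": model_name,
                "params": dict(zip(names, values)),
                "features": list(config["features"]),
                "test_fraction": config["test_fraction"],
            })
    return experiments


if __name__ == "__main__":
    for experiment in expand_experiments(CONFIG):
        print(experiment)
```

Keeping the options in one place means each experiment record states its features, split, and hyperparameters explicitly, which makes later comparisons easier to trace.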
Much of your work will revolve around and within your pipeline, but we have identified specific aspects for you to focus on each week:
- Our lead pipeline, started at DSSG 2014
- Our Cincinnati pipeline, started at DSSG 2015
- Triage (a generalized DSSG pipeline)
- Data Science Toolbox