Machine learning project for the TA position at Georgetown.
Greg Barbieri - @gfbarbieri
The goal of this project is to demonstrate knowledge and use of the data science pipeline, as well as machine learning models and the Yellowbrick library. I accomplish these goals by predicting the income categories of households participating in the Consumer Expenditure (CE) Interview survey, a survey conducted by the Census Bureau on behalf of the Bureau of Labor Statistics (BLS).
Three notebooks were drafted, each covering distinct portions of the data science pipeline: (1) ingestion and wrangling, (2) feature selection, and (3) model selection and evaluation including hyper-parameter tuning. The request was for a single notebook that covers the entire pipeline and demonstrates competency with ML models and the Yellowbrick library, so I created an abridged version of the three parts.
Part 1 demonstrates ingesting data over the web, using preprocessing techniques to encode a target variable, aggregating data to create features, using visual techniques to assess the quality of the data, and storing the data using WORM methodology.
Part 2 demonstrates assessing the relationship between features, specifically utilizing the Yellowbrick library to assess colinearity and class separation. This part demonstrates multiple approaches to scaling features using the sklearn library. Lastly, this notebook covers feature selection using feature importance and feature elimination techniques.
Part 3 demonstrates preprocessing, modeling using multiple classifiers, model evaluation using Yellowbrick's classification model visualizers, hyper-parameter tuning, and model output.
data: Data resulting from ingestion and wrangling process. Data is ingested from the CE PUMD website and stored in a SQLite database table.
notebooks: The final notebook covering all competencies is Predicting_Household_Income.ipynb. Groups of topics are covered in three parts: ingestion and wrangling, feature selection, and model selection and evaluation.
model: Encoding, scaling, and model parameters.
Note on Prediction: Prediction typically means using past data to estimate future data, where past, present, and future take their usual place in time. Since the data I am using to predict household income (expenditures) was created chronologically after the household income was earned or expected to be earned, a better phrase is that I am explaining, inferring, or imputing household income. In fact, using data produced choronologically after the data one is trying to predict may become a source of data leakage. That said, in the real world, data leakage is more concerned about when the data is available to the model, not when it is theoretically generated by the user. That is, "future data" is simply the data you wish you had and "current" or "past" data are the data you have available now to make a prediction.