The Data Scientist Nanodegree program covers topics such as building machine learning models, running data pipelines, designing experiments and recommendation engines, communicating effectively, and deploying data applications.
Project_1: The project consists of choosing a subject of interest, finding and analyzing the data, and writing a non-technical data science blog post, following the CRISP-DM methodology. The 2020 Stack Overflow Developer Survey data is explored.
In the first part, the salaries of data developers and other developers are compared using a Z-test for independent means. In the second part, a machine learning model based on a Random Forest classifier is used to predict job satisfaction for data developers. The work is done in a Jupyter Notebook; the code is written in Python 3 using NumPy and Pandas.
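A Z-test for two independent means can be sketched in a few lines of NumPy. This is a minimal illustration of the statistical test named above, not the project's actual code; the salary samples below are randomly generated stand-ins for the survey data.

```python
import numpy as np
from math import erf, sqrt

def two_sample_ztest(a, b):
    """Z-test for the difference between two independent sample means.

    Returns the z statistic and two-sided p-value under the normal
    approximation (appropriate for large samples such as survey salaries).
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    # Standard error of the difference in means (unpooled sample variances)
    se = sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    # Two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p

# Hypothetical salary samples: data developers vs. other developers
rng = np.random.default_rng(0)
data_dev = rng.normal(95_000, 20_000, 500)
other_dev = rng.normal(90_000, 20_000, 500)
z, p = two_sample_ztest(data_dev, other_dev)
print(f"z = {z:.2f}, p = {p:.4f}")
```

The same computation is available ready-made as `statsmodels.stats.weightstats.ztest`.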
Project_2: Given a large set of text documents (disaster messages), perform multi-label classification using supervised machine learning. The outcome is the list of categories that a message typed into the app belongs to.
A Random Forest classifier is used as a benchmark model. The final model is based on an AdaBoost classifier wrapped in a MultiOutputClassifier, tuned via grid search with cross-validation. The work is done in Jupyter notebooks using the Python data science libraries NumPy and Pandas; visualizations are created with Matplotlib and Plotly, and the text is analyzed with the NLTK NLP library. A Flask web app is created and deployed on the Heroku platform.
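The wrapping-and-tuning step can be sketched with scikit-learn. This is a schematic example of the technique described, assuming synthetic features in place of the project's TF-IDF text features; the parameter grid is illustrative only.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the vectorized disaster messages:
# X is a feature matrix, Y a binary indicator matrix (one column per category)
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=42
)

# One AdaBoost classifier per output category, tuned via grid search with CV
model = MultiOutputClassifier(AdaBoostClassifier(random_state=42))
params = {"estimator__n_estimators": [25, 50]}  # hypothetical grid
search = GridSearchCV(model, params, cv=3)
search.fit(X, Y)
print(search.best_params_)
```

`MultiOutputClassifier` simply fits one copy of the base estimator per category column, which is what makes a single-label learner like AdaBoost usable for multi-label output.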
Project_3: We analyze the interactions that users have with articles on the IBM Watson Studio platform and recommend new articles to them. The following recommenders are built: rank-based, user-user collaborative filtering, content-based, and matrix factorization.
The work is done in Jupyter notebooks using Python data science libraries, including scikit-learn; visualizations are created with Matplotlib, and the text is analyzed with the NLTK NLP library.
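The matrix factorization recommender can be sketched with a truncated SVD of the user-item matrix. This is a toy illustration of the idea, with a hypothetical 4x4 interaction matrix in place of the real IBM Watson Studio data.

```python
import numpy as np

# Toy user-item matrix: entry (i, j) = 1 if user i interacted with article j
interactions = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Full SVD, truncated to k latent features for the factorization
u, s, vt = np.linalg.svd(interactions)
k = 2
scores = u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

# Reconstructed scores approximate the interactions; for a given user,
# high-scoring articles they have not yet seen become recommendations
ranking_user0 = np.argsort(scores[0])[::-1]
print(ranking_user0)
```

In practice the rank k is chosen by measuring reconstruction accuracy on held-out interactions, trading off fit against overfitting.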
Project_4: We investigate and predict churn for a fictional music platform called Sparkify. This is a binary classification problem in which the algorithm has to identify which users are most likely to churn. The best-performing classifiers are a Multilayer Perceptron and a Gradient-Boosted Tree. The results are further improved with a stacking model that uses a linear regression meta-classifier.
The code is written in an Anaconda Jupyter Notebook with a Python 3 kernel. Additional libraries and modules used are PySpark, Pandas, NumPy, Matplotlib, and Seaborn. The model is trained on the full dataset on an AWS EMR cluster.
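The stacking idea can be sketched compactly in scikit-learn (the project itself runs on Spark ML, and its meta-model is a linear regression rather than the logistic regression used here). The features below are synthetic stand-ins for the engineered Sparkify user features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for user-level churn features (1 = user churned)
X, y = make_classification(n_samples=400, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# Stack the two best base learners under a meta-classifier: the meta-model
# learns from the base models' cross-validated predictions
stack = StackingClassifier(
    estimators=[
        ("mlp", MLPClassifier(max_iter=500, random_state=7)),
        ("gbt", GradientBoostingClassifier(random_state=7)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
print(f"test accuracy: {stack.score(X_test, y_test):.3f}")
```

Stacking tends to help when the base models make different kinds of errors, as a neural network and a tree ensemble often do.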