This repository contains my Pix2Pix implementation that was used for a Kubeflow Pipelines demo at Google Next 2019.
There were several goals behind this project:
- Experimenting with image translation using GANs/Pix2Pix and TensorFlow
- "Benchmarking" a GPU datalab server at work
- Experimenting and developing expertise with the Kubeflow ecosystem (and sharing the results and promoting Kubeflow inside our organization at work)
When I first started developing this project, I focused a lot on the early stage of a Data Science project, when a Data Scientist wants to test new ideas without having to care where the code will be executed, and without any need to manually build Docker containers to package the code. Since then, this has become even easier with Kubeflow Fairing.
The main reasons why you should use Kubeflow are:
- It allows easy, repeatable, portable deployments on diverse infrastructure (laptop <-> ML rig <-> training cluster <-> production cluster)
- It makes the work easier and more efficient, so Data Scientists can iterate faster
- In the end, Data Scientists don't have to care about setting up the environment where the code will be executed
- "On-prem" cluster resources are not unlimited, and not always available
- Data Scientists don't want to spend time configuring infrastructure to run their experiments
- One of the best features of using GKE and Kubeflow together is that you can set up your GKE cluster with autoscaling. For example, you can add a GPU node pool to your Kubeflow cluster that scales GPU nodes on demand from 0 up to whatever limit you want (see the sketch after this list)
- This type of hybrid configuration makes the work very efficient and cost-efficient, as you can iterate faster while paying only for the extra GKE resources you need to expand an "on-prem" cluster/workstation
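For illustration, such a node pool can be added to an existing GKE cluster so that GPU nodes scale from 0 when no training job is running. This is only a sketch: the cluster name, zone, machine type, GPU type and size limits below are placeholders, not the exact values used for this demo.

```bash
# Sketch: add an autoscaling GPU node pool to an existing GKE cluster.
# Cluster name, zone, machine type, GPU type/count and max size are
# illustrative placeholders.
gcloud container node-pools create gpu-pool \
  --cluster=my-kubeflow-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-k80,count=1 \
  --num-nodes=0 \
  --enable-autoscaling --min-nodes=0 --max-nodes=4

# GPU nodes also need the NVIDIA driver DaemonSet documented by GKE (COS nodes)
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
```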
Pix2Pix is based on the paper "Image-to-Image Translation with Conditional Adversarial Networks".
See: arXiv:1611.07004v3 [cs.CV] by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros - Project Homepage
You might also be interested in more recent works:
- Pix2Pix TensorFlow 2.0 tutorial
- Pix2Pix HD "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs"
The maps dataset used contains 1096 training images scraped from Google Maps
Follow the setup instructions
This demo requires a Google Cloud Platform (GCP) project with an IAP-enabled cluster running on Google Kubernetes Engine (GKE). The Pix2Pix code requires Python 3.5, TensorFlow 1.12 and Kubeflow 0.5.0.
- Prerequisites: Have a GCP account with billing enabled
- Create a GCP project (don't forget to adjust the GPU quotas for the considered region)
- An endpoint protected by GCP IAP will be created for accessing Kubeflow. Follow these instructions to create an OAuth client, then enter the IAP OAuth Client ID and Secret
- Create a Kubeflow cluster on GKE, using the Google Cloud Deploy tool
- Perform additional setup for REMOTE Kubeflow pipeline execution
- Create a GCP service account with the necessary permissions, and add it as an 'IAP-secured Web App User' (see the example commands after this list)
- Create a storage bucket in Google Storage
- Add an NFS server running in GKE inside the Kubeflow namespace
- Create a GPU node pool with Autoscaling enabled in GKE
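For the bucket and service-account steps above, the commands look roughly like the following. This is a hedged sketch: the project ID, bucket name, service-account name and region are placeholders, not the exact values used for the demo; the IAP role binding corresponds to the 'IAP-secured Web App User' role.

```bash
# Placeholder project ID, bucket and service-account names
PROJECT_ID=my-gcp-project

# Storage bucket for datasets, checkpoints and pipeline artifacts
gsutil mb -p ${PROJECT_ID} -l us-central1 gs://${PROJECT_ID}-pix2pix/

# Service account used by the remote client
gcloud iam service-accounts create pix2pix-remote --project=${PROJECT_ID}

# Grant it access to the IAP-protected Kubeflow endpoint
# ("IAP-secured Web App User" = roles/iap.httpsResourceAccessor)
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:pix2pix-remote@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/iap.httpsResourceAccessor"
```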
June/July 2019: It seems that there are some issues with Kubeflow 0.5 when using IAP. In particular, I have noticed difficulties accessing Jupyter Notebooks or spawning a Jupyter server that mounts an NFS volume. Until the issue is solved, a workaround is provided to avoid using IAP.
- Include the instructions to set up the cluster for REMOTE Kubeflow pipeline execution (instead of an external reference)
- Rework the GKE cluster setup for deployment automation (DevOps for MLOps!)
- Investigate Pix2Pix HD
- Use a cleaner dataset (without traffic information...)
- Include a Kubeflow Fairing version
Set up your local environment and your GKE cluster, clone the repository or import the source code, and execute the notebooks!
Don't forget to customize your bucket URL, Kubeflow cluster endpoint and client_id in the notebooks.
- 1-Pix2Pix-local.ipynb is expected to be executed locally in a Jupyter notebook running on any kind of GPU server (without Kubeflow).
- 2-Pix2Pix-KFP.ipynb is expected to be executed inside Kubeflow's Jupyter notebook running on a GKE cluster (with a GPU node pool available).
- 3-Pix2Pix-KFP-REMOTE.ipynb is expected to be executed in a local Jupyter notebook instance (for instance running on a laptop without a GPU), and it will interact with a Kubeflow cluster running on GKE (see the client sketch below).
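For the REMOTE case, the connection from a local notebook to the Kubeflow Pipelines endpoint goes through the KFP SDK client, roughly along these lines. This is a sketch only: the host URL, OAuth client ID, experiment name and pipeline package name are placeholders, and the exact arguments used in the notebook may differ.

```python
import kfp

# Placeholders: replace with your Kubeflow deployment endpoint and the
# OAuth client ID created for IAP during the cluster setup.
KFP_HOST = 'https://<deployment-name>.endpoints.<project-id>.cloud.goog/pipeline'
IAP_CLIENT_ID = '<your-iap-oauth-client-id>.apps.googleusercontent.com'

# Connect to the IAP-protected Kubeflow Pipelines API from a local notebook
client = kfp.Client(host=KFP_HOST, client_id=IAP_CLIENT_ID)

# Example: submit a compiled pipeline package to the remote cluster
# (assumes 'pix2pix_pipeline.tar.gz' was produced by the KFP DSL compiler)
experiment = client.create_experiment('pix2pix')
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='pix2pix-training',
    pipeline_package_path='pix2pix_pipeline.tar.gz')
```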
Results after ~200k steps (200 epochs with a batch size of 1):
From Left to Right : Source Image / Generated ("Translated") Image / Target Image (ground truth)
In order to prevent "noise artifacts" from appearing in the generated images, and although this was not specified in the authors' paper or original Torch code, I added gradient clipping during the optimization steps, which solved the problem immediately.
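For reference, the change amounts to clipping gradients before they are applied. A minimal sketch in TensorFlow 1.x terms follows; the clip norm and optimizer hyperparameters shown are illustrative, not necessarily the exact values used in this repository.

```python
import tensorflow as tf

def make_train_op(loss, var_list, learning_rate=2e-4, clip_norm=1.0):
    """Build a training op that clips gradients by norm before applying them.

    The clip norm and optimizer hyperparameters are illustrative; the exact
    values used in this repository may differ.
    """
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate, beta1=0.5)
    grads_and_vars = optimizer.compute_gradients(loss, var_list=var_list)
    # Clip each gradient to suppress the exploding updates that produced
    # the noise artifacts in the generated images.
    clipped = [(tf.clip_by_norm(g, clip_norm), v)
               for g, v in grads_and_vars if g is not None]
    return optimizer.apply_gradients(clipped)
```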
Example of "Noise artifacts" that were appearing in the generated images before adding Gradient Clipping :
See Google Next 2019 breakout session: ML Ops Best Practices on Google Cloud (Apr 2019)
Creating an ML model is just a starting point. To bring it into production, you need to solve various real-world issues, such as building a pipeline for continuous training, automated validation of the model, scalable serving infrastructure, and supporting multiple environments in increasingly common hybrid and multi-cloud setups. In this session, we will learn the concept of "ML Ops" (DevOps for ML) and how to leverage various Google initiatives like TFX, Kubeflow Fairing (Hybrid ML SDK) and Kubeflow Pipelines to build and maintain production quality ML systems.
Speakers :
- Kaz Sato, Developer Advocate at Google Cloud
- Zia Syed, Sr. Engineering Manager at Google Cloud
- Robin Zondag, Global Head of Atos AI/ML Labs
- Fabien Da Silva, Artificial Intelligence Expert at Atos/Worldline
Thanks to the Google team for all their support during the preparation of the Google Next event (special thanks to Amy Unruh for the great support and tutorials!)