Github Actions Runner for training classifiers with CUDA acceleration using nvidia-docker and tensorflow core

Incuvers/train-classifier

Classifier Training Action

Modified: 2021-05

Navigation

  1. About
  2. Host Preconfiguration
  3. Usage
  4. Deployment
  5. License

CUDA Accelerated Machine Learning

The server enables remote classifier training jobs to be executed on a host with an NVIDIA CUDA-capable GPU with at least 4 GB of VRAM. The code uses nvidia-docker with the tensorflow/tensorflow:latest-gpu Docker image to mount and run a TensorFlow-powered Python training module with full access to the host's CUDA cores.
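As a rough illustration of the VRAM requirement, the sketch below checks whether any GPU on the host reports enough memory. It is not part of this repository: the helper names and the 4096 MiB threshold are assumptions, and the check relies on the `nvidia-smi --query-gpu=memory.total` query being available on the host. Cards marketed as 4 GB may report slightly less than 4096 MiB, so the threshold may need adjusting.

```python
import subprocess
from typing import List

# Assumed threshold: the 4 GB minimum from this README, expressed in MiB.
MIN_VRAM_MIB = 4096


def parse_vram_mib(smi_output: str) -> List[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi output."""
    return [int(line.strip()) for line in smi_output.splitlines() if line.strip()]


def has_enough_vram(smi_output: str, minimum: int = MIN_VRAM_MIB) -> bool:
    """Return True if at least one GPU reports `minimum` MiB or more."""
    return any(mib >= minimum for mib in parse_vram_mib(smi_output))


def query_vram() -> str:
    """Ask nvidia-smi for the total memory of each GPU, one value per line."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a GPU host, `has_enough_vram(query_vram())` would report whether the machine meets the stated minimum.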

The code in this repository is executed as defined by the action.yaml file in the repository root. Another repository's build-spec can invoke this action by pointing to it (see Action Usage). The action is not deployed to a server directly; instead, the GitHub Actions runner pulls it whenever the build-spec requires it. Subsequent updates to the action on the target branch are therefore pulled automatically, so the build server always runs the latest source code.

Action Usage

This action requires actions/checkout@v2 for access to the target repository's source code and actions/upload-artifact@v2 for pushing the training results to the action runner dashboard. The action copies the source code into its working directory, mounts it into the container, installs the necessary dependencies, and executes the module.
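The mount, install, and execute steps can be pictured as a single `docker run` invocation. The sketch below only assembles such a command; it is an illustration, not the contents of action.yaml. The `/app` mount point, the helper names, the use of `--gpus all`, and the assumption that apt-pkgs.txt lists one package per line are all hypothetical.

```python
from typing import List

IMAGE = "tensorflow/tensorflow:latest-gpu"  # image named in this README


def build_container_script(model: str) -> str:
    """Shell steps run inside the container: install deps, then run the module."""
    return " && ".join([
        "apt-get update",
        # assumed format: one apt package name per line; -r skips an empty file
        f"xargs -r -a {model}/apt-pkgs.txt apt-get install -y",
        f"pip install -r {model}/requirements.txt",
        f"python -m {model}",
    ])


def build_docker_command(workdir: str, model: str) -> List[str]:
    """Assemble a docker invocation that mounts `workdir` and trains `model`."""
    return [
        "docker", "run", "--rm",
        "--gpus", "all",            # expose the host GPUs to the container
        "-v", f"{workdir}:/app",    # hypothetical mount point
        "-w", "/app",
        IMAGE,
        "bash", "-c", build_container_script(model),
    ]
```

Returning the command as an argument list rather than a single string lets it be passed to `subprocess.run` without extra shell quoting on the host side.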

The target repository must have a filesystem schema similar to what is outlined below:

├── .github
│   └── workflows
│       └── train.yaml
├── .gitignore
├── README.md
├── artefacts
│   ├── .gitignore
│   └── README.md
├── mnist
│   ├── __init__.py
│   ├── __main__.py
│   ├── apt-pkgs.txt
│   ├── requirements.txt
│   └── sample
│       ├── __init__.py
│       └── runner.py
└── yolo
    ├── __init__.py
    ├── __main__.py
    ├── apt-pkgs.txt
    ├── requirements.txt
    └── sample
        ├── __init__.py
        └── runner.py
  ...

Here are the primary requirements:

  1. Self-contained Python modules (mnist and yolo), each with its own entry point (__main__.py) and apt/pip requirement files (apt-pkgs.txt and requirements.txt).
  2. An artefacts/ directory for writing the training results of each classifier model.
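The layout requirements above can be checked before a training job is queued. The following is a minimal sketch under that assumption; `missing_paths` and `REQUIRED_FILES` are hypothetical names, and the action itself may perform no such validation.

```python
from pathlib import Path
from typing import List

# Files each model directory contains in the layout shown in this README.
REQUIRED_FILES = ("__init__.py", "__main__.py", "apt-pkgs.txt", "requirements.txt")


def missing_paths(repo_root: Path, model: str) -> List[str]:
    """Return repo-relative paths the layout expects but that are absent."""
    missing = []
    model_dir = repo_root / model
    for name in REQUIRED_FILES:
        if not (model_dir / name).is_file():
            missing.append(f"{model}/{name}")
    if not (repo_root / "artefacts").is_dir():
        missing.append("artefacts/")
    return missing
```

An empty list means the repository satisfies both requirements for the given model; otherwise the list names what to add.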

Below is a sample model entry point for the yolo classifier:

import os
import logging
import requests
from pathlib import Path
from distutils.dir_util import copy_tree
from yolo.mnist.make_data import generate
from yolo.train import main

FILENAME = "yolov3-tiny.weights"
URL = "https://pjreddie.com/media/files/{}".format(FILENAME)

logging.basicConfig(
    format="%(asctime)s %(levelname)s server %(message)s",
    level=logging.DEBUG
)

def fetch_weights() -> None:
    """Download the pretrained weights into the module's model_data directory."""
    md_dir = str(Path(__file__).parent.joinpath("model_data"))
    if not os.path.exists(md_dir):
        # create the directory next to this file, not in the current working directory
        os.mkdir(md_dir)
    r = requests.get(URL)
    # fail early on an HTTP error instead of writing an error page to disk
    r.raise_for_status()
    logging.info("Downloaded weights from %s", URL)
    weights_path = "{}/{}".format(md_dir, FILENAME)
    with open(weights_path, 'wb') as f:
        f.write(r.content)
    logging.info("Wrote file to %s", weights_path)

# download weights, generate the training data, then train
fetch_weights()
generate()
main()

# copy artefacts to global path
src = str(Path(__file__).parent.joinpath("checkpoints"))
dest = str(Path(__file__).parent.parent.joinpath("artefacts"))
if not os.path.exists(dest):
    os.mkdir(dest)
copy_tree(src, dest)

The action takes a MODEL specifier corresponding to the name of the target Python module (the name of its containing directory). Below is a sample build-spec job that trains an mnist model using the train-classifier action:

train:
  name: train classifier
  runs-on: [ self-hosted, linux, docker, X64 ]
  steps:
    - name: checkout src
      uses: actions/checkout@v2
    - name: train classifier build action
      uses: Incuvers/train-classifier@master
      env:
        MODEL: mnist
        SLACK_IDENTIFIER: ${{ secrets.SLACK_IDENTIFIER }}
    - name: upload training artefacts
      uses: actions/upload-artifact@v2
      with:
        name: artefacts
        path: artefacts.tar.gz
        retention-days: 5
    - name: Notify
      run: |
        curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"Model training complete. Download the artefacts here: https://github.com/Incuvers/handwriting-recognition/actions/runs/$GITHUB_RUN_ID\"}" \
        https://hooks.slack.com/services/${{ secrets.SLACK_IDENTIFIER }}
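The upload step expects an artefacts.tar.gz file, but this README does not say how that archive is produced. One plausible approach, sketched here purely as an assumption, is to pack the artefacts/ directory into a gzipped tarball once training finishes; `pack_artefacts` is a hypothetical helper, not a function from this repository.

```python
import tarfile
from pathlib import Path


def pack_artefacts(artefacts_dir: Path, archive: Path) -> Path:
    """Write a gzipped tarball of `artefacts_dir` to `archive` and return its path."""
    with tarfile.open(str(archive), "w:gz") as tar:
        # keep the top-level directory name so the archive extracts to artefacts/
        tar.add(str(artefacts_dir), arcname=artefacts_dir.name)
    return archive
```

Calling `pack_artefacts(Path("artefacts"), Path("artefacts.tar.gz"))` from the repository root would produce a file matching the `path` given to the upload step.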

Train Classifier Deployment

To be implemented using Ansible. See Host Preconfiguration for host setup.

License

GNU General Public License v3
