
Meetings

Butch Landingin edited this page Jul 9, 2020 · 1 revision


Meeting: 2020-07-06 9-11 PM PDT

Attendees: @butchland @tyoc213

Agenda:

  • brainstorm the roadmap and approaches; decide what we need to study
  • status

Discussion:

  • Status

    • Things running:
      • dataloaders(..., device=tpu) - moves the batch inputs and targets to the TPU device
      • wrapping opt_func with XLAOptFuncWrapper correctly calls xm.optimizer_step(self.opt) during opt.step()
      • calling cnn_learner(dls, ...) creates the model.parameters() on the TPU, because the code sets the model device to the dataloader device (which was previously set to the TPU)
      • calling learn.recorder.plot_loss() works.
    • Things that are funky or not running:
      • using batch_tfms=aug_transforms() in the dataloaders seems to slow training down.
      • using learner.fine_tune(1) causes the train and valid loss to go up after the unfreeze step
      • lr_find() shows a funky graph.
        • see this section
      • using ClassificationInterpretation and running most_confused() throws an index error.
      • @tyoc213: adding the TensorBoard callback and training also throws an index error
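The optimizer wrapping mentioned in the status above can be sketched generically. The names XLAOptFuncWrapper and xm.optimizer_step come from the notes; the code below is a hypothetical, device-free illustration of the delegation pattern (the XLA call is stubbed out so it runs anywhere), not the project's actual implementation:

```python
# Hypothetical sketch of the optimizer-wrapping pattern: intercept step()
# and route it through an XLA-aware call, delegating everything else.

class DummyOpt:
    """Stand-in for a torch optimizer (illustration only)."""
    def __init__(self):
        self.steps = 0
    def step(self):
        self.steps += 1

def xla_optimizer_step(opt):
    # Placeholder for torch_xla's xm.optimizer_step(opt), which marks the
    # XLA graph step and then performs the usual optimizer update.
    opt.step()

class XLAOptWrapper:
    """Routes step() through the XLA-aware call so the update happens
    on the TPU device; all other attributes fall through to the
    wrapped optimizer (zero_grad, param_groups, etc.)."""
    def __init__(self, opt):
        self.opt = opt
    def step(self):
        xla_optimizer_step(self.opt)
    def __getattr__(self, name):
        return getattr(self.opt, name)

opt = XLAOptWrapper(DummyOpt())
opt.step()
opt.step()
print(opt.steps)  # -> 2
```

In a real version, xla_optimizer_step would be replaced by torch_xla's xm.optimizer_step, and the wrapper would be what cnn_learner receives as opt_func.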
  • Next steps

    • Approach:
      • Don't go for a big-bang (plus lots of debugging) approach
      • Start with the simplest model and learner that is running on a TPU
        • then start adding more fastai stuff incrementally and at each step, retest
        • if a bug is found, fix it quickly before moving on.
      • Goal: start from the simplest stuff that runs, then iterate in small steps, but quickly.
      • Alongside iterating in rapid cycles, keep track of GPU baseline vs TPU performance.
        • slow TPU performance may indicate something funky in our implementation or in the fastai code vis-a-vis TPU.
      • Put off multi-core TPU support until later, once we get a single TPU core running well.
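The incremental approach above (start simple, add one feature at a time, retest at each step) can be sketched as a tiny harness. All names here are illustrative assumptions; run_smoke_test stands in for building a Learner with the trial configuration and running one quick fit/predict cycle:

```python
# Minimal sketch of "add fastai stuff incrementally and retest at each step":
# keep a known-good baseline configuration, try one new feature at a time,
# and stop at the first feature that breaks the smoke test.

def run_smoke_test(config):
    # Placeholder check: a real version would build a Learner from `config`
    # and run a short training cycle on the TPU.
    return "bad_feature" not in config

def add_incrementally(baseline, features):
    config = list(baseline)
    for feature in features:
        trial = config + [feature]
        if run_smoke_test(trial):
            config = trial  # feature is safe, keep it and continue
        else:
            print(f"regression introduced by: {feature}")
            break           # fix this before moving on
    return config

good = add_incrementally(["simple_model"], ["plot_loss", "bad_feature", "lr_find"])
print(good)  # -> ['simple_model', 'plot_loss']
```

The point of the loop is that when something breaks, exactly one change is under suspicion.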
    • Plans:
      • Next meeting schedule: 2020-07-09 6:00-8:30 PM PDT
      • In the meantime:
        • @tyoc213 to fork project
        • build a simple baseline model
          • run on GPU and measure loading/training/inference time for baseline GPU performance
          • simplest learner possible that runs on a TPU
            • no batch transforms
            • minimum training callbacks
            • simple and small dataset
            • simple model architecture
          • measure loading/training/inference time for baseline TPU performance
        • start looking at bugs
          • @butchland - look at the slow running of batch transforms; look at the increasing train and valid loss on unfreeze (use fit, not fit_one_cycle)
          • @tyoc213 - fork the project and start studying it
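For the baseline loading/training/inference measurements above, a simple device-agnostic timing helper is enough to compare GPU and TPU runs. This is a pure-Python sketch; the phase names are just labels, and a real benchmark would wrap the actual fastai loading, fit, and predict calls:

```python
# Sketch of a timing helper for recording per-phase wall-clock times
# (loading / training / inference), usable identically on GPU and TPU runs.

import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = time.perf_counter() - start

# Illustrative stand-ins for the real loading and training steps:
with timed("loading"):
    data = list(range(100_000))
with timed("training"):
    total = sum(data)

for phase, secs in timings.items():
    print(f"{phase}: {secs:.4f}s")
```

Collecting the same dictionary from a GPU run and a TPU run gives a direct side-by-side comparison, which is the signal the notes suggest for spotting funky TPU behavior.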