Hyperparameter Sweep and NaNs #16

Open
JoakimHaurum opened this issue Dec 20, 2024 · 2 comments

@JoakimHaurum

I'm working on reproducing your results for EViT, ATS, DynamicViT, etc. However, I often run into NaNs about 1/3 to 1/2 of the way through training. It doesn't matter whether I preserve the prior features or scatter onto a zero matrix. I use the config from SViT with no adjustments to the optimizer.

Did you observe similar behavior, and what hyperparameters did you use to train the different models: just one fixed set (i.e. lr = 1e-5), or did you do a sweep per method?
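
For reference, a minimal sketch of the two variants mentioned above (the function and tensor names are illustrative, not taken from any of the repos):

import torch

def restore_token_map(x_prev, x_kept, keep_idx, keep_prior=True):
    # Re-expand a pruned token sequence back to its full length, either keeping
    # the pre-pruning features at dropped positions (keep_prior=True) or filling
    # those positions with zeros (keep_prior=False).
    #   x_prev:   (B, N, C) features before pruning
    #   x_kept:   (B, K, C) features of the kept tokens
    #   keep_idx: (B, K) indices of the kept tokens in the original sequence
    base = x_prev.clone() if keep_prior else torch.zeros_like(x_prev)
    idx = keep_idx.unsqueeze(-1).expand(-1, -1, x_kept.size(-1))
    return base.scatter(1, idx, x_kept)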

@kaikai23 commented Jan 2, 2025

Hi Joakim,
Thank you for your interest in reproducing our results! We did not encounter NaN issues during training. Below are some details that might help you debug the problem:

  1. We did not perform hyperparameter sweeps. For all experiments, we used a fixed learning rate (lr = 1e-5) and other default settings without adjustments.
  2. Here’s an example configuration we used for fine-tuning EViT:
# Copyright (c) Shanghai AI Lab. All rights reserved.
_base_ = [
    '../_base_/models/mask_rcnn_r50_fpn.py',
    '../_base_/datasets/coco_instance.py',
    '../_base_/schedules/schedule_0.5x.py',
    '../_base_/default_runtime.py'
]
# pretrained = 'https://dl.fbaipublicfiles.com/deit/deit_tiny_patch16_224-a1311bcf.pth'
# pretrained = 'pretrained/deit_tiny_patch16_224-a1311bcf.pth'
model = dict(
    backbone=dict(
        _delete_=True,
        type='EViTAdapter',
        patch_size=16,
        embed_dim=192,
        depth=12,
        num_heads=3,
        mlp_ratio=4,
        drop_path_rate=0.1,
        layer_scale=False,
        conv_inplane=64,
        n_points=4,
        deform_num_heads=6,
        cffn_ratio=0.25,
        deform_ratio=1.0,
        interaction_indexes=[[0, 2], [3, 5], [6, 8], [9, 11]],
        window_attn=[False] * 12,
        window_size=[None] * 12,
        pretrained=None,
        keep_rate=[1, 1, 1, 0.7, 1, 1, 0.7, 1, 1, 0.7, 1, 1],
        fuse_token=False
    ),
    neck=dict(
        type='FPN',
        in_channels=[192, 192, 192, 192],
        out_channels=256,
        num_outs=5))
# optimizer
img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
# augmentation strategy originates from DETR / Sparse RCNN
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True, with_mask=True),
    dict(type='RandomFlip', flip_ratio=0.5),
    dict(type='AutoAugment',
         policies=[
             [
                 dict(type='Resize',
                      img_scale=[(480, 1333), (512, 1333), (544, 1333), (576, 1333),
                                 (608, 1333), (640, 1333), (672, 1333), (704, 1333),
                                 (736, 1333), (768, 1333), (800, 1333)],
                      multiscale_mode='value',
                      keep_ratio=True)
             ],
             [
                 dict(type='Resize',
                      img_scale=[(400, 1333), (500, 1333), (600, 1333)],
                      multiscale_mode='value',
                      keep_ratio=True),
                 dict(type='RandomCrop',
                      crop_type='absolute_range',
                      crop_size=(384, 600),
                      allow_negative_crop=True),
                 dict(type='Resize',
                      img_scale=[(480, 1333), (512, 1333), (544, 1333),
                                 (576, 1333), (608, 1333), (640, 1333),
                                 (672, 1333), (704, 1333), (736, 1333),
                                 (768, 1333), (800, 1333)],
                      multiscale_mode='value',
                      override=True,
                      keep_ratio=True)
             ]
         ]),
    dict(type='RandomCrop',
         crop_type='absolute_range',
         crop_size=(1024, 1024),
         allow_negative_crop=True),
    dict(type='Normalize', **img_norm_cfg),
    dict(type='Pad', size_divisor=32),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels', 'gt_masks']),
]
data = dict(samples_per_gpu=4,
            workers_per_gpu=2, #####
            train=dict(pipeline=train_pipeline))
optimizer = dict(
    _delete_=True, type='AdamW', lr=0.00001, weight_decay=0.000001,
    paramwise_cfg=dict(
        custom_keys={
            'level_embed': dict(decay_mult=0.),
            'pos_embed': dict(decay_mult=0.),
            'norm': dict(decay_mult=0.),
            'bias': dict(decay_mult=0.)
        }))
optimizer_config = dict(grad_clip=None)
fp16 = dict(loss_scale=dict(init_scale=512))
checkpoint_config = dict(
    interval=1,
    max_keep_ckpts=3,
    save_last=True,
)

# work_dir = '/data/storage/yifei/output/work_dir/debug'
work_dir = '/net/cephfs/shares/rpg.ifi.uzh/yifei/output/work_dir/mask_rcnn_evit_adapter_tiny_fpn_0.5x_coco'
exp_name = 'det-evit-tiny-0.5x'

# load_from = '/data/storage/yifei/output/work_dir/mask_rcnn_deit_adapter_tiny_fpn_3x_coco/latest.pth'
load_from = '/net/cephfs/shares/rpg.ifi.uzh/yifei/output/work_dir/mask_rcnn_deit_adapter_tiny_fpn_3x_coco/latest.pth'
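
If NaNs still appear with an otherwise identical setup, two standard mmcv/mmdetection knobs worth experimenting with are gradient clipping and dynamic loss scaling. The snippet below only illustrates those options; it is not part of the config above:

# Illustrative only -- not used in the configuration above.
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
fp16 = dict(loss_scale='dynamic')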

Feel free to reach out with more details about your training setup if the issue persists!

Best regards,
Yifei

@JoakimHaurum

Thank you for the insights Yifei!

Comparing my config with yours, they are pretty much identical.
Could you share your EViTAdapter implementation? I assume you build on the original codebase (https://github.com/youweiliang/evit/blob/master/evit.py), and I think the major differences might just be in how the Adapter is set up.

Best
Joakim
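
For context on the keep_rate and fuse_token settings in the config above: EViT ranks image tokens by the [CLS] attention, keeps the top fraction, and can fuse the rest into one extra token. A simplified sketch (not the actual EViTAdapter code):

import torch

def evit_select_tokens(x, cls_attn, keep_rate, fuse_token=False):
    # Simplified EViT-style token pruning (illustrative only).
    #   x:        (B, N, C) tokens, with x[:, 0] being the [CLS] token
    #   cls_attn: (B, N-1)  attention of [CLS] to each image token, averaged over heads
    B, N, C = x.shape
    k = int((N - 1) * keep_rate)
    idx = cls_attn.topk(k, dim=1).indices                      # indices of kept tokens
    kept = x[:, 1:].gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
    if fuse_token:
        # Fuse the pruned tokens into a single extra token, weighted by attention.
        mask = torch.ones_like(cls_attn, dtype=torch.bool).scatter(1, idx, False)
        w = (cls_attn * mask).unsqueeze(-1)
        fused = (x[:, 1:] * w).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)
        return torch.cat([x[:, :1], kept, fused], dim=1)
    return torch.cat([x[:, :1], kept], dim=1)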
