Why do I get `'loss': 0.0, 'grad_norm': tensor(nan, device='cuda:0', dtype=torch.float64)` when fine-tuning llava-v1.5-7b with the DPO code from the LLaVA-NeXT repository? My training script is below, and I have already verified that my training dataset is fine.
```bash
export OMP_NUM_THREADS=8
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_IB_HCA=${ARNOLD_RDMA_DEVICE}
export NCCL_SOCKET_IFNAME=lo
export NCCL_DEBUG=INFO

VISION_MODEL_VERSION="openai/clip-vit-large-patch14-336"
VISION_MODEL_VERSION_CLEAN="${VISION_MODEL_VERSION////_}"

MID_RUN_NAME="llava-1.5-7b-dpo-v1"

############### Pretrain ################
# Stage 2
PROMPT_VERSION="v1"

# torchrun --nproc_per_node="${ARNOLD_WORKER_GPU}" --nnodes="${ARNOLD_WORKER_NUM}" --node_rank="${ARNOLD_ID}" --master_addr="${METIS_WORKER_0_HOST}" --master_port="${port_in_cmd}"
ACCELERATE_CPU_AFFINITY=1 torchrun --nproc_per_node=4 --nnodes=1 --node_rank="${RANK}" --master_addr=30.246.96.60 --master_port=23456 \
    llava/train/train_dpo.py \
    --deepspeed scripts/zero3.json \
    --model_name_or_path "/model_weight/liuhaotian--llava-v1.5-7b.main.4481d270cc22fd5c4d1bb5df129622006ccd9234" \
    --version $PROMPT_VERSION \
    --dpo_alpha 1.0 --beta 0.1 --gamma 0 \
    --data_path=processed_data \
    --mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
    --vision_tower ${VISION_MODEL_VERSION} \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --mm_spatial_pool_stride 2 \
    --mm_resampler_type "spatial_pool" \
    --mm_spatial_pool_out_channels 1024 \
    --group_by_modality_length True \
    --image_aspect_ratio pad \
    --bf16 True \
    --run_name $MID_RUN_NAME \
    --output_dir "llava1_5_dpo/${MID_RUN_NAME}" \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 3000 \
    --save_total_limit 1 \
    --learning_rate 5e-7 \
    --weight_decay 0. \
    --warmup_ratio 0.1 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 32768 \
    --gradient_checkpointing True \
    --dataloader_num_workers 16 \
    --lazy_preprocess True \
    --report_to "none" \
    --torch_compile True \
    --torch_compile_backend "inductor" \
    --dataloader_drop_last True
```
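For context on the symptom itself: my understanding is that the standard DPO objective, `-logsigmoid(beta * margin)` where `margin` is the difference of policy/reference log-ratios for the chosen and rejected responses, underflows to exactly 0.0 once the margin saturates, and a saturated margin usually means the log-probabilities themselves have blown up somewhere upstream (which would also explain the NaN grad_norm). A minimal numeric sketch in plain PyTorch with made-up log-probabilities; this is the textbook DPO loss, not the repository's exact trainer, which also carries the `dpo_alpha`/`gamma` terms:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Standard DPO objective:
    # -log sigmoid(beta * ((chosen log-ratio) - (rejected log-ratio)))
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(beta * margin)

# Healthy case: a modest margin gives a non-zero loss.
print(dpo_loss(torch.tensor(-10.0), torch.tensor(-12.0),
               torch.tensor(-11.0), torch.tensor(-11.0)))  # tensor(0.5981)

# Saturated case: if the log-probabilities blow up, beta * margin becomes
# huge and -logsigmoid underflows to exactly 0.0, so 'loss': 0.0 in the
# logs can mask a numerical problem earlier in the forward pass.
print(dpo_loss(torch.tensor(0.0), torch.tensor(-1e5),
               torch.tensor(-11.0), torch.tensor(-11.0)))  # tensor(0.)
```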
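In case it helps anyone reproduce or debug this: a generic plain-PyTorch way to localize where the non-finite gradient first appears is to register a gradient hook on every trainable parameter. The toy model below is purely illustrative; in practice the hooks would be attached to the LLaVA model right after it is built. Note that with DeepSpeed ZeRO-3 parameters and gradients are partitioned, so this sketch is easiest to use on a single-GPU repro without DeepSpeed:

```python
import torch
import torch.nn as nn

def register_nan_grad_hooks(model: nn.Module) -> None:
    """Report every parameter whose gradient contains NaN/Inf,
    so the blow-up can be traced to a specific module."""
    def make_hook(name):
        def hook(grad):
            if not torch.isfinite(grad).all():
                print(f"non-finite gradient in: {name}")
            return grad
        return hook

    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_hook(name))

# Toy usage (hypothetical model, just to show the mechanics):
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
register_nan_grad_hooks(model)
loss = model(torch.randn(4, 8)).sum() * float("nan")  # force a NaN
loss.backward()  # hooks print each parameter that received a NaN gradient
```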