
main_cell.py is very slow #39

Open
yangfeizZZ opened this issue Jun 8, 2023 · 5 comments

@yangfeizZZ

Hello,
When I run main_cell.py, it is very slow. I started running it on Monday this week, but as of now it has only gotten this far:

[ Epoch 29 of 80 ]

  • (Training) BCE: 0.348 MSE: 0.719 Loss: 0.349 norm_ratio: 0.00: 32%|▎| 321/1000 [44:21<1:30:05, 7.96s/it]

So I would like to know how to make it run faster. Thank you very much.

@ruochiz
Collaborator

ruochiz commented Jun 8, 2023

Hey, did you train the model on a GPU or a CPU, and what is the CPU/GPU utilization?
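
(Not Higashi-specific, just a generic PyTorch sanity check you could run to see whether a CUDA device is visible at all:)

    import torch

    # Does PyTorch see a CUDA device at all?
    print("CUDA available:", torch.cuda.is_available())
    print("Device count:", torch.cuda.device_count())
    if torch.cuda.is_available():
        print("Device 0:", torch.cuda.get_device_name(0))

While training is running, nvidia-smi will also report the GPU utilization; if it stays near 0%, the model is most likely running on the CPU.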

@yangfeizZZ
Author

I used a GPU, but it gave this error:

[ Epoch 38 of 60 ]

  • (Training) bce: 0.1953, mse: 0.0000, acc: 98.688 %, pearson: 0.943, spearman: 0.643, elapse: 152.854 s
  • (Validation-hyper) bce: 0.1811, acc: 99.596 %,pearson: 0.968, spearman: 0.646,elapse: 0.101 s
    no improve 4
    [ Epoch 39 of 60 ]
  • (Training) bce: 0.1946, mse: 0.0000, acc: 98.729 %, pearson: 0.944, spearman: 0.643, elapse: 148.983 s
  • (Validation-hyper) bce: 0.1793, acc: 99.619 %,pearson: 0.971, spearman: 0.648,elapse: 0.122 s
    no improvement early stopping
  • (Validation-hyper) bce: 0.1806, acc: 99.606 %, auc: 0.966, aupr: 0.647,elapse: 0.564 s
    Traceback (most recent call last):
      File "/home/yangfei/Higashi/higashi/main_cell.py", line 1472, in
        select_gpus[i])
    TypeError: 'NoneType' object is not subscriptable

@ruochiz
Collaborator

ruochiz commented Jun 9, 2023

Could you try running nvidia-smi -q -d Memory |grep -A4 GPU|grep Free and nvidia-smi -q -d Memory |grep -A4 GPU on your command line and see what they return? Higashi uses a hacky way to figure out how many GPUs you have, and it can be incompatible with some CUDA versions.

Also, what did you put in the 'gpu_num' parameter in the config.JSON file, and how many GPU cards do you have on that machine?
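
For context, the auto-detection boils down to parsing nvidia-smi's memory report and picking GPUs with free memory. The sketch below is illustrative only, not Higashi's actual code: it uses nvidia-smi's CSV query interface instead of the grep pipeline above, and the helper name guess_free_gpus is made up. It just shows how detection can end up returning None, which is exactly what makes an expression like select_gpus[i] fail:

    import subprocess

    def guess_free_gpus(min_free_mib=2000):
        """Return indices of GPUs with enough free memory, or None if detection fails."""
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=index,memory.free",
                 "--format=csv,noheader,nounits"],
                capture_output=True, text=True, check=True,
            ).stdout
        except (OSError, subprocess.CalledProcessError):
            return None  # nvidia-smi missing, or it exited with an error
        gpus = []
        for line in out.strip().splitlines():
            idx, free = [x.strip() for x in line.split(",")]
            if int(free) >= min_free_mib:
                gpus.append(int(idx))
        return gpus or None

    # If detection returns None (e.g. a driver/CUDA mismatch or an output format
    # the parser does not expect), any later indexing such as select_gpus[i]
    # raises: TypeError: 'NoneType' object is not subscriptable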

@yangfeizZZ
Author


I don't know what "no improvement early stopping" means. Does it mean the training completed successfully and therefore stopped?

@yangfeizZZ
Author


I set "gpu_num": 2, but it has same error. So I don't know when it means the end of training and can be visualized
