
Stops with OSError when running "higashi_model.train_for_imputation_nbr_0()" #42

yufanzhouonline opened this issue Aug 4, 2023 · 6 comments

@yufanzhouonline

Hi Ruochiz,

Higashi ran well without any errors on my CentOS 7 system when the resolution was 1M (the "resolution" option in the JSON config).

However, whenever I increased the resolution, no matter which value I used, "higashi_model.train_for_imputation_nbr_0()" always failed with the following OSError:

[ Epoch 42 of 45 ]
  - (Train) bce: 0.3479, mse: 0.0000, acc: 96.450 %, pearson: 0.571, spearman: 0.634, elapse: 97.359 s
  - (Valid) bce: 2.7560, acc: 97.025 %, pearson: 0.187, spearman: 0.635, elapse: 0.296 s
no improve: 1
[ Epoch 43 of 45 ]
  - (Train) bce: 0.3542, mse: 0.0000, acc: 96.321 %, pearson: 0.557, spearman: 0.633, elapse: 96.495 s
  - (Valid) bce: 3.0346, acc: 96.726 %, pearson: 0.123, spearman: 0.633, elapse: 0.355 s
no improve: 2
[ Epoch 44 of 45 ]
  - (Train) bce: 0.3466, mse: 0.0000, acc: 96.488 %, pearson: 0.599, spearman: 0.634, elapse: 99.352 s
  - (Valid) bce: 3.6074, acc: 97.016 %, pearson: 0.157, spearman: 0.636, elapse: 0.356 s
no improve: 3
(Validation) : 0%| | 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1367, in train_for_imputation_nbr_0
    self.train(
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 1141, in train
    valid_bce_loss, valid_accu, valid_auc1, valid_auc2, _, _ = self.eval_epoch(validation_data_generator)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/site-packages/higashi/Higashi_wrapper.py", line 994, in eval_epoch
    pool = ProcessPoolExecutor(max_workers=cpu_num)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/concurrent/futures/process.py", line 658, in __init__
    self._result_queue = mp_context.SimpleQueue()
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/context.py", line 113, in SimpleQueue
    return SimpleQueue(ctx=self.get_context())
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/queues.py", line 340, in __init__
    self._reader, self._writer = connection.Pipe(duplex=False)
  File "/data/yufan/biotools/anaconda/anaconda2023/envs/higashi/lib/python3.9/multiprocessing/connection.py", line 527, in Pipe
    fd1, fd2 = os.pipe()
OSError: [Errno 24] Too many open files

Would you please let me know the reason for this issue?

Thanks a lot.

Yufan (Harry) Zhou

@ruochiz (Collaborator) commented Aug 14, 2023

Hmm. Could you try re-running it with fewer CPU workers?

Or you can try increasing the maximum number of open files:

# Check current limit
$ ulimit -n
256

# Raise limit to 2048
# Only affects processes started from this shell
$ ulimit -n 2048

$ ulimit -n
2048
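
For reference, the same limit can also be raised from inside the Python session before training, using the standard library's resource module (Unix only). This is a minimal sketch, not part of Higashi; the 4096 target is illustrative:

import resource

# Current (soft, hard) limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit; an unprivileged process can never exceed the hard limit.
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))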

@yufanzhouonline (Author)

Hi Ruochi, thank you so much for your reply. I have increased the limit with "ulimit -n 4096" and am only using 8 CPUs on a 128-CPU server, but Higashi still fails with the same error as before. I also contacted Dr. Jian Ma for help, and he suggested I continue the discussion with you on GitHub. Would you please help me solve this issue? Thanks.

@ruochiz (Collaborator) commented Sep 24, 2023

Hmm. I must say this error is really strange, but it looks like it comes from how Python multiprocessing is handled by the Linux system. Do you notice memory being used up when the error appears? It's possible that the system is writing to the swap partition when running out of memory.
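
One way to check whether file descriptors (rather than memory) are being exhausted is to watch the process's open-descriptor count and resident memory across epochs. A minimal diagnostic sketch using psutil (a third-party package, not part of Higashi):

import os
import psutil  # third-party: pip install psutil

proc = psutil.Process(os.getpid())

# Open file descriptors held by this process (Unix only); if this climbs
# toward `ulimit -n` epoch after epoch, descriptors are leaking.
print("open fds:", proc.num_fds())

# Resident memory, to test the out-of-memory/swap hypothesis.
print("rss MB:", proc.memory_info().rss / 1e6)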

@seraphzl

Same issue.

$ ulimit -n
1048576

@seraphzl

The code sets a low value; you can change it to a larger one, or comment out this line to remove the limit:

resource.setrlimit(resource.RLIMIT_NOFILE, (3600, rlimit[1]))

Solved!
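
For context, that call caps the process's soft limit at 3600 open files regardless of the shell's `ulimit -n`, which explains why raising the shell limit had no effect. One possible patch is a sketch like the following, which raises the soft limit to the hard limit instead (rlimit is the same (soft, hard) tuple the quoted line already indexes):

import resource

# Original line: pins the soft limit to 3600, overriding any larger
# `ulimit -n` set in the shell.
# resource.setrlimit(resource.RLIMIT_NOFILE, (3600, rlimit[1]))

# Patched: raise the soft limit to whatever hard limit the OS allows.
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (rlimit[1], rlimit[1]))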

@ruochiz (Collaborator) commented Aug 29, 2024

Oh, I see, thanks for spotting this. I'll increase that in the code as well.
