
[BUG] eval_descriptor has a bug when used for MultiSystems #4533

Closed
QuantumMisaka opened this issue Jan 6, 2025 · 6 comments · Fixed by #4534

@QuantumMisaka

Bug summary

While using this script to generate descriptors with DeepPot.eval_descriptor:

import dpdata
from deepmd.infer.deep_pot import DeepPot
import numpy as np
import os
import gc
import glob
import logging

datadir = "./data-clean-v2-7-20873-npy"
modelpath = "./FeCHO-dpa231-v2-7-3heads-150w.pt"
savedir = "descriptors"

omp = 16
proc = 4
os.environ['OMP_NUM_THREADS'] = f'{omp}'

# Evaluate descriptors for one LabeledSystem, mapping its atom names onto the model's type map
def descriptor_from_model(sys: dpdata.LabeledSystem, model: DeepPot):
    coords = sys.data["coords"]
    cells = sys.data["cells"]
    model_type_map = model.get_type_map()
    type_trans = np.array([model_type_map.index(i) for i in sys.data['atom_names']])
    atypes = list(type_trans[sys.data['atom_types']])
    predict = model.eval_descriptor(coords, cells, atypes)
    return predict
#alldata = dpdata.MultiSystems.from_dir(datadir,datakey,fmt="deepmd/npy")
# Collect every deepmd/npy system directory (those containing set.*/coord.npy)
all_set_directories = glob.glob(os.path.join(
    datadir, '**', 'set.*'), recursive=True)
all_directories = set()
for directory in all_set_directories:
    coord_path = os.path.join(directory, 'coord.npy')
    if os.path.exists(coord_path):
        all_directories.add(os.path.dirname(directory))
all_directories = list(all_directories)

model = DeepPot(modelpath, head="Target_FTS")

logging.basicConfig(
    level=logging.INFO, 
    format='%(asctime)s - %(levelname)s - %(message)s',  
    datefmt='%Y-%m-%d %H:%M:%S'  
)

logging.info("Start Generating Descriptors")

if not os.path.exists(savedir):
    os.mkdir(savedir)

with open("running", "w") as fo:
    for onedir in all_directories:
        onedata = dpdata.LabeledSystem(onedir, fmt="deepmd/npy")
        key = onedata.short_name
        save_key = f"{savedir}/{key}"
        logging.info(f"Generating descriptors for {key}")
        if os.path.exists(save_key):
            if os.path.exists(f"{save_key}/desc.npy"):
                logging.info(f"Descriptors for {key} already exist, skip")
                continue
        else:
            os.mkdir(save_key)
        desc = descriptor_from_model(onedata, model)
        logging.info(f"Descriptors for {key} generated")
        
        np.save(f"{savedir}/{key}/desc.npy", desc)
        logging.info(f"Descriptors for {key} saved")

logging.info("All Done !!!")
os.system("mv running done")

A RuntimeError is raised after eval_descriptor succeeds for the first LabeledSystem:

2025-01-06 15:58:57 - INFO - Start Generating Descriptors
2025-01-06 15:58:57 - INFO - Generating descriptors for O0H6Fe48C8
2025-01-06 15:59:00 - INFO - Descriptors for O0H6Fe48C8 generated
2025-01-06 15:59:00 - INFO - Descriptors for O0H6Fe48C8 saved
2025-01-06 15:59:00 - INFO - Generating descriptors for O3H4Fe0C6
Traceback (most recent call last):
  File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-3h-100w/desc-gen/gen_desc.py", line 64, in <module>
    desc = descriptor_from_model(onedata, model)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/liuzq/FeCHO-dpa2/300rc0/v2-7-3h-100w/desc-gen/gen_desc.py", line 25, in descriptor_from_model
    predict = model.eval_descriptor(coords, cells, atypes)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/infer/deep_eval.py", line 445, in eval_descriptor
    descriptor = self.deep_eval.eval_descriptor(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/infer/deep_eval.py", line 658, in eval_descriptor
    descriptor = model.eval_descriptor()
                 ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/model/model/dp_model.py", line 66, in eval_descriptor
    def eval_descriptor(self) -> torch.Tensor:
        """Evaluate the descriptor."""
        return self.atomic_model.eval_descriptor()
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
  File "/home/mps/miniconda3/envs/deepmd-3rc0/lib/python3.11/site-packages/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 76, in eval_descriptor
    def eval_descriptor(self) -> torch.Tensor:
        """Evaluate the descriptor."""
        return torch.concat(self.eval_descriptor_list)
               ~~~~~~~~~~~~ <--- HERE
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 62 but got size 13 for tensor number 1 in the list.
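The mismatched sizes (62 vs. 13) correspond to the atom counts of O0H6Fe48C8 and O3H4Fe0C6, which suggests the per-call descriptors accumulated in eval_descriptor_list are not cleared between systems. A hedged interim workaround sketch (not from the issue; all_directories, modelpath, and descriptor_from_model are the names from the script above) is to rebuild the DeepPot object for each system so no stale tensors carry over:

# Interim workaround sketch (assumption): reload the model per system so the
# atomic model's accumulated descriptor list starts empty every time.
import dpdata
from deepmd.infer.deep_pot import DeepPot

for onedir in all_directories:
    model = DeepPot(modelpath, head="Target_FTS")  # fresh model, fresh internal state
    onedata = dpdata.LabeledSystem(onedir, fmt="deepmd/npy")
    desc = descriptor_from_model(onedata, model)   # helper defined in the script above
    del model                                      # drop the model before the next system

This trades model-loading time for correctness until the clearing fix lands.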

DeePMD-kit Version

DeePMD-kit v3.0.0rc0

Backend and its version

Pytorch 2.5.1

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

The model, dataset, and scripts used are available at
https://www.jianguoyun.com/p/DS0CkjUQrZ-XCRiyh-cFIAA (access code: 4Te2ER)

Steps to Reproduce

  • Extract the archive: tar -zxvf the provided .tar.gz file
  • Run gen_desc.py in a DeePMD-kit 3.0.0-rc0 environment with dpdata installed

Further Information, Files, and Links

No response

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Jan 6, 2025
njzjz linked a pull request Jan 6, 2025 that will close this issue
njzjz self-assigned this Jan 6, 2025
@QuantumMisaka
Author

@njzjz Thanks for your rapid reply!
A related question: when using DeepPot.eval_descriptor directly on a LabeledSystem with a large number of frames (> 2000) on a GPU, memory usage climbs above 40 GB and leads to an OOM error. Do you have any advice for controlling the memory consumption?
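
One way to bound memory is to evaluate a large system in chunks of frames and concatenate on the host. This is only a sketch under the assumption that repeated eval_descriptor calls on one model are safe, i.e. after the clearing fix discussed in this issue; on the unfixed version, repeated calls accumulate state. Only the coords/cells/atypes call signature already used in the script above is assumed:

# Sketch (assumption): evaluate descriptors chunk-by-chunk over frames so the
# GPU only ever sees `chunk` frames at a time.
import numpy as np

def descriptor_in_chunks(sys, model, chunk=200):
    coords = sys.data["coords"]
    cells = sys.data["cells"]
    model_type_map = model.get_type_map()
    type_trans = np.array([model_type_map.index(i) for i in sys.data["atom_names"]])
    atypes = list(type_trans[sys.data["atom_types"]])
    pieces = []
    for start in range(0, coords.shape[0], chunk):
        # Each call only sees `chunk` frames, so peak memory scales with the chunk size.
        pieces.append(model.eval_descriptor(coords[start:start + chunk],
                                            cells[start:start + chunk],
                                            atypes))
    return np.concatenate(pieces, axis=0)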

@njzjz

This comment has been minimized.

@QuantumMisaka
Author

@njzjz When using dp test, the GPU memory usage seems to adapt automatically. Would it be possible to expose this eval_descriptor functionality on the command line, the same way dp test works? A detailed suggestion is in #4503.

@njzjz
Member

njzjz commented Jan 6, 2025

Sorry, I just realized you mean a large number of frames, not atoms.

@njzjz
Member

njzjz commented Jan 6, 2025

The automatic batch size is used by eval. Both dp test and eval_descriptor call eval, so I believe the memory should be handled properly.
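
For reference, DeePMD-kit documents a DP_INFER_BATCH_SIZE environment variable for controlling the automatic batch size; whether and how it applies to this release is worth double-checking in the docs. A rough sketch of capping it before the model is built:

# Sketch (assumption): bound the automatic inference batch size via
# DP_INFER_BATCH_SIZE, set before the model is constructed.
import os
os.environ["DP_INFER_BATCH_SIZE"] = "1024"  # tune to the available GPU memory

from deepmd.infer.deep_pot import DeepPot
model = DeepPot("./FeCHO-dpa231-v2-7-3heads-150w.pt", head="Target_FTS")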

@QuantumMisaka
Author

@njzjz Thanks for your reply!
I'll test again after this bug is fixed and open another issue if the related OOM problem persists.

github-merge-queue bot pushed a commit that referenced this issue Jan 7, 2025
Fix #4533.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **Bug Fixes**
  - Improved list clearing mechanism in `DPAtomicModel` class
  - Enhanced test coverage for descriptor evaluation in `TestDeepPot`

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Jinzhe Zeng <[email protected]>
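
For context, based on the dp_atomic_model.py snippet shown in the traceback above, a "list clearing mechanism" presumably amounts to resetting eval_descriptor_list once it has been consumed. A rough, hypothetical illustration of that idea follows; it is not the actual patch, which is in #4534:

# Hypothetical illustration only: concatenate the accumulated per-call
# descriptors, then reset the buffer so the next system does not mix
# tensors with a different number of atoms.
import torch

def eval_descriptor(self) -> torch.Tensor:
    """Evaluate the descriptor."""
    descriptor = torch.concat(self.eval_descriptor_list)
    self.eval_descriptor_list = []
    return descriptor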
njzjz closed this as completed Jan 7, 2025