[BUG] (tf backend) Print NAN when training dipole model with reference data #4536

Closed
ChiahsinChu opened this issue Jan 6, 2025 · 7 comments · Fixed by #4538

@ChiahsinChu
Contributor

ChiahsinChu commented Jan 6, 2025

Bug summary

When training a dipole model with the TensorFlow backend of DeePMD-kit v3, the atomic dipole RMSE (the rmse_lc_* columns in lcurve.out) is printed as nan even though atomic reference data is provided.

DeePMD-kit Version

DeePMD-kit v3.0.0

Backend and its version

TensorFlow v2.18.0-rc2-4-g6550e4bd802

How did you download the software?

pip

Input Files, Running Commands, Error Log, etc.

Input file (deepmd-kit/examples/water_tensor/dipole/dipole_input.json, but using datasets that contain atomic dipoles only):

{
  "_comment1": " model parameters",
  "model": {
    "type_map": [
      "O",
      "H"
    ],
    "descriptor": {
      "type": "se_e2_a",
      "sel": [
        46,
        92
      ],
      "rcut_smth": 3.80,
      "rcut": 4.00,
      "neuron": [
        25,
        50,
        100
      ],
      "resnet_dt": false,
      "axis_neuron": 6,
      "type_one_side": true,
      "precision": "float64",
      "seed": 1,
      "_comment2": " that's all"
    },
    "fitting_net": {
      "type": "dipole",
      "sel_type": [
        0
      ],
      "neuron": [
        100,
        100,
        100
      ],
      "resnet_dt": true,
      "precision": "float64",
      "seed": 1,
      "_comment3": " that's all"
    },
    "_comment4": " that's all"
  },
  "learning_rate": {
    "type": "exp",
    "start_lr": 0.01,
    "decay_steps": 5000,
    "_comment5": "that's all"
  },
  "loss": {
    "type": "tensor",
    "pref": 0.0,
    "pref_atomic": 1.0,
    "_comment6": " that's all"
  },
  "_comment7": " training controls",
  "training": {
    "training_data": {
      "systems": [
        "./training_data_reformat/atomic_system"
      ],
      "batch_size": "auto",
      "_comment8": "that's all"
    },
    "validation_data": {
      "systems": [
        "./validation_data_reformat/atomic_system"
      ],
      "batch_size": 1,
      "numb_btch": 3,
      "_comment9": "that's all"
    },
    "numb_steps": 2000,
    "seed": 10,
    "disp_file": "lcurve.out",
    "disp_freq": 100,
    "save_freq": 1000,
    "_comment10": "that's all"
  },
  "_comment11": "that's all"
}

Running commands:

dp train dipole_input.json

Error log:
lcurve.out:

# step       rmse_val    rmse_trn   rmse_lc_val rmse_lc_trn   lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
      0      0.00e+00    0.00e+00           nan         nan    1.0e-02
    100      0.00e+00    0.00e+00           nan         nan    5.0e-03
    200      0.00e+00    0.00e+00           nan         nan    2.5e-03
    300      0.00e+00    0.00e+00           nan         nan    1.3e-03
    400      0.00e+00    0.00e+00           nan         nan    6.3e-04
    500      0.00e+00    0.00e+00           nan         nan    3.2e-04
    600      0.00e+00    0.00e+00           nan         nan    1.6e-04
    700      0.00e+00    0.00e+00           nan         nan    7.9e-05
    800      0.00e+00    0.00e+00           nan         nan    4.0e-05
    900      0.00e+00    0.00e+00           nan         nan    2.0e-05
   1000      0.00e+00    0.00e+00           nan         nan    1.0e-05
   1100      0.00e+00    0.00e+00           nan         nan    5.0e-06
   1200      0.00e+00    0.00e+00           nan         nan    2.5e-06
   1300      0.00e+00    0.00e+00           nan         nan    1.3e-06
   1400      0.00e+00    0.00e+00           nan         nan    6.3e-07
   1500      0.00e+00    0.00e+00           nan         nan    3.2e-07
   1600      0.00e+00    0.00e+00           nan         nan    1.6e-07
   1700      0.00e+00    0.00e+00           nan         nan    7.9e-08
   1800      0.00e+00    0.00e+00           nan         nan    4.0e-08
   1900      0.00e+00    0.00e+00           nan         nan    2.0e-08
   2000      0.00e+00    0.00e+00           nan         nan    1.0e-08

For comparison, lcurve.out from pt backend in the same setup:

# step    rmse_global_dipole_val rmse_global_dipole_trn   rmse_local_dipole_val rmse_local_dipole_trn   lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
      1           nan         nan      4.69e+00    1.28e+00    1.0e-02
    100           nan         nan      3.10e-02    5.23e-02    1.0e-02
    200           nan         nan      3.14e-02    4.91e-02    5.0e-03
    300           nan         nan      2.94e-02    4.56e-02    2.5e-03
    400           nan         nan      2.75e-02    4.70e-02    1.3e-03
    500           nan         nan      2.75e-02    4.36e-02    6.3e-04
    600           nan         nan      2.75e-02    4.43e-02    3.2e-04
    700           nan         nan      2.74e-02    4.06e-02    1.6e-04
    800           nan         nan      2.71e-02    4.39e-02    7.9e-05
    900           nan         nan      2.73e-02    4.55e-02    4.0e-05
   1000           nan         nan      2.86e-02    4.03e-02    2.0e-05
   1100           nan         nan      2.82e-02    4.62e-02    1.0e-05
   1200           nan         nan      2.73e-02    4.28e-02    5.0e-06
   1300           nan         nan      2.86e-02    4.69e-02    2.5e-06
   1400           nan         nan      2.75e-02    4.50e-02    1.3e-06
   1500           nan         nan      2.69e-02    4.39e-02    6.3e-07
   1600           nan         nan      2.74e-02    4.50e-02    3.2e-07
   1700           nan         nan      2.62e-02    4.64e-02    1.6e-07
   1800           nan         nan      2.75e-02    4.25e-02    7.9e-08
   1900           nan         nan      2.79e-02    4.04e-02    4.0e-08
   2000           nan         nan      2.72e-02    4.25e-02    2.0e-08
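The two logs can be compared mechanically. The following is a minimal sketch, not part of DeePMD-kit, that reports which columns of an lcurve.out-style table contain only nan; the helper name `all_nan_columns` and the shortened sample text are assumptions made for illustration. With `pref` set to 0.0 and `pref_atomic` set to 1.0, the atomic columns are the ones expected to hold numbers, so finding `rmse_lc_*` (which appears to correspond to the pt backend's `rmse_local_dipole_*`) in the result reproduces the symptom reported here.

```python
import math


def all_nan_columns(lcurve_text):
    """Return the names of columns whose values are all nan.

    The first line starting with '#' is treated as the header; any later
    '#' lines are treated as comments and skipped.
    """
    header = None
    columns = {}
    for raw_line in lcurve_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if header is None:
                header = line.lstrip("#").split()
                columns = {name: [] for name in header}
            continue
        if header is None:
            raise ValueError("data row found before the header line")
        for name, field in zip(header, line.split()):
            columns[name].append(float(field))  # float("nan") parses fine
    if header is None:
        return []
    return [
        name
        for name in header
        if columns[name] and all(math.isnan(v) for v in columns[name])
    ]


# Shortened copy of the tf-backend log above (first two rows only).
tf_sample = """\
# step       rmse_val    rmse_trn   rmse_lc_val rmse_lc_trn   lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
      0      0.00e+00    0.00e+00           nan         nan    1.0e-02
    100      0.00e+00    0.00e+00           nan         nan    5.0e-03
"""

print(all_nan_columns(tf_sample))  # ['rmse_lc_val', 'rmse_lc_trn']
```

Running the same helper on the pt-backend log should instead list the two global columns, which is the expected behavior when only atomic labels are supplied.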

Steps to Reproduce

cd deepmd-kit/examples/water_tensor/dipole
# set up dipole_input.json
dp train dipole_input.json

Further Information, Files, and Links

No response

@ChiahsinChu ChiahsinChu added the bug label Jan 6, 2025
@njzjz
Member

njzjz commented Jan 6, 2025

It's related to atom vs atomic.

@njzjz
Member

njzjz commented Jan 6, 2025

The filename is atomic_*.npy in v2. #3975 seemed to change it.

@anyangml
Collaborator

anyangml commented Jan 6, 2025

The filename is atomic_*.npy in v2. #3975 seemed to change it.

I thought this should standardize the target key?

# standardize keys
data = {kk.replace("atomic", "atom"): vv for kk, vv in data.items()}
return data

maybe we need something similar in DeepmdDataSystem...
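To make the quoted snippet concrete, here is a small standalone sketch of the same key standardization applied to a plain dictionary. The function name `standardize_keys` and the sample batch are hypothetical; only the dictionary comprehension itself comes from the code quoted above.

```python
def standardize_keys(data):
    """Return a copy of `data` with "atomic" replaced by "atom" in each key."""
    return {key.replace("atomic", "atom"): value for key, value in data.items()}


# Hypothetical loaded batch; the keys and values are made up for illustration.
loaded = {
    "atomic_dipole": [[0.1, 0.2, 0.3]],
    "find_atomic_dipole": 1.0,
    "coord": [[0.0, 0.0, 0.0]],
}

standardized = standardize_keys(loaded)
print(sorted(standardized))  # ['atom_dipole', 'coord', 'find_atom_dipole']
```

Note that this only renames keys of data that has already been loaded. If a dictionary held both an `atomic_x` and an `atom_x` entry, the two would collapse into one key and the later entry would win.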

@ChiahsinChu
Contributor Author

How about this: ChiahsinChu@e214ee9

Then the users do not need to change their conventions about naming.

@njzjz
Copy link
Member

njzjz commented Jan 6, 2025

The filename is atomic_*.npy in v2. #3975 seemed to change it.

I thought this should standardize the file name?

# standardize keys
data = {kk.replace("atomic", "atom"): vv for kk, vv in data.items()}
return data

maybe we need something similar in DeepmdDataSystem...

The filename is defined here

path = set_name / (key + ".npy")

from

for kk in self.data_dict.keys():
    if self.data_dict[kk]["reduce"] is None:
        data["find_" + kk], data[kk] = self._load_data(
            set_name,
            kk,

It seems to me that line 569 doesn't affect the filename to be loaded.
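The point can be illustrated with a standalone sketch: because the file path is built as the key plus ".npy", a loader that asks for one prefix will not find a file saved under the other unless it tries both. The helpers `candidate_names` and `resolve_label_path` below are hypothetical and are not DeePMD-kit APIs, and this is not the change made in #4538; the directory name `set.000` is used only as an example.

```python
import tempfile
from pathlib import Path


def candidate_names(key):
    """List the file names to try for a label key, covering both prefixes."""
    names = [key + ".npy"]
    if key.startswith("atomic_"):
        names.append("atom_" + key[len("atomic_"):] + ".npy")
    elif key.startswith("atom_"):
        names.append("atomic_" + key[len("atom_"):] + ".npy")
    return names


def resolve_label_path(set_dir, key):
    """Return the first existing candidate file under `set_dir`, or None."""
    for name in candidate_names(key):
        path = Path(set_dir) / name
        if path.is_file():
            return path
    return None


# Demo: the data set uses the v2-style name, the caller asks for the new key.
with tempfile.TemporaryDirectory() as tmp:
    set_dir = Path(tmp) / "set.000"
    set_dir.mkdir()
    (set_dir / "atomic_dipole.npy").touch()  # empty placeholder file

    found = resolve_label_path(set_dir, "atom_dipole")
    found_name = found.name if found else None
    missing = resolve_label_path(set_dir, "atom_polarizability")

print(found_name)  # atomic_dipole.npy
print(missing)     # None
```

A lookup that only tried `key + ".npy"` would return nothing for `atom_dipole` in this demo, which is consistent with the missing reference data, and hence the nan values, seen in the tf-backend log.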

@anyangml
Collaborator

anyangml commented Jan 6, 2025

Correct, line 569 just makes it compatible with both atom_*.npy and atomic_*.npy, at least for the pt backend.

@anyangml
Collaborator

anyangml commented Jan 6, 2025

How about this: ChiahsinChu@e214ee9

Then the users do not need to change their conventions about naming.

I think this might work. @ChiahsinChu Can you create a PR with your fix?

@njzjz njzjz linked a pull request Jan 6, 2025 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Jan 7, 2025
Fix bug mentioned in
#4536


## Summary by CodeRabbit

- **Bug Fixes**
  - Updated atomic property and weight label naming conventions across the machine learning training and loss components to ensure consistent terminology.
  - Corrected placeholder key references in the training process to match updated label names.

@anyangml anyangml closed this as completed Jan 7, 2025