ReturnnTrainingJob: model in config is an absolute path #495
Comments
replacing
with
will not resolve the path in the manager (which would trigger an error if anybody ever activated the check), but I am afraid that once the config is written to file, there is no way to change the path. One could write a relative path as you said, but many other paths in the config file (e.g. inputs to the job) will still break. This holds true for any Job, not just ReturnnTrainingJob. I guess there is no way to move a job dir while the job is running. Another idea: manually rerun the task
This is more a general issue of absolute paths in Sisyphus, not only relevant for the ReturnnTrainingJob. For full transferability of "active" jobs, all paths need to be relative, which is currently not the case. A much bigger problem than with RETURNN configs is actually with Bliss corpora, because their paths are wrong across jobs, not only within a job. In general, I would consider this not a bug but rather an inconvenient design decision. At least I never expected that you can move Sisyphus setups without manual intervention (on the contrary, I was surprised by how well it works and how few things broke for my last setup). I think the first step needed towards relative setups would be a
Actually, now that you mention incorrect Bliss files: imagine a setup that does audio processing and then updates the corpus file to contain the paths to the processed audio files. There is no way Sisyphus could know that the file needs to be regenerated. We would need to implement a specific move_job function for each job that reruns specific tasks as needed.
If all paths were relative, this would not be much of a problem. Besides, in this example, I would argue that the job which generates the new Bliss corpus should also own the processed audio files, i.e. have a copy of them (or a hardlink).
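The "own a copy (or hardlink)" idea could look roughly like the sketch below. This is only an illustration, not code from i6_core or Sisyphus; the function name `own_file` and the directory layout are hypothetical.

```python
import os
import shutil


def own_file(src: str, job_output_dir: str) -> str:
    """Place `src` inside `job_output_dir` (hypothetical helper).

    Tries a hardlink first (no extra disk space) and falls back to a
    real copy, e.g. when source and target are on different filesystems.
    """
    os.makedirs(job_output_dir, exist_ok=True)
    dst = os.path.join(job_output_dir, os.path.basename(src))
    if os.path.lexists(dst):
        # Already owned by this job; nothing to do.
        return dst
    try:
        os.link(src, dst)
    except OSError:
        shutil.copy2(src, dst)
    return dst
```

With the audio files living inside the job's own output directory, the corpus could reference them by a path relative to that directory, so moving the job dir would keep corpus and audio together.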
Yes, but still, if the Job's output is
and in the corpus we write the audio path as
I would say that we should fix RASR so that the path is resolved relative to the Bliss XML file.
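To make the proposed behavior concrete, here is a minimal Python sketch of resolving audio paths relative to the corpus file's own location. It is not RASR code; it assumes an uncompressed Bliss XML whose `recording` elements carry `name` and `audio` attributes, and the function name `resolve_audio_paths` is hypothetical.

```python
import os
import xml.etree.ElementTree as ET


def resolve_audio_paths(corpus_xml_path: str) -> dict:
    """Map recording names to audio paths (hypothetical helper).

    Relative audio paths are interpreted relative to the directory
    containing the corpus XML file; absolute paths are kept as-is.
    """
    corpus_dir = os.path.dirname(os.path.abspath(corpus_xml_path))
    root = ET.parse(corpus_xml_path).getroot()
    resolved = {}
    for recording in root.iter("recording"):
        audio = recording.get("audio")
        if audio is None:
            continue
        if not os.path.isabs(audio):
            audio = os.path.normpath(os.path.join(corpus_dir, audio))
        resolved[recording.get("name")] = audio
    return resolved
```

Under this convention, a job directory containing both the corpus file and its audio could be moved as a whole without rewriting the XML.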
For RASR I think you can already specify relative paths within the bliss XML file. You would need to specify the audio dir.
Gets set here:
Fine, then bundle files: those do not support relative paths.
I would always argue that whenever some software does not support this properly, we should fix it. I'm also not sure whether this is really such a common problem. Besides Bliss or bundle files, what else is there? I guess some other similar training jobs referencing into themselves, e.g.
i6_core/returnn/training.py
Line 223 in cf32146
Once you move the job dir to a new location, this will thus break.
More annoyingly, RETURNN silently and recursively creates non-existing directories for model, so it will not crash but will run without errors. If you move a job with existing checkpoints, it will not find the old checkpoints and will start training again from scratch. It will also overwrite the learning rates file, so afterwards the old checkpoints cannot really be used anymore (if you care about having a corresponding correct learning rates file), and the learning rates file will contain mixed values from the old and new training runs.
I'm not sure whether we consider it a bug that we have absolute paths here. We could fix this by using a relative path. There might be a number of similar issues here, and probably in other jobs as well.
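A rough sketch of the relative-path idea follows, together with a guard against the silent restart described above. None of this is taken from i6_core or RETURNN; the helper names, the `work`/`output/models` layout and the `epoch` prefix are assumptions made for illustration only.

```python
import os


def relative_model_path(work_dir: str, model_dir: str, prefix: str = "epoch") -> str:
    """Checkpoint prefix expressed relative to the job's working directory."""
    return os.path.join(os.path.relpath(model_dir, start=work_dir), prefix)


def resolve_model_path(config_value: str, cwd: str) -> str:
    """What a relative config value points to when run from `cwd`."""
    return os.path.normpath(os.path.join(cwd, config_value))


def check_model_dir(model_path: str) -> None:
    """Fail loudly instead of silently training from scratch (hypothetical guard)."""
    model_dir = os.path.dirname(model_path)
    if not os.path.isdir(model_dir):
        raise FileNotFoundError(f"model dir does not exist: {model_dir}")


# The value written to the config does not depend on where the job lives.
rel = relative_model_path("/old/job/work", "/old/job/output/models")
# After moving the job dir, the same value resolves inside the new location.
moved = resolve_model_path(rel, "/new/job/work")
```

This only helps for paths that point into the job itself; inputs from other jobs would still be absolute, as noted in the comments.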