The training does not converge; the loss goes up or plateaus instead of going down
Please wait for a couple of epochs (~10); it is normal for strange things to happen in the beginning. If that does not help, reduce the learning rate by setting
`param.HyperParameter.opt_param.lr`. Try decreasing the learning rate in steps, e.g. by a factor of 0.5 each time.
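As a minimal sketch, reducing the learning rate could look like this (the `param` object here is a hypothetical stand-in for DECODE's nested parameter structure, which in practice is loaded from the parameter `.yaml` file):

```python
from types import SimpleNamespace

# Hypothetical stand-in for DECODE's nested parameter object.
param = SimpleNamespace(
    HyperParameter=SimpleNamespace(opt_param=SimpleNamespace(lr=1e-3))
)

# Reduce the learning rate by a factor of 0.5; repeat if training still diverges.
param.HyperParameter.opt_param.lr *= 0.5
print(param.HyperParameter.opt_param.lr)  # 0.0005
```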
What is the deal with
Please refer to our extensive note on this in the Fitting notebook, which you can obtain by following the instructions in the Tutorial.
Errors and Software Issues
I get errors when going through the example notebooks.
This might be due to a mismatch between your locally installed DECODE version and the version of the Jupyter notebooks. By default, this documentation points to the latest DECODE release and its respective example notebooks. We advise updating DECODE; see the update section below the installation instructions. You may need to re-download the most recent notebooks as well.
`CUDA out of memory` errors
This might happen if your GPU is

- doing multiple things, i.e., it is used not only for computation but also for driving the display
- old or has too little memory
If you have multiple GPU devices, you can select which one to use (in the device setting, `1` corresponds to the respective index of the device, starting with 0). If you don't have multiple devices, you should try to reduce the batch size.
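A common first response to out-of-memory errors is to halve the batch size until training fits on the device. The helper below is a hypothetical illustration, not part of DECODE (in practice you would change the batch size entry in the parameter `.yaml` file):

```python
def shrink_batch_size(batch_size: int, minimum: int = 1) -> int:
    """Halve the batch size -- a common response to CUDA out-of-memory errors."""
    return max(batch_size // 2, minimum)

# e.g. starting from a batch size of 64:
print(shrink_batch_size(64))  # 32
```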
I get other CUDA errors, e.g. `CUDA error: unspecified launch failure`
Please check whether you have a reasonably up-to-date CUDA driver; it is always a good idea to update it. Moreover, you can check which cudatoolkit version is installed from the Terminal / Anaconda Prompt.
You can also try pinning the cudatoolkit version by specifying `cudatoolkit=10.1` instead of plain `cudatoolkit` (without a version) when installing the decode environment.
I get errors like `No CUDA capable device found` or CUDA driver issues.
This could mean that you really don't have a CUDA-capable device (e.g. only an AMD GPU), or that there are driver issues. Please check the following:
```python
import spline
import torch

print(torch.cuda.is_available())
print(spline.cuda_compiled)
print(spline.cuda_is_available())
```
All of the above should return `True`. If the first one returns `False`, you are likely experiencing a CUDA driver issue.
Training breaks due to
This can happen particularly often on Windows, and there is no single answer. You might want to decrease the number of CPU workers or disable multiprocessing altogether. To do so, start the training with a changed number of workers by adding `-w [number of workers]` at the end of the python command. Specify `-w 0` to disable multiprocessing if even 2 workers lead to an error. Alternatively, change
`param -> Hardware -> num_worker_train`. Note that this can slow down training. You can also try changing the multiprocessing sharing strategy in the .yaml file:
`param -> Hardware -> torch_multiprocessing_sharing_strategy`. The available sharing strategies depend on your system; please have a look at PyTorch Multiprocessing.
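As a hedged sketch, the relevant entries in the parameter `.yaml` file might look like the fragment below (the key names mirror the `param -> Hardware -> ...` paths above; the values are illustrative, and the exact surrounding structure of the file is an assumption here):

```yaml
Hardware:
  num_worker_train: 0  # 0 disables multiprocessing for data loading
  torch_multiprocessing_sharing_strategy: file_system  # available strategies are system-dependent
```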