FAQ

Performance

  • The training does not converge; the loss goes up or plateaus instead of going down

    Please wait for a couple of epochs (~10); it is normal for strange things to happen in the beginning. If that does not help, reduce the learning rate by setting param.HyperParameter.opt_param.lr. Try lowering it in steps, e.g. by a factor of 0.5 each time, as in the sketch below.
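
    A minimal sketch of lowering the learning rate programmatically, assuming the parameter set is loaded via decode.utils.param_io as in the example notebooks (the loader name and the file name are assumptions and may differ in your DECODE version):

    import decode.utils.param_io as param_io

    # Load the parameter file (hypothetical file name).
    param = param_io.load_params("param_run.yaml")

    # Lower the learning rate in steps, e.g. by a factor of 0.5.
    param.HyperParameter.opt_param.lr *= 0.5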

  • What is the deal with Mirror2D?

    Please refer to our extensive note on this in the Fitting notebook, which you can obtain by following the instructions in the Tutorial.

Errors and Software Issues

  • I get errors when going through the example notebooks.

    This might be due to a mismatch between your locally installed DECODE version and the version of the Jupyter notebooks. The default version of this documentation points to the latest DECODE release and the respective latest example notebooks. We advise updating DECODE; see the update section below the installation instructions. You may need to re-download the most recent notebooks as well. A quick way to check your installed version is sketched below.
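
    A minimal sketch for checking which DECODE version is installed locally, assuming the package exposes __version__ (as most Python packages do):

    import decode

    # Print the locally installed DECODE version so you can compare it
    # against the version the notebooks were written for.
    print(decode.__version__)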

  • I get CUDA out of memory errors

    This might happen if your GPU is

    1. doing multiple things, i.e., being used not only for computation but also for driving the display

    2. old or has too little memory

    If you have multiple GPU devices, you may set device='cuda:1' (where 1 is the zero-based index of the device). If you don’t have multiple devices, try reducing the batch size via param.HyperParameter.batch_size. Both options are sketched below.
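
    A minimal sketch of both options, assuming the parameter set is loaded as param (see the learning-rate example above); the exact location of the device setting, here assumed under param.Hardware, may differ in your DECODE version:

    # Option 1: if a second GPU is available, direct DECODE to it.
    # (placing `device` under param.Hardware is an assumption)
    param.Hardware.device = 'cuda:1'

    # Option 2: reduce the batch size to lower GPU memory usage.
    # 32 is a hypothetical value; halve whatever your param file uses.
    param.HyperParameter.batch_size = 32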

  • I get other CUDA errors, e.g., CUDA error: unspecified launch failure

    Please check whether your CUDA driver is reasonably up to date; it is always a good idea to update it. You can also check which cudatoolkit version is installed by running conda list in the Terminal / Anaconda Prompt.

    You can also try pinning cudatoolkit to a different version, e.g. by specifying cudatoolkit=10.1 instead of plain cudatoolkit (without a version) when installing the decode environment. The sketch below shows how to check, from within Python, which CUDA version your PyTorch build expects.
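
    A minimal sketch for cross-checking which CUDA and cuDNN versions your PyTorch build was compiled against, using standard PyTorch attributes:

    import torch

    # CUDA version PyTorch was built against (None for CPU-only builds).
    print(torch.version.cuda)

    # cuDNN version of the PyTorch build.
    print(torch.backends.cudnn.version())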

  • I get errors like No CUDA capable device found or CUDA driver issues.

    This could mean that you really don’t have a CUDA-capable device (e.g., only an AMD GPU), or that there are driver issues. Please check the following:

    import spline  # DECODE's spline (PSF) extension
    import torch

    print(torch.cuda.is_available())   # does PyTorch see a CUDA device?
    print(spline.cuda_compiled)        # was the spline extension compiled with CUDA support?
    print(spline.cuda_is_available())  # can the spline extension access a CUDA device?
    

    All of the above should print True. If the first one returns False, you are likely experiencing a CUDA driver issue.

  • Training breaks due to a multiprocessing or broken pipe error.

    This happens particularly often on Windows, and there is no single answer. You might want to decrease the number of CPU workers or disable multiprocessing altogether. To do so, start the training with a changed number of workers by appending -w [number of workers] to the python command; specify -w 0 to disable multiprocessing if even 2 workers lead to an error. Alternatively, change the .yaml file here: param -> Hardware -> num_worker_train. Note that this can slow down training.

    You can also try changing the multiprocessing sharing strategy in the .yaml file: param -> Hardware -> torch_multiprocessing_sharing_strategy. The available strategies depend on your system; please have a look at PyTorch Multiprocessing. A sketch of both settings follows below.
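
    A minimal sketch of changing both settings programmatically, assuming the parameter set is loaded as param (the attribute paths mirror the .yaml paths above); 'file_system' is one of PyTorch's built-in sharing strategies, but which strategies are available depends on your system:

    import torch.multiprocessing as mp

    # List the sharing strategies PyTorch supports on this system.
    print(mp.get_all_sharing_strategies())

    # Disable multiprocessing entirely, equivalent to passing -w 0.
    param.Hardware.num_worker_train = 0

    # Or keep workers but switch the sharing strategy (system dependent).
    param.Hardware.torch_multiprocessing_sharing_strategy = 'file_system'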