Two usage issues for Pumas/DeepPumas on a cluster

Hi Pumas team,

I have an academic cluster license for Pumas/DeepPumas. I have set it up successfully, but I am consistently running into two issues with the licensing manager.

  1. I can launch Pumas/DeepPumas on the login node and interactive nodes, but when I try to launch jobs (e.g. array jobs where I want to test a bunch of hyper-parameters/seeds), the license manager randomly (maybe on 5-30% of the jobs, it varies) throws
[ Info: Current license is invalid or expired
┌ Error: Cannot connect to licensing server. Access to license server is required for license verification
└ @ PumasLicenseManager /build/run/_work/DeepPumasSystemImages/DeepPumasSystemImages/julia_depot/packages/PumasLicenseManager/vZL3A/src/PumasLicenseManager.jl:140
┌ Error: Exiting...
└ @ PumasLicenseManager /build/run/_work/DeepPumasSystemImages/DeepPumasSystemImages/julia_depot/packages/PumasLicenseManager/vZL3A/src/PumasLicenseManager.jl:170

The cluster consists of two types of nodes: some with AMD CPUs, others with Intel CPUs. For some reason the license manager always reports the license as invalid or expired on the AMD nodes, but not on the Intel nodes. Both have the same kernel version (or so the cluster maintainers claim). I activated the license on an Intel node.

Regarding the random failures in 5-30% of the jobs: I ran those only on the Intel nodes. From what I could ascertain, those nodes have identical architectures, kernel versions, OS and anything else that I checked (some CPUs are overclocked, but I wouldn’t think that matters?), so I don’t understand why the license manager is throwing these errors. Is it just an internet connectivity issue?

  2. The second issue is a bit more usage related. A lot of my use cases for running Pumas/DeepPumas on a cluster involve running a bunch of different hyper-parameters/seeds concurrently. One way to do that (as given in the example in the docs) is to get an allocation on the cluster and spawn workers within the REPL via Distributed. However, if the runs are not interdependent, I can just use the cluster’s own scheduler and submit array jobs that run deeppumas --threads=16 some_script.jl, the same way I would run julia --threads=16 some_script.jl, and this works fine. I thought this approach would be more time and resource efficient for lots of individual jobs that each require a relatively small amount of compute (e.g. 16 cores and 8GB memory, but there are 100s of them). However, if I try launching such array jobs, I get
┌ Error: Error in License Manager
│   exception =
│    Error in LicenseManager (22): Could not handle error. Status code: 429. {"error_msg":"Number of API requests from IP Address has been exceeded. Please try again later."}
│
└ @ PumasLicenseManager /build/run/_work/DeepPumasSystemImages/DeepPumasSystemImages/julia_depot/packages/PumasLicenseManager/vZL3A/src/PumasLicenseManager.jl:156
┌ Error: Exiting...
└ @ PumasLicenseManager /build/run/_work/DeepPumasSystemImages/DeepPumasSystemImages/julia_depot/packages/PumasLicenseManager/vZL3A/src/PumasLicenseManager.jl:170

which is, in hindsight, expected. I am currently staggering the start of the individual jobs in the array job so as to not overwhelm the license manager. Is there anything I could do that would, in principle, avoid this issue? Even with staggering, a large number of requests may still arrive at the same time by chance, although it’s less likely.
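
For reference, the staggering described above could look roughly like the following. This is only a sketch, not the exact script in use: the helper name stagger_delay, the slot count of 20 and the 15-second step are all illustrative assumptions. The only things taken from Slurm itself are the SLURM_ARRAY_TASK_ID environment variable and the idea of sleeping before the launch.

```shell
# Hypothetical helper: map an array task ID to a start delay in seconds.
# Tasks are spread over `slots` start slots that are `step` seconds apart,
# so tasks released at the same moment do not all contact the license
# server in the same instant.
stagger_delay() {
    local task_id="$1" slots="$2" step="$3"
    echo $(( (task_id % slots) * step ))
}

# Intended use inside the array job script (values are assumptions):
#   sleep "$(stagger_delay "${SLURM_ARRAY_TASK_ID:-0}" 20 15)"
#   deeppumas --threads=16 some_script.jl
```

With these numbers, task 7 waits 105 seconds and task 20 starts immediately. The delay is deterministic, so it reduces bursts but cannot rule them out, for example when tasks from different slots are released by the scheduler at different times.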

For issue 1, on the AMD nodes where you get the license manager error, do those nodes have internet access?

For issue 2:

  1. Are you submitting these as individual jobs, or as a single Slurm job array?
  2. What’s the size of your average Slurm job array?
  3. Can you send me the public IP addresses for your compute cluster? You can email them to me if you don’t want to post them publicly.

@DilumAluthge

For issue 1, yes the AMD nodes have internet access (checked by just ping -c 4 www.google.com)

For issue 2:
1. Submitting as a single Slurm job array (that’s where the large number of license manager requests comes from, I’d assume)
2. The largest one so far was 190 (shouldn’t be significantly larger in the future, not sure of the average, ran just a few so far)
3. Will check with our IT dept. and e-mail if they’re okay with me doing that

For issue 1, I have to look into this further. We may need to set up a brief call - I’ll email you.

For issue 2, as a short-term workaround, since you’re using Slurm job arrays, can you limit the number of simultaneous tasks? E.g. #SBATCH --array=1-200%20? That might help avoid rate-limiting issues.
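
As a rough illustration of that workaround, a throttled job array script might look like the sketch below. The job name is a placeholder, and the resource values (16 CPUs, 8 GB) are simply the ones mentioned earlier in the thread; adjust them to your cluster. The part that matters is the %20 suffix on --array, which tells Slurm to run at most 20 tasks of the array at the same time.

```shell
#!/bin/bash
#SBATCH --job-name=deeppumas-sweep   # placeholder name (assumption)
#SBATCH --array=1-200%20             # 200 tasks, at most 20 running at once
#SBATCH --cpus-per-task=16
#SBATCH --mem=8G

# Each array task runs one independent hyper-parameter/seed configuration.
# some_script.jl is assumed to pick its configuration from
# SLURM_ARRAY_TASK_ID, which Slurm sets for every task in the array.
deeppumas --threads=16 some_script.jl
```

A lower limit after the % means fewer simultaneous license checks, at the cost of a longer total wall time for the array.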

For issue 2, the longer-term solution will likely require me to have your public IP addresses. Let me know if your IT department has given permission to share those with me.