NOTE: This discussion was originally posted to the ISSM Forum, which has been decommissioned. It is reproduced here for reference. Please feel free to contribute to this discussion, as it seems that the original and follow-up questions were not answered, or start a new discussion.
CHRISTIExy
Hi there,
I'm trying to run ISSM programs in MATLAB, remotely on the Cedar server of Compute Canada. I added an md.cluster=computecanada(...) line to step 3 in ~/trunk/examples/shakti/runme.m, but it failed to run, generating the following messages:
checking model consistency
marshalling file moulin.bin
uploading input file and queuing script
Enter passphrase for key '/Users/green/.ssh/id_rsa':
moulin-07-20-2023-15-30-18-49823.tar.gz 0% 0 0.0KB/s --:-- ETA
moulin-07-20-2023-15-30-18-49823.tar.gz 100% 182KB 2.0MB/s 00:00
launching solution sequence on remote cluster
Enter passphrase for key '/Users/green/.ssh/id_rsa':
Due to MODULEPATH changes, the following have been reloaded:
1) openmpi/4.0.3
cedar1.cedar.computecanada.ca
sbatch: error: Invalid --mail-type specification
ssh -l myusername cedar.computecanada.ca "cd /home/myusername/scratch/trunk/execution && rm -rf ./moulin-07-20-2023-15-30-18-49823 && mkdir moulin-07-20-2023-15-30-18-49823 && cd moulin-07-20-2023-15-30-18-49823 && mv ../moulin-07-20-2023-15-30-18-49823.tar.gz ./ && tar -zxf moulin-07-20-2023-15-30-18-49823.tar.gz && hostname && sbatch moulin.queue ": Signal 127
waiting for /home/myusername/scratch/trunk/execution/moulin-07-20-2023-15-30-18-49823/moulin.lock hold on... (Ctrl+C to exit)
I followed the instructions at https://issm.ess.uci.edu/trac/issm/wiki/computecanada and added a file named cedar_settings.m in $ISSM_DIR/src/m. But I'm not sure that file is actually picked up by the function computecanada(), since in order to run the program I'm required to assign all of the variables in the function call myself, as shown above.
I'm trying to figure out what went wrong. Do you have any suggestions? Thanks for your advice!
mathieumorlighem
Hi CHRISTIExy
it seems like the problem is sbatch: error: Invalid --mail-type specification. Make sure to set md.cluster.mailtype to something that the server supports (see https://slurm.schedmd.com/sbatch.html), maybe 'ALL'?
Cheers
Mathieu
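For reference, the fix is a one-line setting on the cluster object; a minimal sketch (the listed values are ones sbatch's --mail-type accepts per the Slurm documentation):
md.cluster.mailtype = 'ALL';   % must be a value sbatch --mail-type accepts, e.g. NONE, BEGIN, END, FAIL, ALL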
CHRISTIExy
mathieumorlighem,
Thank you for your reply! I set md.cluster.mailtype to 'ALL' and that problem is resolved. But there's another issue: even though I'm sure I entered the correct passphrase, it keeps asking for it. I've set up an SSH key, but I still need to enter my passphrase to access the server. Furthermore, the program doesn't appear to be running, since the moulin.log and moulin.outlog files cannot be found. The following is part of the output from running the program.
cedar1.cedar.computecanada.ca
Submitted batch job 7894033
sbatch: NOTE: Your memory request of 2048M was likely submitted as 2G. Please note that Slurm interprets memory requests denominated in G as multiples of 1024M, not 1000M.
waiting for /home/myusername/scratch/trunk/execution/moulin-07-21-2023-09-54-19-49823/moulin.lock hold on... (Ctrl+C to exit)
checking for job completion (time: 0 min 5 sec)
Enter passphrase for key '/Users/green/.ssh/id_rsa':
checking for job completion (time: 0 min 16 sec)
Enter passphrase for key '/Users/green/.ssh/id_rsa':
checking for job completion (time: 0 min 51 sec)
Enter passphrase for key '/Users/green/.ssh/id_rsa':
checking for job completion (time: 1 min 4 sec)
Enter passphrase for key '/Users/green/.ssh/id_rsa':
mathieumorlighem
Hi!
This behavior is actually expected, but we can turn it off. In short, ISSM checks for job completion by looking for a .lock file in the execution directory on the cluster, and (by default) it checks every 5 seconds. Since your key has a passphrase, you probably don't want your local machine to do that. To turn it off, set md.settings.waitonlock = 0. You will need to download the results manually once the job is done.
Best
Mathieu
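In practice, the submission side of what Mathieu describes looks roughly like this (a sketch; the Transient solution and loadresultsfromcluster call mirror the moulin run used elsewhere in this thread):
md.settings.waitonlock = 0;   % submit without polling the cluster for the .lock file
md = solve(md, 'Transient');  % returns right after submission
% ...once the Slurm job has finished on the cluster, download the results manually:
md = loadresultsfromcluster(md);
% (if MATLAB no longer knows which runtime directory was used, see the nowaitlock steps later in the thread)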
CHRISTIExy
mathieumorlighem,
Thanks for your answer! That message no longer appears, but I ran into another problem: the moulin.outbin and moulin.outlog files don't appear to exist. When I run the program on my local computer without using the cluster, this error doesn't occur. Could you help me with this problem? Many thanks.
>> md=loadresultsfromcluster(md)
Enter passphrase for key '/Users/green/.ssh/id_rsa':
scp: /home/myusername/scratch/trunk/execution/moulin-07-24-2023-10-38-46-516//{moulin.outlog,moulin.errlog,moulin.outbin}: No such file or directory
Warning: issmscpin error message: could not scp moulin.outlog
In issmscpin (line 65)
In computecanada/Download (line 129)
In loadresultsfromcluster (line 47)
Warning: issmscpin error message: could not scp moulin.errlog
In issmscpin (line 65)
In computecanada/Download (line 129)
In loadresultsfromcluster (line 47)
Warning: issmscpin error message: could not scp moulin.outbin
In issmscpin (line 65)
In computecanada/Download (line 129)
In loadresultsfromcluster (line 47)
Warning:
Binary file moulin.outbin not found!
This typically happens when the run crashed.
Please check for error messages above or in the outlog
In loadresultsfromdisk (line 16)
In loadresultsfromcluster (line 50)
mathieumorlighem
Apologies, I forgot to mention that there is a subtlety: MATLAB will not remember how it named the directory where the job is running. The easiest fix is to follow the steps at https://issm.ess.uci.edu/trac/issm/wiki/nowaitlock. Let me know if you have any questions!
Mathieu
CHRISTIExy
I added this step 4 according to the website, but I'm not sure if it's correct.
if any(steps==4)
    md=loadmodel('MoulinParam2');
    md.cluster=computecanada('port', 0, 'login', 'myusername', 'name', 'narval.computecanada.ca', ...
        'time', 7, 'codepath', '/home/myusername/scratch/trunk/bin', ...
        'executionpath', '/home/myusername/scratch/trunk/execution', ...
        'projectaccount', 'def-nameofmyprofessor', 'mailtype', 'ALL');
    md.settings.waitonlock = 0;
    md.miscellaneous.name = 'moulin';
    md=solve(md,'Transient','runtimename',false);
    save MoulinTransient md
end
Once I enter the passphrase, it immediately tells me to use md=loadresultsfromcluster(md) to load the results. The same error occurred when trying to load them. I also tried using 'loadonly' in step 4, but it produced the same error, as follows.
narval1.narval.calcul.quebec
Submitted batch job 19481855
Model results must be loaded manually with md=loadresultsfromcluster(md);
md=loadresultsfromcluster(md)
Enter passphrase for key '/Users/green/.ssh/id_rsa':
scp: /home/puxinyi/scratch/trunk/execution/moulin//{moulin.outlog,moulin.errlog,moulin.outbin}: No such file or directory
Warning: issmscpin error message: could not scp moulin.outlog
In issmscpin (line 65)
In computecanada/Download (line 129)
In loadresultsfromcluster (line 47)
Warning: issmscpin error message: could not scp moulin.errlog
In issmscpin (line 65)
In computecanada/Download (line 129)
In loadresultsfromcluster (line 47)
Warning: issmscpin error message: could not scp moulin.outbin
In issmscpin (line 65)
In computecanada/Download (line 129)
In loadresultsfromcluster (line 47)
Warning:
Binary file moulin.outbin not found!
This typically happens when the run crashed.
Please check for error messages above or in the outlog
In loadresultsfromdisk (line 16)
In loadresultsfromcluster (line 50)
mathieumorlighem
CHRISTIExy,
You forgot the most important part 🙂
Once the job is completed, you should call the same code with 'loadonly',1 (which is an option of solve).
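Put together, the two-pass pattern Mathieu describes might look roughly like this (a sketch; the solve options and names follow the step 4 script above):
% Pass 1: submit only, with a fixed run directory name and no waiting
md.settings.waitonlock = 0;
md.miscellaneous.name = 'moulin';
md = solve(md, 'Transient', 'runtimename', false, 'loadonly', 0);
% Pass 2 (run later, once Slurm reports the job finished): same call, but load instead of submit
md = solve(md, 'Transient', 'runtimename', false, 'loadonly', 1);
save MoulinTransient md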
CHRISTIExy
mathieumorlighem,
Thank you for reminding me. I set loadonly = 0 in the first simulation and then ran it again with loadonly = 1, but it still didn't work and generated the same error (file not found). Is there anything wrong with the scripts?
if any(steps==4)
    md=loadmodel('MoulinParam2');
    % Run the model remotely on the Compute Canada cluster
    md.cluster=computecanada('port', 0, 'login', 'myusername', 'name', 'narval.computecanada.ca', ...
        'time', 7, 'codepath', '/home/myusername/scratch/trunk/bin', ...
        'executionpath', '/home/myusername/scratch/trunk/execution', ...
        'projectaccount', 'def-nameofmyprofessor', 'mailtype', 'ALL');
    loadonly = 1;
    md.settings.waitonlock = 0;
    md.miscellaneous.name = 'moulin';
    md=solve(md,'Transient','runtimename',false,'loadonly',loadonly);
    if loadonly
        save MoulinTransient md
    end
end
mathieumorlighem
CHRISTIExy,
Are you sure that the job completed? The path should be md.cluster.executionpath + /moulin: make sure you have an outlog, errlog, and outbin file there. If they are not there, that means that something went wrong.
Mathieu
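One quick way to check this directly from MATLAB (a sketch; the username, host, and path are the placeholders used earlier in the thread, and it assumes the ssh key is available):
system(['ssh myusername@cedar.computecanada.ca ', ...
    '"ls -l /home/myusername/scratch/trunk/execution/moulin"']);
% expect moulin.outlog, moulin.errlog, and moulin.outbin once the job has finished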
CHRISTIExy
mathieumorlighem,
I found something related to the problem of the missing outbin, outlog, and errlog files. When I run the same program locally, without computecanada, it only works if I first launch MATLAB from the Terminal like this:
source $ISSM_DIR/etc/environment.sh
cd /Applications/MATLAB_R2023a.app/bin
./matlab
Only after opening MATLAB this way can I run the program properly; otherwise I get a '.outbin not found' error. But I'm not sure how to avoid this error on computecanada. I tried 'loadonly' and made sure the program had finished running on computecanada:
[myusername@cedar5 trunk]$ squeue -j 8203199
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
8203199 myusername def-xxx moulin R 19:58 1 8 N/A 2G cdr556 (Prolog)
[myusername@cedar5 trunk]$ squeue -j 8203199
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
8203199 myusername def-xxx moulin CG 19:51 1 8 N/A 2G cdr556 (NonZeroExitCode)
[myusername@cedar5 trunk]$ squeue -j 8203199
JOBID USER ACCOUNT NAME ST TIME_LEFT NODES CPUS TRES_PER_N MIN_MEM NODELIST (REASON)
Nonetheless, I ran into the same problem:
scp: /home/puxinyi/scratch/trunk/execution/moulin-07-25-2023-10-18-49-28503//{moulin.outlog,moulin.errlog,moulin.outbin}: No such file or directory
Also, I checked the .errlog on computecanada:
cat /home/myusername/scratch/trunk/execution/moulin-07-25-2023-09-59-34-24631/moulin.errlog
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: or try http://valgrind.org/ on GNU/linux and Apple MacOS to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.1, Apr 28, 2022
[0]PETSC ERROR: /home/myusername/scratch/trunk/bin/issm.exe on a named cdr556.int.cedar.computecanada.ca by myusername Tue Jul 25 10:00:24 2023
[0]PETSC ERROR: Configure options --prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc/3.17.1 --with-hdf5=1 --with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-cxx-dialect=C++14 --with-memalign=64 --with-python=no --with-mpi4py=no --download-party=1 --download-superlu_dist=1 --download-SuiteSparse=1 --download-superlu=1 --download-metis=1 --download-ptscotch=1 --download-hypre=1 --download-spooles=1 --download-chaco=1 --download-strumpack=1 --download-spai=1 --download-parmetis=1 --download-slepc=1 --download-hpddm=1 --download-ml=1 --download-prometheus=1 --download-triangle=1 --download-mumps=1 --download-mumps-shared=0 --download-ptscotch-shared=0 --download-superlu-shared=0 --download-superlu_dist-shared=0 --download-parmetis-shared=0 --download-metis-shared=0 --download-ml-shared=0 --download-SuiteSparse-shared=0 --download-hypre-shared=0 --download-prometheus-shared=0 --download-spooles-shared=0 --download-chaco-shared=0 --download-slepc-shared=0 --download-spai-shared=0 --download-party-shared=0 --with-cc=mpicc --with-cxx=mpicxx --with-c++-support --with-fc=mpifort --CFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --CXXFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC -DOMPI_SKIP_MPICXX -DMPICH_SKIP_MPICXX" --FFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --with-mpi=1 --with-build-step-np=8 --with-shared-libraries=1 --with-debugging=0 --with-pic=1 --with-x=0 --with-windows-graphics=0 --with-scalapack=1 --with-scalapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/scalapack/2.1.0/lib/libscalapack.a,libflexiblas.a,libgfortran.a]" --with-blaslapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/flexiblas/3.0.4/lib/libflexiblas.a,libgfortran.a]" --with-hdf5=1 --with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-fftw=1 --with-fftw-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8
[0]PETSC ERROR: #1 User provided function() at unknown file:0
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
In: PMI_Abort(59, N/A)
slurmstepd: error: *** STEP 8202307.0 ON cdr556 CANCELLED AT 2023-07-25T10:00:25 ***
srun: error: cdr556: task 0: Exited with exit code 16
srun: Terminating StepId=8202307.0
I hope this information is useful in solving the issue. Your reply is highly appreciated.
mathieumorlighem
OK, so you DO have an errlog on the supercomputer, but it shows a segmentation fault (weird, because you should also have a non-empty outlog). Have you ever been able to run anything involving ISSM on Compute Canada?
CHRISTIExy
mathieumorlighem,
So far I have only run it on my personal computer. I'm trying to run the SHAKTI model on the Cedar server of Compute Canada to make it run faster.
mathieumorlighem
CHRISTIExy,
Could you try one simple run, like test101 in test/NightlyRun/? Tell us what you get in the errlog and outlog.
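For reference, pointing a nightly test at the cluster just means setting md.cluster before the solve call in test/NightlyRun/test101.m; a minimal sketch, reusing the placeholder arguments from the step 4 script above (the 'time' value here is arbitrary):
md.cluster = computecanada('port', 0, 'login', 'myusername', ...
    'name', 'cedar.computecanada.ca', 'time', 1, ...
    'codepath', '/home/myusername/scratch/trunk/bin', ...
    'executionpath', '/home/myusername/scratch/trunk/execution', ...
    'projectaccount', 'def-nameofmyprofessor', 'mailtype', 'ALL');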
CHRISTIExy
mathieumorlighem,
I added the md.cluster=computecanada(...) line to the test101 file, as before. When running it, it generated this error:
test101
boundary conditions for stressbalance model: spc set as zero
no balancethickness.thickening_rate specified: values set as zero
uploading input file and queuing script
Enter passphrase for key '/Users/green/.ssh/id_rsa':
test101-07-26-2023-06-39-39-69827.tar.gz 0% 0 0.0KB/s --:-- ETA
test101-07-26-2023-06-39-39-69827.tar.gz 100% 78KB 280.6KB/s 00:00
launching solution sequence on remote cluster
Enter passphrase for key '/Users/green/.ssh/id_rsa':
Lmod is automatically replacing "intel/2020.1.217" with "gcc/9.3.0".
Due to MODULEPATH changes, the following have been reloaded:
1) blis/0.8.1 2) flexiblas/3.0.4 3) openmpi/4.0.3
narval1.narval.calcul.quebec
Submitted batch job 19559296
Model results must be loaded manually with md=loadresultsfromcluster(md);
Unrecognized field name "StressbalanceSolution".
Error in test101 (line 30)
(md.results.StressbalanceSolution.Vx),...
md.results
ans =
struct with no fields.
However, after deleting the md.cluster=computecanada() line, it ran without error. Is it possible that the error was triggered by the files not being found?
test101
boundary conditions for stressbalance model: spc set as zero
no balancethickness.thickening_rate specified: values set as zero
uploading input file and queuing script
launching solution sequence on remote cluster
Ice-sheet and Sea-level System Model (ISSM) version 4.22
(website: http://issm.jpl.nasa.gov/ contact: issm@jpl.nasa.gov)
call computational core:
write lock file:
FemModel initialization elapsed time: 0.041831
Total Core solution elapsed time: 0.103404
Linear solver elapsed time: 0.072022 (70%)
Total elapsed time: 0 hrs 0 min 0 sec
md.results
ans =
struct with fields:
StressbalanceSolution: [1×1 struct]
mathieumorlighem
OK, so it looks like it may actually be working, since you get Submitted batch job 19559296, but ISSM did not wait for the results. Can you check your execution directory on the cluster to see if you have a test101 directory, and see what the outlog and errlog look like?
CHRISTIExy
mathieumorlighem,
Of course. Here is the test101 execution directory and the file content.
ls /home/myusername/scratch/trunk/execution/test101-07-26-2023-06-39-39-69827
slurm-19559296.out test101.bin test101.queue
test101-07-26-2023-06-39-39-69827.tar.gz test101.errlog test101.toolkits
cat /home/myusername/scratch/trunk/execution/test101-07-26-2023-06-39-39-69827/test101.errlog
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: or try http://valgrind.org/ on GNU/linux and Apple MacOS to find memory corruption errors
[0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[0]PETSC ERROR: to get more information on the crash.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.1, Apr 28, 2022
[0]PETSC ERROR: /home/myusername/scratch/trunk/bin/issm.exe on a named nc10536.narval.calcul.quebec by myusername Wed Jul 26 09:44:53 2023
[0]PETSC ERROR: Configure options --prefix=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/petsc/3.17.1 --with-hdf5=1 --with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-cxx-dialect=C++14 --with-memalign=64 --with-python=no --with-mpi4py=no --download-party=1 --download-superlu_dist=1 --download-SuiteSparse=1 --download-superlu=1 --download-metis=1 --download-ptscotch=1 --download-hypre=1 --download-spooles=1 --download-chaco=1 --download-strumpack=1 --download-spai=1 --download-parmetis=1 --download-slepc=1 --download-hpddm=1 --download-ml=1 --download-prometheus=1 --download-triangle=1 --download-mumps=1 --download-mumps-shared=0 --download-ptscotch-shared=0 --download-superlu-shared=0 --download-superlu_dist-shared=0 --download-parmetis-shared=0 --download-metis-shared=0 --download-ml-shared=0 --download-SuiteSparse-shared=0 --download-hypre-shared=0 --download-prometheus-shared=0 --download-spooles-shared=0 --download-chaco-shared=0 --download-slepc-shared=0 --download-spai-shared=0 --download-party-shared=0 --with-cc=mpicc --with-cxx=mpicxx --with-c++-support --with-fc=mpifort --CFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --CXXFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC -DOMPI_SKIP_MPICXX -DMPICH_SKIP_MPICXX" --FFLAGS="-O2 -ftree-vectorize -march=core-avx2 -fno-math-errno -fPIC" --with-mpi=1 --with-build-step-np=8 --with-shared-libraries=1 --with-debugging=0 --with-pic=1 --with-x=0 --with-windows-graphics=0 --with-scalapack=1 --with-scalapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/scalapack/2.1.0/lib/libscalapack.a,libflexiblas.a,libgfortran.a]" --with-blaslapack-lib="[/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/flexiblas/3.0.4/lib/libflexiblas.a,libgfortran.a]" --with-hdf5=1 --with-hdf5-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/hdf5-mpi/1.10.6 --with-fftw=1 --with-fftw-dir=/cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/MPI/gcc9/openmpi4/fftw-mpi/3.3.8
[0]PETSC ERROR: #1 User provided function() at unknown file:0
[0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
In: PMI_Abort(59, N/A)
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
srun: error: nc10536: task 0: Exited with exit code 16
Could it be an issue with the configuration scripts?
mathieumorlighem
Ouch! That's not good. Basically, your installation of ISSM on the cluster does not work (you have a segmentation fault). It probably comes from a library conflict or something like that. Has anybody been able to install ISSM successfully on this cluster?
DominoAJones
Hi,
I'm also running into the missing outlog, outbin, and errlog issue.
I've tried the loadonly suggestion:
loadonly = 0;
%Make sure jobs are submitted without MATLAB waiting for job completion
md.settings.waitonlock = 0;
md.cluster.interactive = 0; %only needed if you are using the generic cluster
md.miscellaneous.name = 'KNS';
%Submit job or download results, make sure that there is no runtime name (that includes the date)
md=solve(md,'Stressbalance','runtimename',false,'loadonly',loadonly);
%Save model if necessary
loadonly = 1;
%Make sure jobs are submitted without MATLAB waiting for job completion
md.settings.waitonlock = 0;
md.cluster.interactive = 0; %only needed if you are using the generic cluster
md.miscellaneous.name = 'KNS';
%Submit job or download results, make sure that there is no runtime name (that includes the date)
md=solve(md,'Stressbalance','runtimename',false,'loadonly',loadonly);
md=loadresultsfromcluster(md);
save KNS md
I cannot find any errlog, outlog, or outbin on the supercomputer (so there must be something else wrong, maybe with the way I compiled it?). No one else has used ISSM on this cluster. Would love any advice I can get!
mathieumorlighem
Hi DominoAJones,
what do you see printed to the screen?
Also, make sure to only run the call with 'loadonly',1 once the run is completed. I am not sure if you are running this locally or on a remote machine, but you have to wait until the .lock file is written (which indicates that the run is finished); see the sketch below.
All the best
Mathieu
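Putting this advice together with the script above, the key change is to split it into two separate runs instead of submitting and loading back-to-back; a sketch, keeping the names from DominoAJones's post:
% Run A: submit only, then let MATLAB return
md.settings.waitonlock = 0;
md.cluster.interactive = 0;   % only needed for the generic cluster
md.miscellaneous.name = 'KNS';
md = solve(md, 'Stressbalance', 'runtimename', false, 'loadonly', 0);
% Run B (executed later, once squeue shows the job finished or KNS.lock exists on the cluster)
md = solve(md, 'Stressbalance', 'runtimename', false, 'loadonly', 1);
save KNS md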