Introducing myself - Lakitha Omal Harindha Wijeratne

I am a doctoral student in the Department of Physics working with Prof. David Lary. I currently perform experiments to calibrate environmental sensors using big data and machine learning algorithms. Most of my time is spent designing and building such sensor systems. As these systems are being deployed across the DFW metroplex, I am focusing on studying and modelling air pollution using the data they collect.
Our research group under Prof. Lary revolves around Multi-scale Integrated Sensing and Simulation (MINTS), a multidisciplinary platform for developing intelligent sensing systems. The MINTS initiative is designed to provide commanders, environmental officers, intelligence officers, physicians, and the general public with actionable insights and situational awareness using data gathered at multiple spatial and temporal scales. As a member, I follow the same ideals of public service and public awareness.

Experience with HPC
My experience with HPC is quite limited, and I am hoping TRECIS will help me develop the skills to streamline and enhance some of the systems I mentioned above using HPC.

My Workflow
Currently I am in the process of calibrating sensor systems developed at UTD against recommended reference sensors. Most of my code is written in MATLAB, with some aspects, such as file monitoring, in Python. Typically I run my code on a local machine using a small subset of the data; once the code is tested, I port it to a more capable machine that can handle heavy data loads.

Challenge
Most of the target outputs of our algorithms (such as PM2.5) depend heavily on climate data such as temperature and pressure. As such, it is essential that we calibrate for at least 6 months (to cover both summer and winter climate conditions). This means we have to develop a system that evolves over time. My current challenge is to design a system that pushes out up-to-date calibrated data and evolves over time, using Gaussian Process Regression (GPR) in MATLAB.


Hello Lakitha! Welcome to the TRECIS ask.ci community.


Hey Lakitha,

MATLAB and Miniconda are available on Europa:

[csim@europa ~]$ ml load miniconda
[csim@europa ~]$ which conda
/opt/ohpc/pub/unpackaged/apps/miniconda/4.8.3/bin/conda
[csim@europa ~]$ ml load matlab
[csim@europa ~]$ which matlab
/opt/ohpc/pub/unpackaged/apps/matlab/R2020a/bin/matlab

At this time, MATLAB is only available on the compute nodes. To use MATLAB interactively, you can:

[css180002@europa ~]$ ml load matlab
[css180002@europa ~]$ srun -N1 -t 30:00 --pty /bin/bash
[css180002@compute-1-1-1 ~]$ matlab
MATLAB is selecting SOFTWARE OPENGL rendering.

                                                                < M A T L A B (R) >
                                                      Copyright 1984-2020 The MathWorks, Inc.
                                                  R2020a Update 1 (9.8.0.1359463) 64-bit (glnxa64)
                                                                   April 9, 2020

 
To get started, type doc.
For product information, visit www.mathworks.com.

Thank you Professor,
As recommended, I will try the calibration scripts on individual nodes. The goal is to run each individual calibration on a unique node. I have set up calibration scripts for 15 unique sensors. I am new to Slurm, and I am trying to run 15 separate MATLAB scripts on 15 unique nodes concurrently. Can you give some guidance on that? If you have documentation on how to work with Slurm scripts, that would be of great help. I remember you giving a brief tutorial on Slurm in the introduction to Europa session. If you have a recording of that, would it be possible to share it on ask.ci?
My Best,
Lakitha

Hello Lakitha, for working with Slurm scripts you should familiarize yourself with the following online user manual:
http://docs.oithpc.utdallas.edu/
It's a good one to begin with.
-Sasmita


Thank you.

Slurm Script Update
My goal in running Slurm scripts for MATLAB was to run 15 concurrent MATLAB scripts on 15 separate nodes. The following is what I managed to get done. I wrote 15 separate Slurm scripts similar to the following:
utdNodesRun01.sh

#!/bin/bash

#SBATCH -J utdNodes01    # Job name
#SBATCH -o job.%j.out    # Name of stdout output file (%j expands to jobId)
#SBATCH -N 1             # Total number of nodes requested
#SBATCH -n 16            # Total number of mpi tasks requested
#SBATCH -t 24:00:00      # Run time (hh:mm:ss) - 24 hours

ml load matlab
matlab -nodesktop -nodisplay -nosplash -r 'try utdNodesOptSolo2(1); catch; end; quit'

The only difference in each script was the job name index and the argument to the MATLAB function.
I then created a shell script to submit all the jobs to Slurm at once:
runSlurmScripts.sh

sbatch utdNodesRun01.sh
sbatch utdNodesRun02.sh
sbatch utdNodesRun03.sh
sbatch utdNodesRun04.sh
sbatch utdNodesRun05.sh
sbatch utdNodesRun06.sh
sbatch utdNodesRun07.sh
sbatch utdNodesRun08.sh
sbatch utdNodesRun09.sh
sbatch utdNodesRun10.sh
sbatch utdNodesRun11.sh
sbatch utdNodesRun12.sh
sbatch utdNodesRun13.sh
sbatch utdNodesRun14.sh
sbatch utdNodesRun15.sh

I am writing to ask if there is a more elegant way of doing this, and how I can have it run on a daily basis without any intervention, something similar to a cron job.
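For reference, something like this hypothetical crontab entry is the kind of automation I have in mind (the path is just an example):

# Hypothetical: submit all calibration jobs every day at 06:00
0 6 * * * cd /path/to/calibration/scripts && ./runSlurmScripts.sh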

Hello Lakitha,
I'm not sure about the MATLAB side, but I would modify runSlurmScripts.sh as below:

#!/bin/bash
# seq -w pads the index with zeros to match utdNodesRun01.sh ... utdNodesRun15.sh
for i in $(seq -w 1 15); do
    sbatch utdNodesRun$i.sh
done
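Or equivalently, in bash 4 or later, brace expansion zero-pads the index for you:

for i in {01..15}; do sbatch utdNodesRun$i.sh; done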

-Sasmita

Hey Guys,

Sorry, been traveling.

Lakitha,

There are two general ways to go about solving this problem: using SLURM job arrays, or using a tool alongside SLURM to manage the batch of jobs.

For job arrays, see the example sketched below and the post I'll link shortly.

For the second class, the two most common tools are probably TACC launcher and GNU parallel.

https://github.com/TACC/launcher is TACC launcher. Tomorrow I will install launcher on Europa as a module, but you can also download and use it without needing admin access to set it up.
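If you want it before then, a user-level setup looks roughly like this (the install location here is just an example):

# Clone launcher into your home directory; no admin access needed
git clone https://github.com/TACC/launcher.git $HOME/launcher
export LAUNCHER_DIR=$HOME/launcher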

I’ll also write up an example on using job arrays and an example on launcher and post it here as soon as I can.
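In the meantime, a minimal sketch of the job-array version of your 15 submissions (assuming utdNodesOptSolo2 takes the sensor index as its argument, as in your script) would be:

#!/bin/bash
#SBATCH -J utdNodes
#SBATCH -o job.%A_%a.out    # %A = array job ID, %a = array task index
#SBATCH -N 1
#SBATCH -t 24:00:00
#SBATCH --array=1-15        # one array task per sensor

ml load matlab
# Each array task runs one calibration, indexed by SLURM_ARRAY_TASK_ID
matlab -nodesktop -nodisplay -nosplash -r "try; utdNodesOptSolo2(${SLURM_ARRAY_TASK_ID}); catch; end; quit"

One sbatch of this script replaces all 15 separate submissions.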

Best,

Chris


@lhw150030

Just wrote a post on using slurm job arrays.

HTC using SLURM Job Arrays

If you’d like any help converting your workloads to use SLURM job arrays, let me know on that thread and I’ll answer there and gradually turn that thread into a wiki page / better documentation.

@solj and I were talking about launcher earlier today as well. One of us will get it installed and a module set up today.

Thank you. I will use this.


Thank you Professor.

TACC launcher is now available as well:

$ module load launcher

Will be a day or so until I can get some documentation posted.


Hi Professor, I am very new to launcher, and I'm trying to use it to run multiple MATLAB files on one node.
Here is my Slurm file:

#!/bin/bash
#Simple SLURM script for submitting multiple serial
#jobs (e.g. parametric studies) using a script wrapper
#to launch the jobs.
#
#
#SBATCH -J
#SBATCH -N 1
#SBATCH -n 16
#SBATCH -p development
#SBATCH -o Parametric.%j.out
#SBATCH -e Parametric.%j.err
#SBATCH -t 00:15:00
#SBATCH -A <account_name>

module load launcher
export LAUNCHER_WORKDIR=my_work_directory
export LAUNCHER_JOB_FILE=my_matlab_file.m

$LAUNCHER_DIR/paramrun

I tried this setup to run a text file, as you showed before, and it works, but it cannot run a MATLAB file this way. I would greatly appreciate it if you could guide me on running MATLAB files with launcher.

Hey @arya073,

Are you trying to run on the Qiwei partition on Ganymede? Is this part of the testing for Dr. Zare? If so, my answer will differ (slightly) from how to do it on Europa (and I can give you a specific answer as opposed to a more generic introduction to launcher).

Best,
Chris

Hello Professor. Yes, I'm working with the Qiwei partition on Ganymede with Dr. Zare.

Ok, so on Ganymede (and on UTD systems) we don't use the -A flag.

As written, you're submitting to the development partition (and those nodes only have 16 cores), but what you want to do is submit to the Qiwei partition.

The two nodes in the Qiwei partition are 96 cores each.

In the jobfile you do not want to put your MATLAB script. You want to put a text file that lists the commands to launch, using shell/bash syntax.

So, let's say you wanted to run multiple MATLAB jobs that each use 16 cores on a single Qiwei partition node. First, you'd create your SLURM batch script to look something like below:

Note: 96/16 = 6 tasks per node (-n 6)

#!/bin/bash
#Simple SLURM script for submitting multiple serial
#jobs (e.g. parametric studies) using a script wrapper
#to launch the jobs.
#
#
#SBATCH -J <job_name>
#SBATCH -N 1
#SBATCH -n 6
#SBATCH -p Qiwei
#SBATCH -o Parametric.%j.out
#SBATCH -e Parametric.%j.err
#SBATCH -t 48:00:00

module load matlab
module load launcher
export LAUNCHER_WORKDIR=<work_directory>
export LAUNCHER_JOB_FILE=jobfile

$LAUNCHER_DIR/paramrun

Then, your jobfile would look something like below:

matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script1.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script2.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script3.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script4.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script5.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script6.m'); exit"

Finally, in each of your MATLAB scripts (scriptn.m) you need to tell MATLAB it is only allowed to use 16 computational threads.
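One common way to do that is MATLAB's maxNumCompThreads, either at the top of each script or directly on the command line in the jobfile, e.g.:

# Cap this MATLAB instance at 16 computational threads before running the script
matlab -nodesktop -nodisplay -nosplash -r "maxNumCompThreads(16); run('/path/to/script1.m'); exit"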

You should play with the number of simultaneous jobs you run and adjust the number of threads you allow each MATLAB job to use:

Try 6 MATLAB jobs each using 16 cores
Try 12 MATLAB jobs each using 8 cores
Try 24 MATLAB jobs each using 4 cores
etc.

I’ll be on a teams call with your PI on Monday. If you’d like to join us, we can discuss in much more detail.

Good luck and happy computing!

Chris

Thank you very much Professor. I have used your guidance, and I have faced a few other issues that I wanted to share with you.

1. I wanted to use this scheme on TACC, and as you said, I wrote a text file for my launcher such as the following:

module load matlab
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script1.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script2.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script3.m'); exit"
matlab -nodesktop -nodisplay -nosplash -r "run('/path/to/script4.m'); exit"

The problem is that launcher does not run the lines simultaneously. Instead, it stays on the first line until it finishes, and only then moves on to the others.

2. The other issue is that I used the Slurm file for Ganymede as you instructed, but it still shows 16 cores, and therefore it is not on a Qiwei node. My Slurm file is this:

#!/bin/bash
#Simple SLURM script for submitting multiple serial
#jobs (e.g. parametric studies) using a script wrapper
#to launch the jobs.

#SBATCH -J
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p Qiwei
#SBATCH -o Parametric.%j.out
#SBATCH -e Parametric.%j.err
#SBATCH -t 00:05:00

module load launcher
export LAUNCHER_WORKDIR=/home/sxa190130/NM_project
export LAUNCHER_JOB_FILE=my_test.txt

$LAUNCHER_DIR/paramrun

I just wanted to run a test to see if my code works, so I set the run time to be short (5 minutes) and -n to 1.
Thank you very much for your help
Arya

That is because that's what you asked launcher to do. :)

Ok, several things. First, the module load matlab command goes in your Slurm script, not in the jobfile. Second, it ran one at a time because you used -n 1 (and thus you asked it to run one task at a time).

I assure you the Qiwei nodes have 96 cores.

Move the module load matlab line into your sbatch script and remove it from your launcher jobfile. Then set -n to a number greater than 1; this controls how many tasks run at one time.

If you ask for 1 node and 1 task, it’s going to run one at a time.

Set -n to the number of tasks you want to run at the same time. I assure you the example I wrote above works.
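Putting that together, a corrected version of your script (using the work directory and jobfile from your post) would look something like:

#!/bin/bash
#SBATCH -J <job_name>
#SBATCH -N 1
#SBATCH -n 6                  # run up to 6 jobfile lines at a time
#SBATCH -p Qiwei
#SBATCH -o Parametric.%j.out
#SBATCH -e Parametric.%j.err
#SBATCH -t 00:05:00

module load matlab            # load MATLAB here, not in the jobfile
module load launcher
export LAUNCHER_WORKDIR=/home/sxa190130/NM_project
export LAUNCHER_JOB_FILE=my_test.txt

$LAUNCHER_DIR/paramrun

With the module load moved here, my_test.txt should contain only the matlab command lines.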

Thank you very much Professor