I'm a postdoc at UT Dallas working on neuroimaging data. My focus is on how large-scale functional brain networks, as measured by MRI, change with aging and with environmental factors.
Experience with HPC: Besides testing some of my work on Ganymede, I've mostly worked with Sun Grid Engine (SGE). So far, though, the scheduler on Ganymede seems fairly easy to pick up and use for serial or parallel jobs, and since it is more actively developed and more widely adopted than SGE, I'm looking forward to using it on other clusters as well.
My Workflow: Typically I process raw neuroimaging data through a preprocessing pipeline on a Linux system that uses Sun Grid Engine for job scheduling. I then usually analyze the data in R, either locally on my Mac or on the same Linux system.
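For reference, here is a stripped-down sketch of how a single subject's preprocessing gets submitted on that SGE system; the script names, core count, and memory number below are placeholders, not our actual values.

    #!/bin/bash
    # preprocess_one_subject.sh -- placeholder wrapper around the preprocessing pipeline
    #$ -N preproc            # job name
    #$ -cwd                  # run from the submission directory
    #$ -pe smp 4             # example core count (real requirement still TBD)
    #$ -l h_vmem=8G          # example memory per slot (real requirement still TBD)
    #$ -o logs/ -e logs/     # stdout/stderr locations

    SUBJECT=$1
    ./run_preprocessing.sh "$SUBJECT"   # hypothetical pipeline entry point

Each subject is then submitted as, e.g., qsub preprocess_one_subject.sh sub-001.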
Challenge: One of the biggest bottlenecks in our workflow is the processing time for each subject's neuroimaging data. Some data take up to 24 hours to process completely, so when we receive archival data with thousands of subjects/sessions, it becomes a long haul to process and QC them all.
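For a rough sense of scale (illustrative numbers only): 1,000 subjects at ~24 hours each is about 24,000 hours of compute, which is nearly 3 years if run one subject at a time, but only about 5 days if ~200 subjects can run concurrently.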
How much of the workflow / pipeline requires manual intervention? Is the full pipeline automated or does it require a human to approve / check / munge data etc?
I know we’ve talked about this a bit but remind me again how many cores / how much RAM you need per “subject”?
IIRC, within a subject there is a strict order of operations, but each subject is independent? That is, there's nothing stopping us from exploiting the embarrassingly parallel part and running 100s or 1000s of subjects at a time?
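If that holds, a job array is the natural fit. Here is a rough sketch, assuming a Slurm-style scheduler on the cluster; the array size, concurrency cap, subject list file, and resource numbers are all placeholders until we know the real requirements.

    #!/bin/bash
    #SBATCH --job-name=preproc
    #SBATCH --array=1-1000%200     # 1000 subjects, at most 200 running at once
    #SBATCH --cpus-per-task=4      # placeholder until we measure real needs
    #SBATCH --mem=8G               # placeholder
    #SBATCH --time=30:00:00        # headroom over the ~24 hour worst case
    #SBATCH --output=logs/%A_%a.out

    # subjects.txt: hypothetical file with one subject ID per line
    SUBJECT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" subjects.txt)
    ./run_preprocessing.sh "$SUBJECT"

Each array task is an independent job, so the per-subject 24 hours doesn't change, but the whole batch finishes in days instead of years.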
I'd say let's start with getting your workflows running on Europa. Then we will look at the scripting / automation to handle the subject parallelization efficiently. Finally, we will look at ways of automating / optimizing the single-subject run time (~24 hours).
Hey Chris, the workflow we have for preprocessing is currently split into 3 parts, and each part is automated with manual QC at the end. We typically don't advance to the next part until after QC, so we don't keep processing the people who fail at earlier steps. (This could be changed, though; nothing stops us from stringing all of them together, along the lines of the sketch below.)
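If we did string them together, job dependencies would do it; a minimal sketch, again assuming a Slurm-style scheduler, with the part1/part2/part3 script names as placeholders:

    # submit part 1 and capture its job ID
    jid1=$(sbatch --parsable part1_preproc.sh sub-001)
    # part 2 starts only if part 1 exits successfully
    jid2=$(sbatch --parsable --dependency=afterok:${jid1} part2_preproc.sh sub-001)
    # part 3 waits on part 2 the same way
    sbatch --dependency=afterok:${jid2} part3_preproc.sh sub-001

QC would then happen on the combined output at the end rather than between parts.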
Let me get back to you about the RAM / storage / cores; that depends somewhat on the type of data, and I'm gathering a test set to measure it with.
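My plan for the test set is just to wrap one subject's run in GNU time so we get peak memory and wall-clock time directly, something like:

    # /usr/bin/time -v (GNU time) reports "Maximum resident set size" (peak RAM)
    # and elapsed wall-clock time on stderr for the wrapped command
    /usr/bin/time -v ./run_preprocessing.sh sub-001 2> sub-001_resources.txt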