Under what conditions should I use MPI to run jobs in parallel?

Is there a general rule for when to consider parallelizing a job? Such as, the amount of data to be processed, or the amount of time it takes to run in serial? Or do other criteria take precedence; even though a calculation takes a really long time to run on one node, does that necessarily mean it will benefit from parallelization?

Probably the most important criterion to consider when deciding whether or not to run a job in parallel is the ease with which the code can be parallelized. Some codes are not conducive to parallelization (for example, codes in which each step depends on the results of the previous one) and are better left to run in serial. But if the code has sections that can be run at the same time on different nodes (that is, sections that have no interdependencies and can independently compute a value to be shared later in the code), it’s probably a good idea to restructure the code to run in parallel. MPI (Message Passing Interface) is the prevalent mechanism for parallelizing an HPC code. As its name suggests, it enables sharing of information among compute nodes (and dispersal of this information to the places in the code that require it). Since more than one value can be calculated simultaneously, in theory the computation should complete in less time than it would in serial form.

I realize this is an old topic, but a follow-on question I’d have to this is:

If I have a small problem that can fit on one node, is it generally better to use something like OpenMP, vs. using MPI? Would OpenMP have less overhead than MPI? Once you go beyond one node, my understanding is that you must use MPI.

I wouldn’t say it’s better exactly, but it is often easier. OpenMP is much easier to add to an existing code – adding MPI to an existing code generally requires much more substantial redesign.

On one node, most MPI implementations should use shared memory to “transmit” the messages, but there is still overhead for preparing and managing the messages.

MPI is one common option for going beyond one node, but there are others.

I’d concur - OpenMP is much easier than MPI because of its shared-memory model, but it can’t span nodes.

For more advanced users, it’s also worth considering things like memory bandwidth or cache requirements. You may not want to use all the cores on a node; instead, using a few cores on each of several nodes can avoid saturating some shared resource on the node (such as the memory bus) and improve net performance.
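As a sketch of that idea using Open MPI’s launcher (the `--map-by` syntax is Open MPI-specific; other MPI distributions use different placement flags, and `./my_app` here is a hypothetical application name):

```shell
# Default placement typically packs ranks densely onto as few nodes
# as possible, which can saturate one node's memory bandwidth:
mpirun -np 8 ./my_app

# Spread the same 8 ranks as 2 per node across 4 nodes, leaving the
# remaining cores (and their share of memory bandwidth) idle:
mpirun -np 8 --map-by ppr:2:node ./my_app
```

Whether the spread-out placement wins depends on how bandwidth-bound the code is, so it’s worth benchmarking both.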