How many resources should I allocate for a typical Gaussian calculation? How do I assign them?
Note that this assumes you are running common density functional theory (DFT, e.g. B3LYP) or Møller-Plesset (MP2) methods. Gaussian does not scale well for coupled cluster (e.g. CCSD) or configuration interaction (CISD) methods.
For DFT methods, Gaussian will scale well up to 16 cores, with diminishing returns (or even losses!) past this point.
Memory allocations will depend on the size of your molecules. Large systems or systems that contain heavy atoms (more electrons) will require more memory. 256-1024 MB per CPU is generally the optimal range (for example, 16 cores × 512 MB ≈ 8 GB, which is the allocation used below). Note that using too little or too much total memory can also be detrimental to the calculation speed.
Here are the results of running a geometry optimization with DFT (B3LYP/6-31+G(d,p)) on a small sample molecule (aspirin) on a cluster:
In your Gaussian input file, you can specify these parameters with
%nprocshared=16
%mem=8gb
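For context, these Link 0 lines sit at the very top of the input file, before the route section. The sketch below shows where they go; the checkpoint name, route line, and molecule section are placeholders for whatever job you are actually running:
%nprocshared=16
%mem=8gb
%chk=aspirin_opt.chk
# B3LYP/6-31+G(d,p) Opt

Aspirin geometry optimization

0 1
...Cartesian coordinates...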
You should also pass these parameters to the cluster in the batch file with:
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 16
#SBATCH --mem=9000
Since %nprocshared uses shared-memory threads, all of the cores must sit on a single node (hence -N 1 and -c 16 above). Note also that it is best to allocate slightly more memory on the cluster than Gaussian is instructed to use (~9 GB instead of 8 GB in this case), because %mem covers Gaussian's working arrays but not all of the process's overhead.
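Putting the two pieces together, a minimal submission script might look like the sketch below; the module name, scratch variable, and file names are assumptions that differ from cluster to cluster:
#!/bin/bash
#SBATCH -N 1                  # shared-memory Gaussian: keep everything on one node
#SBATCH -n 1
#SBATCH -c 16                 # matches %nprocshared=16
#SBATCH --mem=9000            # slightly more than %mem=8gb
#SBATCH -t 24:00:00

module load gaussian          # or however your cluster provides Gaussian
export GAUSS_SCRDIR=$SCRATCH  # send Gaussian's scratch files to fast scratch storage
g16 < aspirin_opt.gjf > aspirin_opt.log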
Final notes:
If you’re using a cluster that has very small or heavily used nodes, you may wish to restrict the resources allocated, as running your Gaussian calculation on one node is virtually always advantageous compared to running the calculation across multiple nodes.
Additionally, requesting fewer resources will allow your job to start more quickly and makes it less likely to be interrupted by higher-priority jobs.
So, I need to open my mouth here, because what the previous user has said promotes some misinformation that very nearly maligns Gaussian. And, more than that, I’ve seen this piece of misinformation parroted on ResearchGate, which means that people are looking at this page, reading it, and then using it as if it were the gospel truth.
Truth of the matter is that Gaussian scales perfectly well to 24 CPUs, or even 24 CPUs on each of a thousand nodes in a cluster… if you’re asking the right question. What the previous user wrote was a naive benchmark on a relatively small molecule (aspirin), and he has scaled that result up to assume that every other molecule will behave exactly the same way, maxing out performance at about 16 CPUs. This is completely false!
To understand why this is false, people should think a little more deeply about how an SCF calculation works. In SCF calculations, both Hartree-Fock and DFT, a series of equations dependent on the molecule is solved to self-consistency: you make a guess at an initial state, run an involved calculation based on that initial state to produce a final state, and then use the final state as the next initial state. You repeat the loop over and over, comparing the initial state to the final state, until the two become so similar as to be indistinguishable. At that point, you call the calculation converged and you stop.

With Hartree-Fock (and post-Hartree-Fock), the involved calculation is a huge number of one- and two-electron integrals, while with DFT it is a complicated density functional evaluated on a grid, which is then projected into a basis set. Both styles of calculation depend on a basis set that contains N functions. For Hartree-Fock, the work scales as roughly N^4; for DFT it depends on the functional, but is something like N^3 or N^4; for post-Hartree-Fock methods it is steeper still (roughly N^5 for MP2, N^6 or worse for CISD and CCSD). In other words, the cost is hugely non-linear in the size of the molecule and the quality of the calculation you wish to use.
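To make that loop structure concrete, here is a minimal sketch of the iterate-until-self-consistent pattern in Python. It is purely illustrative: build_new_state stands in for the expensive integral or grid work, and the threshold and array representation are arbitrary assumptions, not anything taken from Gaussian.
import numpy as np

def scf_loop(build_new_state, initial_guess, tol=1e-8, max_cycles=100):
    """Generic SCF pattern: feed the output back in as input until it stops changing."""
    state = initial_guess
    for cycle in range(max_cycles):
        new_state = build_new_state(state)            # the expensive step (integrals / grid work)
        if np.linalg.norm(new_state - state) < tol:   # input and output are indistinguishable
            return new_state, cycle                   # converged
        state = new_state                             # recycle the output as the next guess
    raise RuntimeError("SCF did not converge within max_cycles")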
Now, the computational cost of an SCF job splits into roughly two things that take time. The first is that the computer has to establish and organize the calculation; call this the overhead part. The second is that it has to turn the crank; call this the computation part. The expense of the overhead part is very dependent on the amount of resources you have to marshal for performing the computation; the time taken may be something close to linear in the number of processors (or nodes) you have to subdivide the task to fill. The expense of the computation, on the other hand, has to do with the size of the question you’re asking, and that scales at around N^4 in the size of the basis set, N.
You reach diminishing returns in parallel computing on this kind of task when you subdivide the computation sufficiently that the time spent computing on individual CPUs is roughly the same as the time required to marshal the whole collection of CPUs into action. Note that the computation part scales far faster than the overhead for quantum chemistry. For a molecule as small as aspirin, with a data bus, CPU, and memory architecture similar to the previous poster's, the roll-over into diminishing returns is apparently 16 CPUs: with aspirin and whatever basis our esteemed colleague pulled out, at 16 CPUs you spend more time setting up processors than you do cranking out integrals. If you push the basis from 3-21G to cc-pVDZ to aug-cc-pV5Z on aspirin (or go from B3LYP to CAM-B3LYP), you’re going to find the roll-over point moving to more processors, 40 or 100, across maybe multiple nodes. I’ve taken a two-week job from a laptop with 6 processors and employed 10 nodes with 24 processors apiece in a cluster to crank through the job in a couple of hours; that’s a factor of about 40 to 80 improvement in time, not the factor of ~2.7 you would expect from 16 vs. 6 cores, so definitely not diminishing returns after 16 processors.
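To see why the roll-over point moves with problem size, here is a toy model in Python with made-up constants chosen only to illustrate the trend (not a benchmark of Gaussian or any real machine): wall time is a setup term that grows with the number of processors plus an N^4 compute term that divides across them.
# wall_time = overhead that grows with processor count + N^4 work shared across processors
def wall_time(n_basis, n_procs, overhead_per_proc=1.0, work_const=1.6e-7):
    overhead = overhead_per_proc * n_procs           # marshaling cost: roughly linear in processors
    compute = work_const * n_basis**4 / n_procs      # crank-turning cost: ~N^4, perfectly divided
    return overhead + compute

for n_basis in (200, 800, 3200):                     # small, medium, large basis sets
    best = min(range(1, 1025), key=lambda p: wall_time(n_basis, p))
    print(f"N = {n_basis:5d}: wall time is minimized around {best} processors")
With these arbitrary constants the small problem bottoms out near 16 processors, while the larger ones keep gaining out to hundreds of processors or more, which is exactly the behavior described above.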
Point is, if you go to a bigger molecule with a more complicated question, Gaussian can scale as well as technology allows it to turn the crank. And please, do not parrot what this guy is saying. He isn’t asking the right questions himself. If a task isn’t going well, it doesn’t always behoove the user to claim the tool doesn’t work… maybe the user just doesn’t know how to use the tool!
Very helpful and clear. How is it possible to determine what changes in the number of cores, memory, and disk space will improve the performance? We are running MP2 optimizations, 6-31++G(d,p), on lots of molecules of varying sizes. We care about efficient use of the available nodes: e.g. run 1 job at a time at the max 36 cores, or 2 jobs at a time using 18 cores each, or…
Also you mentioned using multiple nodes. Does that mean you use Linda?
Thanks
My experience with MP2 is that it’s extremely expensive; I’ve tried running MP2 frequency calculations on a molecule also optimized with it. The Gaussian freqmem utility can give you an estimate of the amount of memory you should give Gaussian for a particular job. I’ll often start with a Guess=Only job in order to get statistics on the job, such as the number of basis functions for a given number of atoms, which is needed input for freqmem. But with MP2 this can end up being a bit of a security blanket. It also turned out to be really important to tell Gaussian the size of my scratch disk space using the MaxDisk keyword. Gaussian weighs whether to run the calculation as a direct SCF (integrals recomputed on the fly), an in-core calculation (memory intensive, integrals held in memory), a conventional SCF (disk intensive, integrals written to file), or some hybrid of these, based upon the size of the molecule, the amount of memory available, and the size of the disk space available, in order to optimize the performance of the program. I recommend reading the MP2 keyword section on the Gaussian website for detailed hints (they have memory formulas for each different style of MP2 calculation there).
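As a sketch of that preliminary step (the resource numbers, MaxDisk value, and molecule section are placeholders to adapt, not recommendations), a Guess=Only job that reports the basis-set statistics might look like:
%nprocshared=8
%mem=4gb
# MP2/6-31++G(d,p) Guess=Only MaxDisk=100GB

Preliminary pass: report number of basis functions for freqmem

0 1
...Cartesian coordinates of your molecule...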
If you’re interested in efficiency of use, my recommendation is to think about minimizing how much of the optimization you need to perform with MP2. I will frequently do several phases of optimization: use semi-empirical AM1 for a quick pass on the structure, move to DFT with B3LYP/6-31+G(d), then go to something computationally expensive, as sketched below. If you get to a pretty good structure with a less expensive technique first, you don’t need to run the MP2 for as many cycles. To be clear, this doesn’t always work and sometimes can’t be predicted, but it can cut your DFT or MP2 time from 90 cycles to 15, which is useful. With regard to the question of how many cores to how many jobs: if there is no cross-talk, you have a computer of suitably infinite memory and disk space, and you’re setting up these runs strictly by automation to start simultaneously, then running 36 one-core jobs at once on 36 cores is going to be cheaper time-wise than running the jobs one at a time, each on all 36 cores. That depends, of course, on whether you have the throughput to start that; doing it manually will probably make it uneconomical, and for MP2 there’s going to be a minimum amount of other system resources necessary to even make one molecule run, let alone 36.
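As an illustration of that staged approach (file names and the exact route lines are placeholders; each pass still needs its own title and charge/multiplicity lines, and whether reading the old orbitals pays off is system-dependent), the later passes can pull the geometry from the previous pass's checkpoint file:
Pass 1, cheap pre-optimization:
%chk=molecule.chk
# AM1 Opt

Pass 2, reads the AM1 geometry from the checkpoint:
%chk=molecule.chk
# B3LYP/6-31+G(d) Opt Geom=Checkpoint

Pass 3, final refinement starting from the DFT geometry and orbitals:
%chk=molecule.chk
# MP2/6-31++G(d,p) Opt Geom=Checkpoint Guess=Read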
For post-HF methods like MP2, the performance is pretty strongly dependent on every different aspect of the system. If you have the hardware to invoke the InCore option, for instance, it can make the calculation much more painless, but whether you do will depend on the size of your molecule and basis. You can go to Linda to help expand your resource pool, but in my experience not every technique is painlessly Linda-parallel, meaning that you may not be able to straddle multiple nodes to expand your reach.
I’ve run with Linda on up to ten nodes. This can knock a huge amount of time off an optimization or an expensive frequency calculation. Usually you can see whether Linda is economical from how the wall time changes when you use it: if switching from one node to two knocks the time approximately in half, your calculation is expensive enough that the set-up is insignificant compared to the time required to compute. I believe there’s information on the Gaussian website telling which methods are Linda-parallel and which are not.
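For reference, Linda workers are requested in the Link 0 section of the input; the host names below are placeholders (on a SLURM cluster you would normally build this list from the job's node allocation), and %nprocshared then sets the threads used on each worker:
%LindaWorkers=node01,node02,node03,node04
%nprocshared=24
%mem=32gb
Combined with the wall-time check described above, this makes it easy to test whether adding a second node really halves the time before committing ten nodes to a job.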