How do I achieve the best throughput for many parallel jobs?

I need to submit 5,000 parallel jobs. Each job takes 10 hours to complete on a single core. I parallelized the code and it runs much faster on many cores (it contains an embarrassingly parallel loop with one million iterations). However, I noticed that my waiting time in the queue is much longer than when I request only one core per job. How should I determine the best number of cores to request for my jobs so that they start and finish as fast as possible?

The answer to this question depends on several factors, including the following (a sketch of how to check them on a typical cluster follows the list):

  1. Is there a limit to the number of jobs a user can submit on your cluster?
  2. What is the time limit for jobs on your cluster?
  3. Are there special queues dedicated to running parallel jobs?
  4. How much memory does each of your jobs use?
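
You can usually inspect these limits yourself. Below is a minimal sketch, assuming a SLURM-based cluster (other schedulers such as PBS or LSF have analogous commands, and the partition name `batch` is a placeholder for your site's partition):

```bash
# Per-partition time limits, node counts, and cores per node
sinfo -o "%P %l %D %c"

# Quality-of-service records, which often carry per-user job limits
sacctmgr show qos

# Full details for a specific partition (replace "batch" with yours)
scontrol show partition batch
```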

Based on the above, you should adjust your submission script. If, for example, you have a limit of 250 jobs per user, the maximum walltime per job is 5 days, and the cluster has nodes with 20 cores, then you might want to submit 250 jobs where each job requests a full 20-core node and runs 20 instances of your script in parallel. That covers all 250 × 20 = 5,000 tasks, and since each single-core task takes about 10 hours, every job should finish in roughly 10 hours, comfortably within the 5-day limit.
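
As an illustration, here is a minimal sketch of such a submission script, assuming a SLURM scheduler and a single-core worker program `./my_task` that takes a task index as its argument (the scheduler, the program name, and the 12-hour walltime margin are all assumptions; adapt them to your site):

```bash
#!/bin/bash
#SBATCH --job-name=many-tasks
#SBATCH --array=0-249        # 250 array jobs, staying under the per-user limit
#SBATCH --nodes=1            # one full node per job
#SBATCH --ntasks=20          # 20 single-core tasks per 20-core node
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00      # each task takes ~10 h on one core; leave some margin

# Each array job runs 20 instances of the worker in parallel,
# covering task indices ARRAY_ID*20 .. ARRAY_ID*20+19 (5,000 tasks in total).
for i in $(seq 0 19); do
    task_id=$(( SLURM_ARRAY_TASK_ID * 20 + i ))
    ./my_task "$task_id" &   # launch in the background on its own core
done
wait                         # keep the job alive until all 20 tasks finish
```

A job array like this also tends to queue well: each array element is an independent single-node job, so the scheduler can backfill elements into idle nodes one at a time instead of waiting for many nodes to free up at once.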

In any case, I would discuss your particular situation with your cluster's system administrators; they will be able to give you the best recommendation for your setup.