DUO for Multi-Factor Authentication on HPC Systems

We’re planning to implement multi-factor authentication (DUO, specifically) on our HPC systems. It seems like it should be relatively straightforward if user authentication at login (via the head/login nodes) is the only requirement. Does anyone have experience they can share? Also, we use SLURM as our scheduler; can compute nodes be configured to bypass authentication? Is this even necessary? And how is file transfer affected?

We use DUO on our login (and head) nodes. Since DUO is configured and enabled as a PAM module on a per-node basis, you simply omit Duo from your compute nodes. Connecting to compute nodes can be done using SSH keys or whatever other mechanism you need.
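For context, a minimal sketch of what that per-node difference looks like in PAM (file names and ordering vary by distro; this assumes duo_unix's pam_duo):

```
# /etc/pam.d/sshd on LOGIN nodes (compute nodes simply omit the pam_duo line)
auth  required  pam_unix.so
auth  required  pam_duo.so
```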

I’ll be interested in any experiences from folks who have rolled out this tech, as we will also be rolling out MFA with Duo in the future.

Any file transfers will need to be authenticated using DUO. If for some reason this is not an option, the specific account can be excluded from DUO authentication by an administrator.
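At the PAM level, one way to do that exclusion is to skip pam_duo for a designated group (the group name duo-exempt here is made up; pam_duo's own groups option in pam_duo.conf is another way to get the same effect):

```
# /etc/pam.d/sshd (sketch): members of "duo-exempt" skip the next module
auth  [success=1 default=ignore]  pam_succeed_if.so  user ingroup duo-exempt
auth  required                    pam_duo.so
```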


(NOTE: I’m going to use the term “two-step”, as most of what I say doesn’t apply just to Duo. But in our case, yes, we are using Duo).

Our compute environments all use SLURM. And we do indeed use two-step. We have several compute environments, but I’m going to focus on two. I’m also going to stick with technical stuff first.

For Sherlock, users may only connect to the login nodes directly (and admins to the login & head nodes). We enforce two-step here, via the pam_duo PAM module. Two-step is run after doing password (or GSSAPI) auth. It works fine. In the last few years of doing this, I can’t remember any Duo outage (which means the ones they’ve had have all been too short to remember). But it’s worth noting that we have a very good connection out to the Internet.
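To sketch how "password, then two-step" can be wired up in sshd (illustrative, not our exact config; keyboard-interactive is what hands the session to PAM, where pam_unix and then pam_duo run in order):

```
# /etc/ssh/sshd_config on the login nodes (sketch; the GSSAPI path needs
# more care, e.g. AuthenticationMethods plus a PAM stack that runs only pam_duo)
UsePAM yes
PasswordAuthentication no              # push passwords through PAM instead
ChallengeResponseAuthentication yes    # keyboard-interactive: pam_unix, then pam_duo
```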

When a user runs an interactive SLURM job, SLURM handles connecting the user to the compute node. And for all jobs, users may SSH to any compute node running one or more of their jobs. When someone SSHes to a compute node, we use host-based auth to accept logins from the login nodes, and we use pam_slurm_adopt to put them into the SLURM job’s cgroups (or kick them off if they don’t have a job). It’s important to note that Sherlock compute nodes are separated from the rest of the campus network, so it is not possible to connect to them except through a login node (or a head node).
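Sketching the compute-node side of that (host-based trust plus the adopt module; pam_slurm_adopt is an account-phase module, and action_no_jobs=deny is its "kick them off" behavior):

```
# /etc/ssh/sshd_config on the compute nodes (sketch)
HostbasedAuthentication yes   # accept logins from the login nodes (shosts.equiv etc.)

# /etc/pam.d/sshd on the compute nodes (sketch)
account  required  pam_slurm_adopt.so  action_no_jobs=deny
```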

Sysadmins SSH to the head nodes using a different username, meant for root-level work. Auth is via GSSAPI (using a separate principal from our normal Kerberos principal). And two-step is applied at the time sudo is run. Doing it this way allows us to use a separate (and more-complex) password for root-level work, and the two-step at sudo-time means we can have a separate Duo configuration for sudo. This lets you, for example, point sudo two-step at a different Duo instance, or use a different fail-open/fail-secure setting for sudo.
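Concretely, pam_duo accepts a conf= argument, which is what makes a separate Duo configuration for sudo possible (keys and hostname below are placeholders):

```
# /etc/pam.d/sudo (sketch)
auth  required  pam_unix.so
auth  required  pam_duo.so conf=/etc/duo/pam_duo_sudo.conf

# /etc/duo/pam_duo_sudo.conf (sketch; a separate Duo application)
[duo]
ikey = <integration key for the sudo application>
skey = <secret key>
host = api-XXXXXXXX.duosecurity.com
failmode = secure   # fail closed for root-level work ("safe" would fail open)
```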

That’s Sherlock. FarmShare is a little different, at least for the users. For the admins, it’s the same: A separate account for logging in, followed by running sudo, with Duo called at sudo-time.

On FarmShare, the compute nodes are connected to the same network as the login nodes. We allow users to SSH directly to compute nodes, if they have a job running. This is really helpful for people who want to do things like Xvnc, because it means not having to tunnel through a login node. Because we allow SSH to the compute nodes from anywhere, they authenticate the same way as the login nodes: GSSAPI or password, followed by two-step, plus pam_slurm_adopt to check for a running job. All of that only applies if you SSH after starting a job; if you use srun --pty /bin/bash (or the like) to get an interactive job, that will connect you to the compute node with no additional authentication or two-step.
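Put together, a FarmShare compute node's stack is roughly the login-node stack plus the job check (again a sketch; real stacks have more in them):

```
# /etc/pam.d/sshd on FarmShare compute nodes (sketch)
auth     required  pam_unix.so         # password (GSSAPI is handled by sshd itself)
auth     required  pam_duo.so          # same two-step as the login nodes
account  required  pam_slurm_adopt.so  # ...but only if you have a job on this node
```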

One thing to note: At this time, there is a weird interaction between two-step, pam_slurm_adopt, cgroups, and systemd, which can cause issues. See SLURM bug 5026. You should be prepared to disable pam_systemd on the compute nodes, if you haven’t already.
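If you do hit it, the workaround is just commenting out the pam_systemd line in whichever session stack your sshd PAM config includes (the exact file varies by distro):

```
# Compute nodes: e.g. /etc/pam.d/password-auth (EL) or
# /etc/pam.d/common-session (Debian/Ubuntu)
#session  optional  pam_systemd.so
```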

The note from jpedro is important to consider, as well. Enabling two-step on your cluster, depending on how you do it, can mean that all transfers (SCP, rsync, etc.) would require Duo.
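One client-side mitigation worth knowing about (independent of Duo; this is plain OpenSSH connection multiplexing) is to authenticate once and reuse that connection for subsequent scp/rsync runs; the hostname below is a placeholder:

```
# ~/.ssh/config on the user's machine: one login (and one Duo push),
# then later ssh/scp/rsync to the same host reuse the open connection
Host login.cluster.example.edu
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 4h
```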

To deal with data transfers, we have a dedicated DTN (data transfer node), which does not require two-step, and which also does not allow users to get a shell. It only supports SCP, SFTP, rsync, bbcp, and Globus.
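For what it's worth, the "no shell, transfers only" behavior can be approximated with an allow-list login shell; the sketch below (a hypothetical /usr/local/bin/transfer-only, not our exact setup) shows the idea, though a production version needs much stricter argument checking:

```bash
#!/bin/bash
# /usr/local/bin/transfer-only -- set as the users' login shell on the DTN.
# sshd invokes the login shell as: shell -c "<original command>", so $1 is
# "-c" and $2 is the command; an interactive login arrives with no arguments.
case "$2" in
    scp\ *|rsync\ --server*|bbcp\ *|*/sftp-server*)
        exec /bin/bash -c "$2" ;;   # run the allow-listed transfer command
    *)
        echo "This host is for file transfer only (SCP/SFTP/rsync/bbcp/Globus)." >&2
        exit 1 ;;
esac
```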

If a user decides to do a file transfer to a login node, we don’t explicitly block it, but users doing data transfer via a login node eventually reach out to us, asking for a way around two-step. Or their transfers get killed by login-node restrictions on CPU-intensive programs (on Sherlock) or by cgroup-based login-node CPU limits (on FarmShare). It’s at that point we tell them about the DTN (unless they learn about it themselves, via our docs).

Now, the non-technical side.

Assuming you are proposing a change to an existing environment, you are probably going to be making some people very unhappy. People who might’ve been doing automated logins (with SSH keys, or Kerberos keytabs, or whatever) won’t be able to do that anymore. And, in general, it will take longer for people to log in. You need to be prepared for the pushback, and that begins by having a firm policy to ground your decisions.

For us, we are operating under the Minimum Security Standards for Servers, specifically the standard that we “Require Duo two-step authentication for all interactive user and administrator logins.” This is not a policy we made, and so our ‘out’ is to refer people who complain to the Information Security group. Make sure you know who PIs can reach out to, if they want to complain about the policy.

It’s worth noting that our compute environments are just a few of the many services that moved to two-step over the last few years. We weren’t the first to start using two-step, so our users had already done the enrollment work, and although this didn’t eliminate complaints, I do think it helped to reduce them.

Finally, we also look for areas where we can increase convenience while staying within policy. For example, we do not have two-step on the DTNs because we don’t view file-transfer as an “interactive…login”. As the DTNs are separate systems, we can use PAM, SSSD, and the like to enforce the “no shell” rule very well.

I think that’s it! If you’re interested in discussing more, feel free to stop by our booth at SC19!


We use DUO on login nodes. Compute nodes are not affected.

We had a transition period where folks who had not signed up for DUO could use a login node that did not have DUO enabled; which node that was, was shared on a ‘need to know’ basis.

Folks using certain FTP programs (FileZilla and Cyberduck, I think) had to go in and change some settings to avoid having to use DUO after every single file transfer. It is in the documentation for each; if you can’t find that, I can dig up which preferences to open and which settings to check, but it’s probably better to wait, because those programs are being updated to be more MFA-friendly.