Okay … a user is asking for an R package to be installed on the cluster. Depending on whether this is a one-off library or a popular one, I think it is wise to suggest the following:
if it is a library that is very popular (say, data.table or the tidyverse), we install it in our central software repo;
if it is a one-off package, we ask the user to install it in their home directory (crossing our fingers that this is something straightforward to do with install.packages()).
This strikes a balance between space usage and administrator burden. How does this practice sound to you?
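For the home-directory route, a minimal sketch of what the user-side instructions could look like (the library path, mirror, and package name here are just examples, not a prescription; on first use R will also offer to create a personal library for you at the path in `Sys.getenv("R_LIBS_USER")`):

```r
# Create a personal library and install into it (path is an example).
dir.create("~/R/library", recursive = TRUE, showWarnings = FALSE)
.libPaths("~/R/library")  # put the personal library first on the search path

install.packages("somepackage",  # hypothetical package name
                 lib   = "~/R/library",
                 repos = "https://cloud.r-project.org")
```

If the package compiles cleanly, this "just works"; the tricky cases are the ones with system-level dependencies, which is where admin help comes in.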
Now … the problem is, how will a user know whether this request is for a “one-off” or a “popular” package?
The distinction between popular and one-off is like beauty: it's in the eye of the beholder, and the average user tends to define "popular" as whatever they need right now. What follows doesn't answer your direct question; it's a recipe for avoiding it altogether.
Assumptions:
Disk space is cheaper than time spent debating the breaking point between “popular” and “one-off”
Anything that can be scripted can be forgotten until the script breaks.
From that follows:
# This is an R script to install all CRAN and Bioconductor packages.
#
# Set up a CRAN mirror to use.
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://lib.stat.cmu.edu/R/CRAN"
  options(repos = r)
})

# Install BiocManager, if not installed.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

# Install everything available that isn't installed yet,
# updating anything that is already present.
BiocManager::install(BiocManager::available(include_installed = FALSE),
                     Ncpus = 16, update = TRUE, ask = FALSE)
warnings()

# Count the installed packages.
pkg <- installed.packages()
length(rownames(pkg))
This will take a while. I submit it as a job and let it run for a week. Failures will vary based on which dependent system libraries the various packages find or don't find, but it's re-runnable after adding dependencies, and it will just try to do the Right Thing™ each time it runs. After a few runs and a few added dependencies, it is pretty easy to get to a nearly complete installation and eliminate a lot of generic "please install R package X …" requests. Saves me time, saves users time, everybody wins.
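Between runs, a quick way to see what is still failing is to diff what the repositories advertise against what actually got installed (a sketch; it assumes BiocManager is already set up as in the script above):

```r
# Packages that should exist but did not install on the last run.
missing <- setdiff(BiocManager::available(), rownames(installed.packages()))
length(missing)  # how many are left to chase
head(missing)    # a sample to start hunting system dependencies for
```

Grepping the job's log for those names usually points straight at the missing header or library.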
Having now quickly picked all the low-hanging fruit, the remaining one-off fruit higher up the tree can be handled on a case-by-case basis, on the assumption that if someone asked for it, they probably need it, and installing it probably takes less time than debating whether to install it or not. Every few months a batch upgrade brings everything up to date; alternatively, when doing a one-off install, R will prompt to upgrade and solve both problems at once.
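The periodic batch upgrade can be as simple as this (a sketch; Ncpus = 16 mirrors the worker count in the install script, tune it to taste):

```r
# Refresh everything already installed, CRAN and Bioconductor alike.
update.packages(ask = FALSE, Ncpus = 16)
BiocManager::install(update = TRUE, ask = FALSE)
```

Submitted as a cron-driven batch job, it needs no babysitting beyond skimming the log afterwards.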
Anticipating the follow-up question, "but griznog, with all those packages doesn't it take a long time to start R?": that doesn't really seem to be an issue. R finds libraries just fine with a lot of them installed (though it does take a while to count them all). Plus, a significant subset of R users probably came to R from MATLAB, and you could put a Sys.sleep(300) before every command they run and they'd still think it was super-fast.
The definition of "popular" is entirely subjective and cannot be used as a criterion for installation. I'm of the opinion that the cluster should provide the base software (R/RStudio) and then empower users to create reproducible environments for the packages they need. What does this mean? Provide examples and documentation for local installs. Encourage users to use virtual environments. Encourage users to use containers that can be shared with others. And if they need help installing a library for a package or creating the container/environment, you've got them covered.
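For the reproducible-environment route, renv is one option the documentation could point users at (a sketch; the package name is a hypothetical example):

```r
# In the user's project directory: create a per-project library,
# install into it, and record the exact versions used.
install.packages("renv")
renv::init()                     # per-project library + lockfile
install.packages("somepackage")  # hypothetical package the user needs
renv::snapshot()                 # write versions to renv.lock
# A collaborator runs renv::restore() to rebuild the same environment.
```

Committing renv.lock alongside the analysis code is what makes the environment shareable rather than a snowflake in one person's home directory.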
We usually provide packages system-wide because many R packages have system-level requirements (compilers, headers, external libraries) that a normal user would not know about, and compilation can be tricky sometimes. That said, our users can do whatever they want in their local library until they get stuck trying to install something, and then we step in.