Setting Up R for Bioinformatics Workflows

R bioconductor

A Windows user oriented guide to the initial set-up for R.

Feifei Li
6-03-2021

I decided to write this because some of the steps in setting up R for the new version of Bioconductor can be painful on Windows, and nobody teaches you this in school. On UNIX/Linux, getting things set up can take as little as one or two commands.

R Setup

If you would like to follow this guide step by step, I recommend setting up the package directory first (see R Package Directory) if you have never done so.

R Version

We are going to use Bioconductor extensively in various bioinformatics workflows. Each version of Bioconductor requires a particular version of R (see Bioconductor). Here I use the latest version of Bioconductor as of the date of this post, which requires R version 4.1.0 or higher.

To check if you have the right version of R installed:

## Get the current version of R on your computer as a version object
## (version objects compare component by component, unlike plain strings)
Rver <- getRversion()
## Update R if your R version is below the requirement
if (Rver < "4.1.0") {
    if (!requireNamespace("installr", quietly = TRUE)) {
        if (!requireNamespace("devtools", quietly = TRUE)) {
            ## install installr from CRAN if no devtools
            install.packages("installr")
        } else {
            ## use devtools to install the latest installr version from GitHub
            devtools::install_github("talgalili/installr")
        }
    }
    installr::updateR()
}
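Version numbers behave differently when compared as plain strings than when compared as version objects, which matters for a check like the one above. The following is a minimal sketch using only base R (`numeric_version()` and `getRversion()`); the version numbers 4.10.0 and 4.2.0 are made-up examples, not requirements from this guide.

```r
## Character strings are compared character by character,
## so "4.10.0" sorts before "4.2.0" even though 4.10.0 would be the newer release.
as_strings <- "4.10.0" > "4.2.0"

## Version objects compare each numeric component in turn.
as_versions <- numeric_version("4.10.0") > numeric_version("4.2.0")

## getRversion() returns a version object for the running R session,
## and it can be compared directly against a version string.
meets_requirement <- getRversion() >= "4.1.0"

print(c(as_strings = as_strings, as_versions = as_versions))
print(meets_requirement)
```

Comparing through a version object keeps the check correct once a minor or patch number reaches two digits.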

requireNamespace("installr") checks whether the package installr is already installed. It returns a logical value indicating whether the package is available, along with a message in the R console if you didn’t set quietly = TRUE. You might have seen many people use require(), which also returns a logical value. The difference is that require() also attaches the package to the search path if it is installed; if it is not, require() returns FALSE with a warning and does not install anything. This makes requireNamespace("installr") the better option when all you want is to check whether a package is installed. Note that load() is unrelated to packages: it restores saved R objects from a file.
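To make the behaviour of requireNamespace() concrete, here is a small sketch. It assumes that the tools package is available (it ships with R) and that no package named notARealPackage123 exists; that name is a made-up placeholder.

```r
## A package name assumed not to exist on any machine (hypothetical)
fake_pkg <- "notARealPackage123"

## Record which packages are attached before the checks
attached_before <- search()

has_tools <- requireNamespace("tools", quietly = TRUE)   # ships with R
has_fake  <- requireNamespace(fake_pkg, quietly = TRUE)  # not installed

## Record the search path again after the checks
attached_after <- search()

## requireNamespace() reports availability without attaching anything
nothing_attached <- identical(attached_before, attached_after)
print(c(has_tools = has_tools, has_fake = has_fake,
        nothing_attached = nothing_attached))
```

Because the search path is unchanged, the check has no side effect on which functions are visible in your session.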

I didn’t load the package with library(package.name) to use its functions, which is what people commonly do. Instead, I call the function as package.name::function.name(), and I encourage readers to do the same because it tells anyone reading your code exactly which package each function comes from. It might also help you understand your own code better when you come back to it. I prefer to keep things explicit because it avoids confusion. However, there are exceptions where a package has to be loaded with library(), which we will see later when we try to map HGNC symbols from Entrez IDs.
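As a tiny illustration of the package.name::function.name() style, the sketch below calls two functions from packages that ship with R. The vector `values` is made-up example data.

```r
## Made-up example data
values <- c(3, 1, 4, 1, 5)

## The prefix makes it obvious that median() comes from stats
middle <- stats::median(values)

## ... and that head() comes from utils
first_two <- utils::head(values, 2)

print(middle)
print(first_two)
```

Neither call requires library(stats) or library(utils) to have been run first.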

Enough lecturing; back to the R setup itself. Once the code above is run, you will be prompted to install the latest version of R if the current version on your computer does not meet the requirement. Once the installation is complete, restart RStudio. Don’t just use Session > Restart R, because it won’t switch RStudio to the newly installed R version. In the new R session, check your package library directory with .libPaths(), as the newly updated R will change it to an R-version-specific directory. If the User installation was selected during the installation, the directory could be C:$env:USERNAME-library. If the default system installation was selected, it could be C:Files.0. Sometimes it ends up in X:\{your R directory}.0.

If none of these is where you want your packages to live, you can change it as described in the next section.
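A quick way to inspect the library directories from the R console is sketched below. It uses only the base function .libPaths(); the variable names are illustrative.

```r
## All library directories R currently searches, in order
lib_dirs <- .libPaths()
print(lib_dirs)

## The first entry is where install.packages() installs by default
default_lib <- lib_dirs[1]
print(default_lib)
```

If the first entry is not the directory you expect, new packages will not end up where you think they do.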

R Package Directory

I keep a dedicated directory for R packages. This way, I don’t have to re-install or migrate R packages from the previous version of R after an update. To change the default R package directory, enter the following in PowerShell:

Add-Content C:\Users\$env:USERNAME\Documents\.Renviron R_LIBS="{path to your package directory}"
Add-Content C:\Users\$env:USERNAME\Documents\.Renviron R_LIBS_USER="{path to your package directory}\\user"
Add-Content C:\Users\$env:USERNAME\Documents\.Renviron R_LIBS_SITE="{path to your package directory}\\site"

Don’t forget to use double backslashes (\\) in the directory path.
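After restarting R, you can check whether the settings were picked up. The sketch below only reads the three environment variables named above with the base function Sys.getenv(); the variable names `lib_vars`, `lib_settings` and `unset_vars` are illustrative.

```r
## Names of the library-related environment variables written to .Renviron
lib_vars <- c("R_LIBS", "R_LIBS_USER", "R_LIBS_SITE")

## Sys.getenv() returns "" for any variable that is not set
lib_settings <- Sys.getenv(lib_vars)
print(lib_settings)

## Variables that are still empty were not set for this session
unset_vars <- names(lib_settings)[lib_settings == ""]
print(unset_vars)
```

Any variable listed in `unset_vars` is worth double-checking in your .Renviron file.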

If this is your first time setting up a default package library, you will have to re-install knitr and rmarkdown after the update.

Bioconductor

A must-have for bioinformatics workflows.

BiocManager

Here we will install Bioconductor 3.13, the latest release at the time of writing:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.13")

This doesn’t install the whole of Bioconductor. Bioconductor itself is not really “a piece of software” but a collection of over 1000 R packages (Gentleman et al. 2004), which would take forever to install on a single machine. BiocManager is therefore more of a package manager, playing the role for Bioconductor that install.packages() plays for CRAN, just as its name suggests.

Packages

Bioconductor follows a package release schedule different from CRAN, so to install Bioconductor packages we don’t use install.packages(). Instead, for example, to install edgeR, a package used extensively in processing RNA-seq data:

if (!requireNamespace("edgeR", quietly = TRUE))
    BiocManager::install("edgeR", ask = FALSE)
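The same check-then-install pattern extends to several packages at once. The sketch below is one possible way to do it, not an official Bioconductor recipe: `missing_packages()` is a hypothetical helper, limma is only an example of a second package, and the install step is guarded with interactive() so nothing is installed when the code runs in a script.

```r
## Hypothetical helper: return the subset of `pkgs` that is not installed
missing_packages <- function(pkgs) {
    is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
    pkgs[!is_installed]
}

## Example package list; adjust to your own workflow
to_install <- missing_packages(c("edgeR", "limma"))
print(to_install)

## Only install from an interactive session, and only what is missing
if (interactive() && length(to_install) > 0) {
    BiocManager::install(to_install, ask = FALSE)
}
```

Passing only the missing packages avoids re-installing the ones you already have.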

R Startup Behaviour

The Rprofile.site file lets users define behaviours at the startup and at the end of an R session. R sources Rprofile.site at startup. On Windows, it is located in X:{path to your R installation}.0. It should already be there; if not, just create an empty text file and rename it. Don’t confuse it with .Rprofile: Rprofile.site is the site-wide profile of an R installation, whereas .Rprofile is a per-user (or per-project) profile, and both exist on Windows as well as on UNIX/Linux.
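If you are unsure where the file lives on your machine, R can tell you. The sketch below assumes the default location, i.e. the etc directory of the running R installation as reported by R.home("etc"); the variable name `site_profile` is illustrative.

```r
## Default location of the site-wide profile for the running R installation
site_profile <- file.path(R.home("etc"), "Rprofile.site")
print(site_profile)

## Report whether the file already exists
if (file.exists(site_profile)) {
    message("Found: ", site_profile)
} else {
    message("Not found; create an empty file at: ", site_profile)
}
```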

Tab Width

RStudio by default uses a tab width of 2 spaces, which is consistent with Google’s R Style Guide and the Tidyverse Style Guide. If you would like to follow the Bioconductor Style Guide, which uses 4 spaces, then add this at the top of your Rprofile.site:

options(tab.width = 4)

And welcome to the 4-space camp :)

.First

.First is the function in Rprofile.site that actually allows you to define the startup behaviour of an R session:

.First <- function() {
    ## the first thing you want at startup
    ## the second thing you want at startup
}

Working Directory

R resets to a default “working directory” (most likely where you installed R) every time you open it, unless RStudio is opened with an R project (.Rproj). Having to set it to your own working directory manually every time gets annoying, so if you have created one, let .First do it for you:

.First <- function(){
    setwd("{path to your working directory}")
}

You can also load your helper functions from your utility scripts at startup:

.First <- function(){
    setwd("{path to your working directory}")
    source("{path to your utility scripts}")
}
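A slightly more defensive variant is sketched below, assuming you would rather get a warning than an error at startup when a path is wrong. The two path strings are the same placeholders as above and must be replaced with your own; the variable names are illustrative.

```r
.First <- function() {
    ## Placeholders: replace with your own paths
    work_dir <- "{path to your working directory}"
    utility_script <- "{path to your utility scripts}"

    ## Only change directory if it actually exists
    if (dir.exists(work_dir)) {
        setwd(work_dir)
    } else {
        warning("Working directory not found: ", work_dir)
    }

    ## Only source the helper script if it actually exists
    if (file.exists(utility_script)) {
        source(utility_script)
    } else {
        warning("Utility script not found: ", utility_script)
    }
}
```

With the placeholders left unchanged, calling .First() only emits two warnings and leaves the working directory untouched.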

.Last

Similar to .First, .Last defines what happens at the end of an R session.

Saving R Objects

To save a single R object (a variable, function, data frame, etc.) to a file every time you close an R session:

.Last <- function(){
    save(object_to_save, file = ".\\data\\name_of_object.RData")
}

where data is a directory inside the working directory used to store data, if you have created one.

And to load the object:

load(".\\data\\name_of_object.RData")

which could also be added to your .First if you need to use the object every time.
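To see the save and load round trip in one place, here is a self-contained sketch. It writes to a temporary directory rather than your working directory, builds the path with file.path() instead of hard-coded backslashes, and uses a made-up data frame as the object to save.

```r
## Use a throwaway data directory so nothing in your project is touched
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)

## Made-up example object
object_to_save <- data.frame(gene = c("TP53", "BRCA1"), count = c(10, 20))

## Save the object, then remove it from the session
rdata_file <- file.path(data_dir, "name_of_object.RData")
save(object_to_save, file = rdata_file)
rm(object_to_save)

## load() restores the object under its original name
## and returns the names of everything it restored
restored_names <- load(rdata_file)
print(restored_names)
print(object_to_save)
```

Note that load() brings the object back under the name it was saved with, so it can silently overwrite an existing object of the same name.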

Closing Thoughts

I know Docker is a popular option for bioinformatics pipelines. It creates a consistent environment specific to the pipeline regardless of the operating system it runs on, and saves you the hassle of configuring R. The out-of-the-box experience is nice, but the downside is that Docker containers running in the background eat up computational resources (e.g. memory). In addition, if R runs in a Docker container, communicating with bioinformatics tools running on the host via their R API requires some re-mapping of files or ports.

References

Gentleman, Robert C, Vincent J Carey, Douglas M Bates, Ben Bolstad, Marcel Dettling, Sandrine Dudoit, Byron Ellis, et al. 2004. “Bioconductor: Open Software Development for Computational Biology and Bioinformatics.” Genome Biology 5 (10): 1–16.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/ff98li/ffli.dev, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Li (2021, June 3). ffli.dev: Setting Up R for Bioinformatics Workflows. Retrieved from https://www.ffli.dev/posts/2021-06-03-setting-up-r-for-bioinformatics-workflows/

BibTeX citation

@misc{li2021setting,
  author = {Li, Feifei},
  title = {ffli.dev: Setting Up R for Bioinformatics Workflows},
  url = {https://www.ffli.dev/posts/2021-06-03-setting-up-r-for-bioinformatics-workflows/},
  year = {2021}
}