Galaxy@EMBL
Local Galaxy instance
At your side to solve your daily data management and NGS data analysis challenges
While Galaxy is most accessible and efficient for standard analyses, more advanced statistical modeling or visualization usually requires specialized code, which can be written and executed in R on our RStudio Server instance.
Workflow modeling, with Galaxy and other Workflow Management Systems (WMS), to achieve better analysis automation and reproducibility is also in our area of expertise, and we can provide advice and support to beginner-to-advanced researchers.
Galaxy is a web application that lets you perform reproducible data analyses in a user-friendly graphical interface.
Everyone at EMBL has access to our Galaxy instance. Login happens with your EMBL credentials.
Galaxy can be used by anyone, but it is an especially valuable asset for bench scientists with little computing experience, as it makes it easy to run the most commonly used bioinformatics software.
A variety of bioinformatics tools are available in a few clicks, including the most popular NGS, proteomics, and image analysis software. More can be deployed on demand when there is specific interest, so do not hesitate to contact us.
Resource-intensive jobs launched with Galaxy are automatically executed on EMBL’s high-performance computing infrastructure (maintained by ITS). This means no additional hassle for researchers who have never used an HPC cluster before.
A quota of 200Gb is allocated to each user. We encourage users to download useful analysis results to their group share as soon as they are produced, and we expect users to clean and purge useless data from their histories in order to recover disk space.
Galaxy is not to be used as storage, and we cannot guarantee that data will be kept long-term.
This is the quickest option, but we do not recommend it for bigger files and/or files that are already stored on your group share, as this will unnecessarily consume your quota and potentially duplicate data.
The data available on your group share at /g/<groupname> can be linked directly to your Galaxy data library.
This has to be done by an admin. To do so, please open a request with us, including the list of files that need to be made available. This avoids unnecessary data duplication, which saves your group resources (e.g. disk space).
Connections have been established between our Galaxy instance and LabID, our data management platform. Datasets can be transferred from LabID to Galaxy in a few clicks (and without data duplication). Sending data back from Galaxy to LabID is currently in beta testing. This allows you to permanently store Galaxy analysis results and reference them in lab notes, linking them to samples, annotations, protocols, reagents, etc.
Local Galaxy instance
Collection of tutorials developed and maintained by the worldwide Galaxy community
Internal chatroom for our Galaxy users, to get advice and troubleshoot issues
RStudio – sometimes now referred to as Posit™ Workbench – is a powerful Integrated Development Environment for R, the go-to programming language for bioinformaticians and statisticians aiming at extracting valuable information from experimental data.
Everyone at EMBL has access to our RStudio Server instance. Login happens with your EMBL credentials.
RStudio Server has access to the EMBL file system, including your group share, so you can access your data directly by referring to its path (e.g. on your group share).
The machine running RStudio Server is powerful, but it is a shared resource accessible to all EMBL scientists. Be mindful of others.
Each session is limited to 40Gb of memory. Please refrain from opening multiple sessions at once. We will kill your sessions if they jeopardize the work of others.
Resource-intensive jobs have to be run on the cluster. This is especially true for parallelized jobs using multiple cores and a lot of memory.
You typically have access to 3 different types of R install:
1. R from modules bundled with Bioconductor libraries (R-bundle-Bioconductor/<version>, e.g. R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2)
2. R from modules (R/<version>, e.g. R/4.3.2-gfbf-2023a)
3. R compiled from source (usually for the latest versions, currently R 4.3.3)
R versions from modules ((1), (2)) have been optimised to run on our infrastructure and the same modules are available on RStudio, on Seneca, and on the cluster. The other R versions compiled from source are only available on RStudio and via command-line on Seneca.
You can list all the available module versions from a shell on Seneca (e.g. module avail R-bundle-Bioconductor).
You cannot install your own version of R and use it within RStudio.
All versions are handled with Easybuild, the software framework used and maintained by ITS. New versions can be installed by us or by ITS, provided the install recipe has been released by Easybuild and is available in their GitHub repository.
We also advise against installing your own R version locally or on Seneca with e.g. conda because this will critically limit the reproducibility of your analysis.
We encourage you to play around and install as many libraries as you want; however, please consider the following:
An overview of the basic settings to run RStudio is available at https://git.embl.de/-/snippets/94.
The default install location for libraries is your home directory (~), whose disk space is limited by a quota.
By installing too many libraries, you will eventually hit the quota and start experiencing disk space errors. You can circumvent this issue by configuring R to install libraries somewhere else, for example on your group share. To achieve this, please create a .Renviron file in your home folder (~/.Renviron) and set the R_LIBS_USER variable:
R_LIBS_USER="/g/‹your_group›/‹your_username›/R-libs/%p/%V" # Which resolves to e.g. /g/‹your_group›/‹your_username›/R-libs/x86_64-pc-linux-gnu/4.2.1
Make sure to use the variables %p and %V (resolved respectively as the system architecture name and the R version) so that R adequately maintains version-specific library install folders. This is important to avoid dependency conflicts when changing R versions.
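As a sketch, the setup above can be done from a shell. The group share path below is a placeholder you must replace with your own, and the example writes to a scratch file (instead of the real ~/.Renviron) so it is harmless to run as-is:

```shell
# Sketch: set R_LIBS_USER in .Renviron. The /g/my_group/my_user path is a
# placeholder; substitute your actual group share location. We write to a
# temporary file here instead of the real ~/.Renviron for illustration.
RENVIRON="$(mktemp)"   # in practice: "$HOME/.Renviron"
# Single quotes keep %p and %V literal: R itself expands them to the platform
# name and the R version when resolving the library path.
echo 'R_LIBS_USER="/g/my_group/my_user/R-libs/%p/%V"' >> "$RENVIRON"
cat "$RENVIRON"
```

After restarting your R session, `.libPaths()` should list the new location first.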
Please update a pre-installed library only when needed (i.e. when you are solving a dependency issue or when you know a newer version contains a critical bug fix).
As explained above, R comes with the Bioconductor bundle and therefore has an extensive list of pre-installed libraries, each pinned to a specific version (the one listed in the Easybuild recipe). Updating all libraries – as sometimes advised by R – is not recommended here: it would download and install newer versions within your library folder, but these would not be compiled in the optimised way our pre-installed ones were, and would therefore run less efficiently.
We advise using the latest R-bundle-Bioconductor, as it is a stable version of R that comes with a set of standard dependencies out of the box (e.g. ggplot2), which you therefore do not need to install on your own (saving us all time, storage, and the associated CO2 emissions). When using this bundle, we also advise against blindly updating any of the R libraries included in the Bioconductor bundle: updating a library in effect installs your own copy of it in your library folder (leading to a “duplicated” install for not much benefit) and updates many dependencies (potentially causing compatibility issues for other packages relying on older dependency versions). You may still need to update a handful of them when you attempt to install a new R package that depends on a newer version of a dependency.
When you use another R install, please consider:
– R modules (without Bioconductor) miss the Bioconductor dependencies; however, they are still bundled with a set of standard UNIX dependencies and should therefore be preferred over R versions compiled from source. Missing system dependencies can usually still be installed upon request.
– R compiled from source is to be used when you want access to the newest versions of R (the module version of R usually arrives months after an R release). However, consider that it will lack even some of the most standard system libraries if they are not installed on Seneca. We may be able to install them from the CentOS package manager, but we cannot guarantee we will be able to install all of them, as we may end up in dependency conflicts, which is exactly what we are trying to avoid by using modules.
When installing your own libraries – and assuming you configured R_LIBS_USER in your .Renviron properly (see above) – your libraries install on your group share at a path like /<my_group_share>/<user>/R/x86_64-pc-linux-gnu-library/4.3.2 (you may check this by running Sys.getenv("R_LIBS_USER") in R). As visible in this path, library installs are separated by minor R version to avoid conflicts between R versions.
1. R package installs are separated for different minor versions of R (i.e. a package installed for 4.3.1 is separate from the install of the same package for 4.3.2).
2. Two R installs of the same minor R version (e.g. R/4.3.2-gfbf-2023a and R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2) share the same library folder.
3. Your own install of an R package prevails over the module/bundle install.
This means you may encounter dependency conflicts when switching to a version that is not compatible with your own install of an R package. This will happen, for example, when using R/4.3.2-gfbf-2023a, installing a Bioconductor package X, and later switching to R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2, which relies on a different version of the same package X. You will have to solve such dependency conflicts on your own, which typically means removing your library folder for this version of R and re-installing what you need.
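Removing the per-version library folder can be sketched as follows; the example runs against a temporary directory so it is safe to try, and the real path on your group share (derived from your R_LIBS_USER setting) appears only as a comment:

```shell
# Sketch: wipe your own package installs for one minor R version only.
# On Seneca the real path would come from your R_LIBS_USER setting, e.g.
#   /g/<your_group>/<your_username>/R-libs/x86_64-pc-linux-gnu/4.3.2
LIBROOT="$(mktemp -d)"                                # stand-in for .../R-libs/x86_64-pc-linux-gnu
mkdir -p "$LIBROOT/4.3.1/pkgA" "$LIBROOT/4.3.2/pkgA"  # pretend installs exist for two R versions
rm -rf "$LIBROOT/4.3.2"                               # remove only the conflicting 4.3.2 library
ls "$LIBROOT"                                         # 4.3.1 installs stay untouched
```

The bundle's own copies of the packages are unaffected; only your personal installs for that R version are removed.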
Problem: RStudio stores session information in the user’s home directory at /home/<username>. This can lead to issues when hitting the disk quota (50Gb).
Solution: Move the ~/.local/share/rstudio directory to another disk without a quota (preferably /tmpdata on Seneca, or your group share).
mkdir -p /tmpdata/$USER
mv ~/.local/share/rstudio /tmpdata/$USER/rstudio && ln -s /tmpdata/$USER/rstudio ~/.local/share/rstudio
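If you want to rehearse this move before touching your real session data, the same steps can be dry-run against scratch directories (every path below is a temporary stand-in, not a real location):

```shell
# Dry-run sketch of the move above, using temporary stand-ins for
# ~/.local/share/rstudio and /tmpdata/$USER.
FAKE_HOME="$(mktemp -d)"                     # stand-in for ~
FAKE_TMPDATA="$(mktemp -d)"                  # stand-in for /tmpdata/$USER
mkdir -p "$FAKE_HOME/.local/share/rstudio"   # pretend session data exists
mv "$FAKE_HOME/.local/share/rstudio" "$FAKE_TMPDATA/rstudio" \
  && ln -s "$FAKE_TMPDATA/rstudio" "$FAKE_HOME/.local/share/rstudio"
ls -l "$FAKE_HOME/.local/share"              # rstudio now points at the new location
```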
Problem: Conversion of RMarkdown (.rmd) files to HTML or PDF fails with errors related to X11 display or Invalid argument.
Solution: Create or update your ~/.Rprofile to add the following line:
options(bitmapType='cairo')
If, specifically, you want to generate PNG images inside HTML output, you can also use following Rmd preamble:
---
output:
  html_document:
    dev: CairoPNG
---
Problem: An R package using a specific/outdated Python version produces dependency conflicts with our default Python version.
Solution: Use the reticulate package’s use_python(), use_virtualenv(), or use_condaenv() functions. For example:
library(reticulate)
use_condaenv('my-project')
Local RStudio instance
Internal chatroom for our RStudio users, to get advice and troubleshoot issues
To achieve better automation and reproducibility of analyses, we strongly encourage the use of analysis workflows and Workflow Management Systems (WMS).
We assist less computer-savvy colleagues with their standard NGS data analyses (RNA-seq, ChIP-seq, ATAC-seq, HiC, scRNA-seq…) by providing ready-to-use Galaxy workflows.
Non-standard analysis workflows have to be developed by you; nevertheless, we can teach you the basics of Galaxy so that you can assemble your own workflow in no time.
Our expertise in domains other than NGS is limited; however, we can still help you assemble your own workflow.
GBCS has regularly provided training internally, and the Galaxy Training Network provides material to learn on your own. This covers a wide range of domains, including sequencing, microscopy, proteomics, metabolomics, etc.
For bioinformaticians proficient with command line tools, we advise looking into command-line based WMS. The most commonly used at EMBL are Nextflow and Snakemake*.
(*) We cannot recommend one WMS over another. Snakemake and Nextflow are both powerful tools, and other WMS exist out there. Picking the right tool is a hot topic in life sciences; many aspects are to be considered, and the choice ultimately is yours. However, we at GBCS do have better expertise in Nextflow.
If your group is part of the GB Unit, we can provide further support and collaborate on workflow development. This can mean, for example, developing a custom Galaxy or Nextflow workflow, or collaborating with bioinformaticians in your group on the development of a Nextflow workflow in order to teach them best practices of software development with git and of modular workflow development.
We maintain a super computer named Seneca, which we use to run RStudio Server. This computer can be accessed via ssh and is connected to your group share (as well as to a local scratch space, /tmpdata). It can be used to run basic Unix commands and resource-inexpensive processing.
Everyone at EMBL has access to Seneca. Login happens remotely via ssh to seneca.embl.de (when connected to the EMBL network).
Seneca is configured as a SLURM submit host and can therefore be used to submit cluster jobs, like login01.cluster.embl.de or login02.cluster.embl.de. Find more information on the ITS Cluster Wiki.
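As an illustration, a minimal SLURM batch script submitted from Seneca might look like the following. The resource values, the module name, and my_analysis.R are all placeholders; check the ITS Cluster Wiki for the options valid on the EMBL cluster:

```shell
# Hypothetical minimal SLURM batch script (resource values are examples only).
# We write it to a temporary file here for illustration.
JOB="$(mktemp)"
cat > "$JOB" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
module load R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2
Rscript my_analysis.R   # my_analysis.R is a placeholder for your own script
EOF
cat "$JOB"
# Submit with:  sbatch job.sh   (monitor with: squeue -u $USER)
```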
The majority of software and software versions are handled with Easybuild, the software framework used and maintained by ITS. Software is compiled specifically against the platform it runs on and is therefore optimised. A specific version of a software package, compiled with a specific toolchain, is referred to as an environment module. Modules are loaded into the user environment on demand, by the users themselves, using the module command. Loading a given module also loads all the software dependencies it needs.
Easybuild builds software modules. Linux comes with the module command-line tool (we use Lmod) to interact with modules: typically to load them into your environment, list the existing and/or loaded ones, etc.
module avail lists all modules.
module avail <string> lists all modules with <string> in their name (case insensitive); e.g. module avail python returns Python and IPython modules, etc.
module spider and module spider <string> do a similar job.
module load <module_name> [<module2_name> ...] loads the given module(s); e.g. module load Python/3.10.8-GCCcore-12.2.0 SciPy-bundle/2023.02-gfbf-2022b loads both Python and SciPy. Find names with the avail or spider commands. When possible, load matching toolchain versions, i.e. versions that have been compiled with the same toolchain.
NB 1: When loading multiple modules and hitting a dependency conflict, the last loaded module wins, i.e. the last module that needs the dependency dictates the loaded version of said dependency.
module list lists all the loaded modules. Even after explicitly loading a single module, the list may contain multiple modules: loading a module loads the given one plus all the modules it depends on. For example, loading R-bundle-Bioconductor/3.16-foss-2022b-R-4.2.2 effectively loads R, Bioconductor, as well as 123 other dependencies.
module unload <module_name> unloads a given module and all obsolete dependencies.
module purge unloads all the loaded modules.
You cannot install your own software with Easybuild*.
When you identify a piece of software that is not available, you can request its installation from us or from IT Services. On our side, installation should not take long, provided that either (1) an official Easybuild recipe exists, or (2) the install procedure is standard and follows best practices.
As an alternative, you may also use virtual environment managers (like conda), but we provide only limited support for them.
* Effectively you could maintain your own Easybuild install, but this is advanced usage and out of scope of this document
The machine running RStudio Server is powerful, but it is a shared resource accessible to all EMBL scientists. Be mindful of others.
Do not run resource intensive jobs on this machine or they will be killed.