Galaxy@EMBL
Local Galaxy instance
At your side to solve your daily data management and NGS data analysis challenges
While Galaxy is most accessible and efficient for standard analyses, more advanced statistical modeling or visualization usually requires specialized code, which can be written and executed in R on our RStudio Server instance.
Workflow modeling, with Galaxy and other Workflow Management Systems (WMS), to achieve better analysis automation and reproducibility is also in our area of expertise, and we can provide advice and support to beginner-to-advanced researchers.
Galaxy is a web application that lets you perform reproducible data analyses in a user-friendly graphical interface.
Everyone at EMBL has access to our Galaxy instance. Login happens with your EMBL credentials.
Galaxy can be used by anyone, but it is an especially valuable asset for bench scientists with little computing experience, as it makes it easy to run the most commonly used bioinformatics software.
A variety of bioinformatics tools are available in a few clicks, including the most popular NGS, proteomics, and image analysis software. More can be deployed on demand when there is specific interest, so do not hesitate to contact us.
Resource-intensive jobs launched with Galaxy are automatically executed on EMBL’s high-performance computing infrastructure (maintained by ITS). This means no additional hassle for researchers who have never used an HPC cluster before.
A quota of 200Gb is allocated to each user. We encourage users to download useful analysis results to their group share as soon as they are produced, and we expect users to clean and purge useless data from their histories in order to recover disk space.
Galaxy is not to be used as storage, and we cannot guarantee that data will be kept long-term.
This is the quickest option, but we do not recommend it for bigger files and/or files that are already stored on your group share, as this will unnecessarily consume your quota and potentially duplicate data.
The data available on your group share at /g/<groupname> can be linked directly to your Galaxy data library.
This has to be done by an admin. To do so, please open a request with us, including the list of files that need to be made available. This avoids unnecessary data duplication, which saves your group resources (e.g. disk space).
Connections have been established between our Galaxy instance and LabID, our data management platform. Datasets can be transferred from LabID to Galaxy in a few clicks (and without data duplication). Sending data back from Galaxy to LabID is currently in beta testing. This allows you to permanently store Galaxy analysis results and reference them in lab notes, linking them to samples, annotations, protocols, reagents, etc.
Local Galaxy instance
Collection of tutorials developed and maintained by the worldwide Galaxy community
Internal chatroom for our Galaxy users, to get advice and troubleshoot issues
RStudio – sometimes now referred to as Posit™ Workbench – is a powerful Integrated Development Environment for R, the go-to programming language for bioinformaticians and statisticians aiming at extracting valuable information from experimental data.
Everyone at EMBL has access to our RStudio Server instance. Login happens with your EMBL credentials.
RStudio Server has access to the EMBL file system, including your group share, so you can access your data directly by referring to its path (e.g. on your group share).
The machine running RStudio Server is powerful, but it is a shared resource accessible to all EMBL scientists. Be mindful of others.
Each session is limited to 40Gb of memory. Please refrain from opening multiple sessions at once. We will kill your sessions if they jeopardize the work of others.
Resource-intensive jobs have to be run on the cluster. This is especially true for parallelized jobs using multiple cores and a lot of memory.
You typically have access to 3 different types of R install:
1. R from modules bundled with Bioconductor libraries (R-bundle-Bioconductor/<version>, e.g. R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2)
2. R from modules (R/<version>, e.g. R/4.3.2-gfbf-2023a)
3. R compiled from source (usually for the latest versions, currently R 4.3.3)
R versions from modules ((1), (2)) have been optimised to run on our infrastructure and the same modules are available on RStudio, on Seneca, and on the cluster. The other R versions compiled from source are only available on RStudio and via command-line on Seneca.
You can list all the available module versions from a shell on Seneca (e.g. module avail R-bundle-Bioconductor).
You cannot install your own version of R and use it within RStudio.
All versions are handled with Easybuild, the software framework used and maintained by ITS. New versions can be installed by us or by ITS, provided the install recipe has been released by Easybuild and is available in their GitHub repository.
We also advise against installing your own R version locally or on Seneca with e.g. conda because this will critically limit the reproducibility of your analysis.
We encourage you to play around and install as many libraries as you want; however, please consider the following:
An overview of the basic settings to run RStudio is available at https://git.embl.de/-/snippets/94.
The default install location for libraries is your home directory (~), whose disk space is limited by a quota.
By installing too many libraries, you will eventually hit the quota and start experiencing disk space errors. You can circumvent this issue by configuring R to install libraries somewhere else, for example on your group share. To achieve this, please create a .Renviron file in your home folder (~/.Renviron) and set the R_LIBS_USER variable:
R_LIBS_USER="/g/‹your_group›/‹your_username›/R-libs/%p/%V" # Which resolves to e.g. /g/‹your_group›/‹your_username›/R-libs/x86_64-pc-linux-gnu/4.2.1
Make sure to use the variables %p and %V (resolved respectively as the system architecture name and the R version) so that R adequately maintains version-specific library install folders. This is important to avoid dependency conflicts when changing R versions.
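As a sketch, the setup above can be done from a shell. The group share path below is a placeholder you must replace with your own, and the example writes to a scratch file (instead of the real ~/.Renviron) so it is harmless to run as-is:

```shell
# Sketch: set R_LIBS_USER in .Renviron. The /g/my_group/my_user path is a
# placeholder; substitute your actual group share location. We write to a
# temporary file here instead of the real ~/.Renviron for illustration.
RENVIRON="$(mktemp)"   # in practice: "$HOME/.Renviron"
# Single quotes keep %p and %V literal: R itself expands them to the platform
# name and the R version when resolving the library path.
echo 'R_LIBS_USER="/g/my_group/my_user/R-libs/%p/%V"' >> "$RENVIRON"
cat "$RENVIRON"
```

After restarting your R session, `.libPaths()` should list the new location first.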
Please update a pre-installed library only when needed (i.e. when you are solving a dependency issue or when you know a newer version contains a critical bug fix).
As explained above, R comes with the Bioconductor bundle and therefore has an extensive list of pre-installed libraries, each pinned to a specific version (the one listed in the Easybuild recipe). Updating all libraries – as sometimes advised by R – is not recommended here: it would download and install newer versions within your library folder, but these would not be compiled in the optimised way our pre-installed ones were, and would therefore run less efficiently.
We advise using the latest R-bundle-Bioconductor, as it is a stable version of R that comes with a set of standard dependencies out of the box (e.g. ggplot2), which you therefore do not need to install on your own (saving us all time, storage, and the associated CO2 emissions). When using this bundle, we also advise against blindly updating any of the R libraries included in the Bioconductor bundle: updating a library in effect installs your own copy of it in your library folder (leading to a “duplicated” install for not much benefit) and updates many dependencies (potentially causing compatibility issues for other packages relying on older dependency versions). You may still need to update a handful of them when you attempt to install a new R package that depends on a newer version of a dependency.
When you use another R install, please consider:
– R modules (without Bioconductor) miss the Bioconductor dependencies; however, they are still bundled with a set of standard UNIX dependencies and should therefore be preferred over R versions compiled from source. Missing system dependencies can usually still be installed upon request.
– R compiled from source is to be used when you want access to the newest versions of R (the module version of R usually arrives months after an R release). However, consider that it will lack even some of the most standard system libraries if they are not installed on Seneca. We may be able to install them from the CentOS package manager, but we cannot guarantee we will be able to install all of them, as we may end up in dependency conflicts, which is exactly what we are trying to avoid by using modules.
When installing your own libraries – and assuming you configured R_LIBS_USER in your .Renviron properly (see above) – your libraries install on your group share at a path like /<my_group_share>/<user>/R/x86_64-pc-linux-gnu-library/4.3.2 (you may check this by running Sys.getenv("R_LIBS_USER") in R). As visible in this path, library installs are separated by minor R version to avoid conflicts between R versions.
1. R package installs are separated for different minor versions of R (i.e. a package installed for 4.3.1 is separate from the install of the same package for 4.3.2).
2. Two R installs of the same minor R version (e.g. R/4.3.2-gfbf-2023a and R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2) share the same library folder.
3. Your own install of an R package prevails over the module/bundle install.
This means you may encounter dependency conflicts when switching to a version that is not compatible with your own install of an R package. This will happen, for example, when using R/4.3.2-gfbf-2023a, installing a Bioconductor package X, and later switching to R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2, which relies on a different version of the same package X. You will have to solve such dependency conflicts on your own, which typically means removing your library folder for this version of R and re-installing what you need.
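Removing the per-version library folder can be sketched as follows; the example runs against a temporary directory so it is safe to try, and the real path on your group share (derived from your R_LIBS_USER setting) appears only as a comment:

```shell
# Sketch: wipe your own package installs for one minor R version only.
# On Seneca the real path would come from your R_LIBS_USER setting, e.g.
#   /g/<your_group>/<your_username>/R-libs/x86_64-pc-linux-gnu/4.3.2
LIBROOT="$(mktemp -d)"                                # stand-in for .../R-libs/x86_64-pc-linux-gnu
mkdir -p "$LIBROOT/4.3.1/pkgA" "$LIBROOT/4.3.2/pkgA"  # pretend installs exist for two R versions
rm -rf "$LIBROOT/4.3.2"                               # remove only the conflicting 4.3.2 library
ls "$LIBROOT"                                         # 4.3.1 installs stay untouched
```

The bundle's own copies of the packages are unaffected; only your personal installs for that R version are removed.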
Problem: RStudio stores session information in the user’s home directory at /home/<username>. This can lead to issues when hitting the disk quota (50Gb).
Solution: Move the ~/.local/share/rstudio directory to another disk without a quota (preferably /tmpdata on Seneca, or your group share).
mkdir -p /tmpdata/$USER
mv ~/.local/share/rstudio /tmpdata/$USER/rstudio && ln -s /tmpdata/$USER/rstudio ~/.local/share/rstudio
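If you want to rehearse this move before touching your real session data, the same steps can be dry-run against scratch directories (every path below is a temporary stand-in, not a real location):

```shell
# Dry-run sketch of the move above, using temporary stand-ins for
# ~/.local/share/rstudio and /tmpdata/$USER.
FAKE_HOME="$(mktemp -d)"                     # stand-in for ~
FAKE_TMPDATA="$(mktemp -d)"                  # stand-in for /tmpdata/$USER
mkdir -p "$FAKE_HOME/.local/share/rstudio"   # pretend session data exists
mv "$FAKE_HOME/.local/share/rstudio" "$FAKE_TMPDATA/rstudio" \
  && ln -s "$FAKE_TMPDATA/rstudio" "$FAKE_HOME/.local/share/rstudio"
ls -l "$FAKE_HOME/.local/share"              # rstudio now points at the new location
```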
Problem: Conversion of RMarkdown (.rmd) files to HTML or PDF fails with errors related to X11 display or Invalid argument.
Solution: Create or update your ~/.Rprofile to add the following line:
options(bitmapType='cairo')
If, specifically, you want to generate PNG images inside HTML output, you can also use following Rmd preamble:
---
output:
  html_document:
    dev: CairoPNG
---
Problem: An R package using a specific/outdated Python version produces dependency conflicts with our default Python version.
Solution: Use the reticulate package’s use_python(), use_virtualenv(), or use_condaenv() functions. For example:
library(reticulate)
use_condaenv('my-project')
Local RStudio instance
Internal chatroom for our RStudio users, to get advice and troubleshoot issues
To achieve better automation and reproducibility of analyses, we strongly encourage the use of analysis workflows and Workflow Management Systems (WMS).
We assist less computer-savvy colleagues with their standard NGS data analyses (RNA-seq, ChIP-seq, ATAC-seq, HiC, scRNA-seq…) by providing ready-to-use Galaxy workflows.
Non-standard analysis workflows have to be developed by you; nevertheless, we can teach you the basics of Galaxy so that you can assemble your own workflow in no time.
Our expertise in domains other than NGS is limited; however, we can still help you assemble your own workflow.
GBCS has regularly provided training internally, and the Galaxy Training Network provides material to learn on your own. This covers a wide range of domains, including sequencing, microscopy, proteomics, metabolomics, etc.
For bioinformaticians proficient with command line tools, we advise looking into command-line based WMS. The most commonly used at EMBL are Nextflow and Snakemake*.
(*) We cannot recommend one WMS over another. Snakemake and Nextflow are both powerful tools, and other WMS exist out there. Picking the right tool is a hot topic in life sciences; many aspects are to be considered, and the choice ultimately is yours. However, we at GBCS do have better expertise in Nextflow.
If your group is part of the GB Unit, we can provide further support and collaborate on workflow development. This can mean, for example, developing a custom Galaxy or Nextflow workflow, or collaborating with bioinformaticians in your group on the development of a Nextflow workflow in order to teach them best practices of software development with git and of modular workflow development.
We maintain a super computer named Seneca, which we use to run RStudio Server. This computer can be accessed via ssh and is connected to your group share (as well as to a local scratch space, /tmpdata). It can be used to run basic Unix commands and resource-inexpensive processing.
Everyone at EMBL has access to Seneca. Login happens remotely via ssh to seneca.embl.de (when connected to the EMBL network).
Seneca is configured as a SLURM submit host and can therefore be used to submit cluster jobs, like login01.cluster.embl.de or login02.cluster.embl.de. Find more information on the ITS Cluster Wiki.
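As an illustration, a minimal SLURM batch script submitted from Seneca might look like the following. The resource values, the module name, and my_analysis.R are all placeholders; check the ITS Cluster Wiki for the options valid on the EMBL cluster:

```shell
# Hypothetical minimal SLURM batch script (resource values are examples only).
# We write it to a temporary file here for illustration.
JOB="$(mktemp)"
cat > "$JOB" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --time=00:10:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
module load R-bundle-Bioconductor/3.18-foss-2023a-R-4.3.2
Rscript my_analysis.R   # my_analysis.R is a placeholder for your own script
EOF
cat "$JOB"
# Submit with:  sbatch job.sh   (monitor with: squeue -u $USER)
```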
The majority of software and software versions are handled with Easybuild, the software framework used and maintained by ITS. Software is compiled specifically against the platform it runs on and is therefore optimised. A specific version of a software package, compiled with a specific toolchain, is referred to as an environment module. Modules are loaded into the user environment on demand, by the users themselves, using the module command. Loading a given module also loads all the software dependencies it needs.
Easybuild builds software modules. Linux comes with the module command-line tool (we use Lmod) to interact with modules: typically to load them into your environment, list the existing and/or loaded ones, etc.
module avail lists all modules.
module avail <string> lists all modules with <string> in their name (case insensitive); e.g. module avail python returns Python and IPython modules, etc.
module spider and module spider <string> do a similar job.
module load <module_name> [<module2_name> ...] loads the given module(s); e.g. module load Python/3.10.8-GCCcore-12.2.0 SciPy-bundle/2023.02-gfbf-2022b loads both Python and SciPy. Find names with the avail or spider commands. When possible, load matching toolchain versions, i.e. versions that have been compiled with the same toolchain.
NB 1: When loading multiple modules and hitting a dependency conflict, the last loaded module wins, i.e. the last module that needs the dependency dictates the loaded version of said dependency.
module list lists all the loaded modules. Even after explicitly loading a single module, the list may contain multiple modules: loading a module loads the given one plus all the modules it depends on. For example, loading R-bundle-Bioconductor/3.16-foss-2022b-R-4.2.2 effectively loads R, Bioconductor, as well as 123 other dependencies.
module unload <module_name> unloads a given module and all obsolete dependencies.
module purge unloads all the loaded modules.
You cannot install your own software with Easybuild*.
When you identify a piece of software that is not available, you can request its installation from us or from IT Services. On our side, installation should not take long, provided that either (1) an official Easybuild recipe exists, or (2) the install procedure is standard and follows best practices.
As an alternative, you may also use virtual environment managers (like conda), but we provide only limited support for them.
* Effectively you could maintain your own Easybuild install, but this is advanced usage and out of scope of this document
The machine running RStudio Server is powerful, but it is a shared resource accessible to all EMBL scientists. Be mindful of others.
Do not run resource intensive jobs on this machine or they will be killed.