Open Source Software – Open Science at EMBL

The essential requirements for software at EMBL are:

Use Open Source Software by default.
Share your software in a public software repository.

Many scientific methods and services are implemented in terms of computer code, and many scientific data analyses or models are performed by the execution of computer code. Textual (e.g., English language or mathematical formalism) descriptions of such steps are helpful and desirable for good scientific practice, but typically (esp. with ‘big data’ or complex algorithms) they are not sufficiently unambiguous, and the actual software that was used to program the computer is the only definitive means of enabling reproduction and reuse. The aim of the policy is to make provision of such code the norm for all research outputs and services by EMBL.

Computer code is also used in a large number of other ways (e.g., powering the operating system of a piece of hardware, office applications, etc.), and therefore this policy singles out four specific categories of software for which EMBL has an expectation of openness. Other categories of software, such as those used for engineering and business reasons (e.g. Microsoft Office, Oracle, Google) are not in the scope of this policy. In other words, the policy is concerned with original or custom-written code with scientific purpose, not with widely available (incl. commercial) software.

Categorisation

The four categories are:

1. Scientific results. Code that is needed to reproduce contents of a scientific publication, such as results from a data analysis, or the demonstration of a method’s performance.
2. Services. Software that supports EMBL services or the data generation at EMBL facilities.
3. Methods. Broad purpose tools and methods, sometimes also known as packages or libraries. In practice, this applies if you intend to have users besides yourself, and certainly if your tool is the subject of or accompanies a methods publication.
4. Training. Teaching code that does not fall under 3, e.g., exercises/labs used in teaching courses, or exemplary client-side “workflows” for using EMBL-EBI services.

The different categories are associated with different underlying expectations, which centre around the following concepts:
- scientific quality of results, technical quality of services
- transparency, reproducibility, traceability of results and services
- reusability and interoperability of tools and services
- sustainability of a tool or a service

Expectation

Software category | Scientific / technical quality | Transparency & reproducibility | Sustainability | Reusability and interoperability

An underlying principle for these guidelines is to be ambitious and encourage best practices for any on-going and future code development, but to be lenient towards legacy codebases. However, it is also clear that legacy projects will increasingly need to explain themselves where they do not meet best practices and measure up against the more progressive competition.

Every EMBL employee who writes or maintains (as defined below) software is responsible for deciding which of their code falls within one of these categories, and what expectations need to be fulfilled, and should be able to justify that decision to their group / team leader and their peers. Further, each employee is responsible for fulfilling the expectations of quality, transparency, sustainability and reusability in accordance with the relevant categorisation, so as to support handover of responsibilities for software between employees and to reflect positively on EMBL’s open science culture.

General guidance

Maintainer role. Each piece of code should have a maintainer, typically a single individual identified at an appropriate place in the code (e.g., file header, README file of a repository). Initially, this is its author, but an author can pass on the maintainer role to someone else, e.g., if they leave EMBL or for other business reasons. Code that has been abandoned by its previous maintainer automatically passes to that maintainer’s group or team leader (GTL) for maintenance. Where suitable team structures exist, the maintainer role can also be held by a team.

Version Control. Each piece of code should be managed using a version control system that is professionally administered and backed up. By doing so, authors and maintainers should pursue the following objectives: avoiding unintended loss of code, ability to backtrack on edits, tracking of origin and assignment of credit. In 2021, viable examples include EMBL’s GitLab and GitHub.

License. Each piece of code should be covered by an open source license. Possible choices are explained on the website of the Open Source Initiative, https://opensource.org/licenses. It is recommended to use one of those described as popular, widely used, or having strong communities, and one that encourages reuse (‘permissive’, not copyleft). Examples of licenses that fulfil these criteria are the MIT license and the Apache 2.0 license. For EMBL’s GitLab and GitHub, the recommended practice is to clearly communicate the license that covers the software by using a LICENSE file at repository level.

Copyright. Under staff rule 1.4.02, intellectual property and software copyrights are vested or assigned to EMBL. It is recommended to include a copyright statement alongside the license, at repository level, indicating that copyright on all code is retained by EMBL.

Further guidance for code that should be reproducible

Provision. Software of this category is released to become part of the scientific record (similar to journal publications etc.) and as such has no foreseen finite lifetime. There is no expectation that the software is actively maintained (e.g., debugged, extended or ported) beyond the publication date. Rather, its maintainer (typically a co-author of the associated scientific publication) should make sure at the time of the final revision of the paper that the code can be run by an interested third party, using a specified, “frozen” computational environment (see below), to reproduce the results (incl. figures, tables) in the paper. There is no expectation that the code should run in other (e.g., future) environments.

Use a repository. Software of this category entails scripts and workflows for data cleaning, data transformation and visualization, statistical modelling, machine learning and other data refinement processes. All source code files should be managed in a version control system that must be public upon publication. Make use of software release tags to link a specific version of the code to the associated scientific publication.

Documentation. Documentation should be added to describe the data refinement process. It is recommended to use markdown languages to mix narrative text with code. Where possible, analytical results should be linked to Figures and Tables of the associated scientific publication. In 2021, viable examples are:

R Markdown documents
Jupyter Notebooks
GitHub Markdown documents

Analytic reproducibility. Some statistical methods and probabilistic algorithms require the generation of random numbers. In such cases, use random number generator seeds for reproducibility. If the workflow involves intermediate results that are costly to reproduce or require non-generic hardware (e.g., GPUs, exceptionally large RAM), provide these where possible so that reproduction can start from there onwards.
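As a minimal sketch of the seeding advice above, assuming a Python analysis that needs only the standard library: the helper name bootstrap_mean, the example data and the seed value are all hypothetical and chosen purely for illustration.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=1000, seed=20210101):
    """Bootstrap estimate of the mean using a fixed seed (hypothetical helper)."""
    # A local generator keeps the seed explicit and avoids global random state.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(values) for _ in values]
        means.append(statistics.fmean(resample))
    return statistics.fmean(means)


data = [2.1, 3.4, 1.9, 5.0, 4.2]

# Two runs with the same seed produce identical results.
first_run = bootstrap_mean(data)
second_run = bootstrap_mean(data)
assert first_run == second_run
```

Recording the seed next to the result (for example in the analysis notebook) lets a third party rerun exactly the same resampling.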

Tool and data dependencies. Be explicit about external code and data dependencies and, whenever possible, use tools, data sets and application programming interfaces (APIs) that are versioned. Data sets generated as part of the study should be referenced using their accession number. Maintainers are advised to use machine-readable dependency files, which can be programming language specific (e.g., Python requirements files) or related to the package management system (e.g., conda environment files). For data sets lacking an accession or version number, document when and how these data sources were accessed and used, or consider mirroring and freezing the dependencies to the extent legally allowed and practical.
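For a data set without an accession or version number, one possible way to document the time and manner of access is to store a small provenance record alongside the mirrored file. The sketch below is an illustration only: the function provenance_record, the field names, the file counts.tsv and the example.org URL are hypothetical and not part of the guidelines.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def provenance_record(path, source_url, chunk_size=1 << 20):
    """Describe a local copy of an unversioned data set (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "source_url": source_url,
        "file_name": Path(path).name,
        "sha256": digest.hexdigest(),
        "accessed_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


# Demonstration with a throwaway file; the URL is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "counts.tsv"
    demo_file.write_bytes(b"gene\tcount\nA\t1\n")
    record = provenance_record(demo_file, "https://example.org/counts.tsv")

print(json.dumps(record, indent=2))
```

The checksum lets a later reader verify that a mirrored copy is byte-identical to the file that was actually analysed.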

Computational environment. In simple cases, the required computational environment can be reconstituted by an interested third party from an informal description of the software versions used.

For more complex cases, machine-readable and automatable methods for describing and reconstituting the environment are recommended. For instance, R offers the sessioninfo package. More generally, container technologies such as Docker or Singularity can be used to create a “frozen” computational environment. For transparency, maintainers are advised to include a document in the repository that explains how the container was built (e.g., a Dockerfile or Singularity definition file). For Category 1, containers can be monolithic with all the required tools and data sources included or mounted because reusability is not the primary concern. Alternatively, workflow management systems (see below) can be employed to ensure reproducibility and a greater level of reusability.
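For Python-based analyses, a rough analogue of R's session information can be produced with the standard library alone. The following is a sketch under that assumption; the function name environment_snapshot and the chosen fields are hypothetical, and a container or conda environment file would normally capture more (e.g., system libraries).

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, platform and installed package versions."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installations with incomplete metadata
            packages.add(f"{name}=={dist.version}")
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": sorted(packages),
    }


snapshot = environment_snapshot()
print(json.dumps(snapshot, indent=2))
```

Committing the printed JSON next to the analysis code gives a machine-readable description of the environment that was used at publication time.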

It is evident that full third-party reproducibility of computations is a moving target, contingent on available tools, skills and training, and specific complexities of individual projects. Therefore it should be regarded as an objective motivated by EMBL’s open science culture, towards which trade-offs and compromises may need to be made for the time being on a case-by-case basis based on what is logistically and economically feasible.

Workflows. Scientific workflow management systems can be used to facilitate reproducibility, scalability and re-use of analytical pipelines. Maintainers are advised to share their tool and workflow definitions in the code repository and public workflow registries (e.g., Dockstore or WorkflowHub). In 2021, commonly used workflow management systems are:

Further guidance for code that should be sustainable

Provision. Software of this category is intended to live beyond its original immediate purpose and be actively maintained for a number of years. This may apply to code that supports EMBL services, tools and libraries. Software that is both sustainable and reusable (see below) should be provisioned in a manner that actively supports its users, for example by updating the software to fix bugs, add documentation, or extend or port the software to the most commonly used environments.

Documentation. Software should be accompanied with sufficient documentation to enable transfer of ownership between maintainers. Documentation should define the user scenarios that are in scope for the software to support, now and in future, and define any major architectural and design decisions. Maintainers should ensure documentation is sufficient to support new developers working on the software in future. Maintainers should also ensure adequate provision is in place to address concerns raised by users of the software relating to insufficient documentation that hampers reuse.

Development practices. Establish shared practices for development, covering approaches to code style, continuous integration, and testing, to ensure code is readable and can be maintained effectively by new developers in the future.

Further guidance for code that should be reusable

Provision. Software of this category is intended to support usage beyond its original immediate purpose within EMBL (for example, beyond the life of the publication). As software of this type is expected to be exploited within and beyond EMBL, it should be provisioned in a manner that ensures it is actively maintained for ongoing reuse for its original purpose, but not necessarily extended or ported to a wide variety of environments. Updates may include bug fixes, but not extension to new use cases or analysis scenarios. The maintainer should ensure measures are in place to address bugs or concerns that hamper reuse (e.g., a lack of adequate documentation). There is no expectation as to the diversity of environments in which the code should run, although the expected behaviour (e.g., which environments are supported) and the chosen development model (see below) should be clearly communicated with accompanying documentation.

Organization. To organize your software, structure it into one or several modular packages, each of which has a distinct, well-defined scope, a limited set of dependencies, and uses only a minimal number of languages. For more complex sets of tools or workflows, consider splitting them into multiple packages. The objective of this rule is to increase reusability and to simplify lifecycle management (see below).

Development Model. While these guidelines mandate open source release of code, many different development models exist, and EMBL employees need to choose the development model that best suits their scientific or technical goals and the spirit of the open source policy. The term development model refers to questions such as:

whether and how others (e.g., users) can get involved in the development and maintenance of the code,
whether unreleased, under-development versions of the code are managed in a public or private repository,
whether the code is a standalone application or interacts with a bigger project (by being in the form of a package, plug-in, etc.) and, if so, how attribution is managed.

Use a repository/community. While it is possible to satisfy the following requirements ‘from scratch’ as an individual researcher or small research group, many are facilitated, partially automated and turned into a more pleasant experience by open source developer communities and/or package distribution platforms, and it is recommended to make use of them. Here are some examples:

For R: Bioconductor (or, with lower level of service and functionality, CRAN)
For Python: Python Package Index (PyPI)
For Conda packages: Bioconda
For trained machine learning models for genomics: Kipoi
For workflow definitions: Dockstore, WorkflowHub or Galaxy (depending on the workflow system)
For all languages: GitHub

Versioning. Provide versioned releases of your package, and archive previous releases so that users can still access them. In particular, do not silently change code ‘under the hood’. Clear and unambiguous versioning is essential for reproducibility and for support or debugging.

Dependencies. Be clear on which platform your software is supposed to run and be tested (e.g., operating system, required language compilers or interpreters, system libraries). Some package managers (e.g., CRAN/Bioconductor) distinguish between dependencies that are absolutely necessary and ones that are “nice to have”.
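The distinction between necessary and “nice to have” dependencies can also be handled inside the code. The sketch below assumes a Python package in which NumPy is treated as an optional accelerator; that choice, and the names optional_import and column_mean, are hypothetical and serve only as illustration.

```python
import importlib
import statistics


def optional_import(module_name):
    """Return the module if it is installed, otherwise None."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Assumption for illustration: numpy is "nice to have", not required.
_np = optional_import("numpy")


def column_mean(values):
    """Mean of a sequence; uses numpy when available, else the standard library."""
    if _np is not None:
        return float(_np.mean(values))
    return statistics.fmean(values)


print(column_mean([1.0, 2.0, 3.0]))
```

With this pattern the package still installs and runs without the optional dependency, and the documentation can state which features benefit from it.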

Ideally use the latest versions of dependencies, i.e., of other software packages that your code uses, and of input datasets. This may also mean upgrading your package whenever some upstream dependency changes (or goes away).

If your package has significant dependencies, it is strongly recommended to use a package distribution/manager platform, as this makes the installation process easier on users.

Documentation. Software should be accompanied with sufficient documentation to ensure reuse. The comprehensiveness of documentation required varies in accordance with the type of software and the repository/community chosen for distribution, and the expertise of users. Maintainers should therefore ensure adequate provision is in place to address concerns raised by users of the software relating to insufficient documentation that hampers reuse.

Unit tests, integration tests, end-to-end function tests. Implement these for your package and its components. These can help make sure that changes to code in your package, or in one of the dependencies, do not break existing functionality.
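A unit test can be very small. The following sketch uses Python's built-in unittest module; the function under test, gc_content, is a hypothetical example and not taken from any EMBL package.

```python
import unittest


def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence (hypothetical function under test)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class GcContentTest(unittest.TestCase):
    def test_known_value(self):
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_case_insensitive(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")


# Run the tests programmatically so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GcContentTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real package the tests would usually live in a separate directory and be executed automatically on every change.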

Life cycle. Developers may choose to openly expose, in addition to the releases, their on-going development of the code (e.g., via their public GitHub/GitLab repository), but it should be made clear to users that this is then an unstable “development version” and that supported use is always limited to the most recent release version. The maintainer will typically refuse to support users of older versions or of the development version. Doing so significantly reduces the effort needed for maintenance.

A maintainer may decide to stop supporting the software, e.g., because the science has moved on, because better alternatives are available, or because maintenance becomes unsustainable. In such a case, they should not switch it off suddenly, but instead announce their intention, state that the software is “deprecated”, provide a timeline for its end of life and ideally a recommendation for an alternative solution. The aim of this process is to give users enough time to react. Once the deprecation period is finished, the maintainer should make sure that this is stated on the repositories where users would expect to find it.
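Within the code itself, a deprecation announcement can be made visible to users at call time. The sketch below uses Python's standard warnings module; the decorator name deprecated, the end-of-life date and the function names normalise_counts and normalise_counts_v2 are hypothetical placeholders.

```python
import functools
import warnings


def deprecated(end_of_life, alternative):
    """Mark a function as deprecated, stating a timeline and a replacement."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed after "
                f"{end_of_life}; consider using {alternative} instead.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Hypothetical date and replacement name, for illustration only.
@deprecated(end_of_life="2022-12-31", alternative="normalise_counts_v2")
def normalise_counts(counts):
    total = sum(counts)
    return [c / total for c in counts]
```

The function keeps working during the deprecation period, so users have time to migrate before the announced end of life.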

Alternatively, maintainers may announce their intention to “orphan” the software in the hope that someone else may adopt it.

In general, maintainership of software rests with the individual EMBL employees or teams, as described above, and there is no institutional commitment. There are processes at EMBL(-EBI) for elevating research projects to institutionally supported resources, which may also be applied to relevant pieces of software.

Forum not email. Maintainers are advised to use a specialized, formal system for receiving bug reports and feature requests, such as issue trackers and pull requests, and to use a discussion forum (such as support.bioconductor.org or GitHub Discussions) for user support. They should avoid doing any of this manually by email, as it is not scalable and can easily drain all energy.

Continuous Integration. If a package has many dependencies, or is itself used by other packages (by the same maintainer or other people), it is advised to use a continuous integration framework, such as those provided by GitHub Actions, Bioconductor, CRAN, etc.

The above guidelines from EMBL’s Open Science Policy Implementation Guidelines are written by Isabel Bento, Tony Burdett, W. Huber, Tobias Rausch.