Open Data and Data Management – Open Science at EMBL

The essential requirements for data at EMBL are:

Have a Data Management Plan (DMP) for each project
Publish the data behind publications as open data

Open data sharing is at the heart of academic research and an essential prerequisite for the transparency, reproducibility and trustworthiness of research results. It is also in the interest of research funders, including society at large, which funds research through taxes, because it stimulates new research and development through data re-evaluation and reuse. An important precondition for data sharing is good data management. There should be an uninterrupted and auditable chain of information from reagents to raw data to publication. The aim is that anybody should be able to follow this chain, not just the people who were involved in the work.

There is no single definition of how much data is enough to ensure experimental reproducibility; this will vary by experiment type and domain. These data guidelines are intended to provide practical guidance to EMBL researchers on how to achieve the objectives of the EMBL Open Science policy with respect to data management. They are organised as a step-by-step guide to taking the appropriate action at each stage, starting from the generation of data to its deposition in a trusted public repository.

Data Management Plans (DMPs)

Starting a new project

Establishing a chain of information long after the experiments have been concluded is lengthy and error-prone, and experience shows that critical information is lost when people leave EMBL.

Therefore, before a new project is started, you should take some time to establish a Data Management Plan (DMP) that considers the lifecycle of your data, following this Data Management Checklist. The steps described below should then be followed from the beginning of your project as a practical implementation of your DMP. One goal of this implementation should be to capture your data and metadata in a machine-readable form as early as possible after the data is generated.

Writing a Data Management Plan

EMBL requires that all projects have a data management plan. This includes any work that is funded by grants, is part of a PhD or postdoctoral project, is intended to support a scientific article, or is part of a collaborative effort. By default, EMBL researchers should use this EMBL Data Management Plan Template.

Please also check with your funder or project partners for possible additional requirements, while ensuring that EMBL’s data management requirements are still satisfied.

Exploratory work is not expected to have a DMP. Nevertheless, the data still has to be managed. If it becomes evident that the work will lead to something beyond exploration, well-managed data provides a strong foundation for future work, so please apply the following guidelines from the earliest stage.

Register and document your data

To ensure the production of the best reusable open data (FAIR), you should adhere to best practices of data production and processing EARLY in experimental design, ideally at the moment of production, and NOT at the point of publishing a research article (see section on DMPs). This includes the following:

Ensure the rules of engagement are clear with collaborators, i.e. that EMBL expects FAIR, whole datasets.
When you use EMBL facilities, it is the responsibility of the group leader to ensure that the data produced is managed in line with the FAIR principles, in particular with regard to metadata. Use existing solutions to register datasets. Ask the facility if they can help you with this.
For data produced by lab-based experiments, all data should be electronically recorded at the moment of production in a machine-readable format.
Data produced externally and brought to EMBL should also be managed in the same way as data produced inside the lab.
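
The recommendation above to record data electronically, in a machine-readable format, at the moment of production can be illustrated with a small Python sketch. It is only an illustration under stated assumptions, not an EMBL tool or a feature of STOCKS or the DMA: the function name `write_sidecar`, the `.meta.json` suffix and the metadata keys are all hypothetical, and the fields your project actually needs should come from your DMP and your community's metadata standards.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(data_path, metadata):
    """Write a JSON metadata file next to a data file and return its path.

    Hypothetical helper: the sidecar naming scheme and the field names
    are assumptions made for this example only.
    """
    data_path = Path(data_path)
    # A checksum lets anyone verify later that the file is unchanged.
    checksum = hashlib.sha256(data_path.read_bytes()).hexdigest()
    record = {
        "file": data_path.name,
        "sha256": checksum,
        "size_bytes": data_path.stat().st_size,
        # Time at which the record was written, in UTC.
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        # Free-form descriptive fields supplied by the caller.
        **metadata,
    }
    sidecar = data_path.with_name(data_path.name + ".meta.json")
    sidecar.write_text(
        json.dumps(record, indent=2, sort_keys=True), encoding="utf-8"
    )
    return sidecar
```

Calling `write_sidecar("plate_001.csv", {"instrument": "...", "operator": "..."})` right after acquisition would leave a `plate_001.csv.meta.json` file beside the data. Plain JSON is used here because it is an open, machine-readable format; a registered entry in one of the tools listed below serves the same purpose and is preferable where available.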

We expect that, as data management services are developed and aligned with the Data Science Theme, better practices will be encouraged and become easier to implement. The following tools can already support you in effectively registering, managing and documenting your data:

EMBL STOCKS (Electronic Lab Notebook and Data Management)
EMBL Data Management Application (DMA)
Electronic Lab Notebooks (ELN), e.g. eLabJournal (for example, with links to datasets in the DMA)

If you cannot use any of these tools, please find here some recommendations on how to manage your data effectively.

Releasing your data as open data

As stated in the policy, EMBL expects, as a minimum, all data behind research articles to be made public and to adhere to the FAIR principles.

1. EMBL expects the whole dataset to be published, i.e. not only the “positive” results or the data discussed in the publication. The scope of the dataset should have been previously defined in the Data Management Plan.
2. What do FAIR and open data mean in practice for EMBL?
   a. F(indable): the data is in a trusted community repository (see 3., below).
   b. A(ccessible) and I(nteroperable): use open standards for both data and metadata, including permanent data identifiers such as DOIs and accession numbers.
   c. R(eusable): provide relevant documentation of raw data and metadata in standard formats. The appropriate open data license will be offered by the community data repository.
3. Data should be deposited in the appropriate trusted community databases as a top priority.
   a. Recommended databases are listed here. For image data, use the BioImage Archive.
   b. Controlled-access data should be deposited in the EGA (or a federated EGA database). Such data is considered open even though it can only be accessed after approval by a Data Access Committee.
   c. If data does not fit in a recommended community database, e.g. because no data-specific community database exists, please use BioStudies to store and/or refer to the data (see e.g. https://www.ebi.ac.uk/biostudies/studies/S-BSST479).
   d. Use community-accepted open standard formats for data and metadata. Don’t use PDFs for data.
   e. In a research article, always include a Data Availability Statement with links/accessions/DOIs specifying the data. Please note that “data on request” does not qualify as such a statement.
   f. Link from figures to data specifically when the journal allows.
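
The requirement above that a Data Availability Statement must point to the data, and that “data on request” does not qualify, can be pre-checked with a rough heuristic before submission. The Python sketch below is an assumption-laden illustration, not an official EMBL check: the function name `check_statement` and the regular expressions are hypothetical, and the sketch recognises only DOIs and web links. Bare accession numbers are not detected because their formats differ between repositories, so a human still has to review the result.

```python
import re

# Heuristic patterns (assumptions for this sketch, not a validation standard).
DOI = re.compile(r"\b10\.\d{4,9}/\S+")
URL = re.compile(r"https?://\S+")
ON_REQUEST = re.compile(
    r"\b(?:up)?on\s+(?:reasonable\s+)?request\b", re.IGNORECASE
)


def check_statement(text):
    """Return a list of problems found in a Data Availability Statement."""
    problems = []
    # The statement should point at the data with a DOI or a link.
    if not (DOI.search(text) or URL.search(text)):
        problems.append("no DOI or link found")
    # "Data on request" wording does not qualify as a statement.
    if ON_REQUEST.search(text):
        problems.append('"data on request" wording found')
    return problems


if __name__ == "__main__":
    statement = (
        "The data are available in BioStudies at "
        "https://www.ebi.ac.uk/biostudies/studies/S-BSST479."
    )
    print(check_statement(statement))
```

An empty list means only that the heuristic found nothing to flag; it does not confirm that the linked record is complete or that it covers the whole dataset.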

This guideline was written by R. Lueck, J. Klemeier, J. Marquez, J. McEntyre, U. Sarkans, J.-K. Hériché, A. Kreshuk for the EMBL Open Science Implementation Guidelines.

For additional support in data analysis, data management, and data services, please contact EMBL’s Data Science Centre.