
Zaugg Group

Systems (epi)genetics to study the basis of complex traits and diseases

Resources: Predictive features of gene expression variation reveal mechanistic link with differential expression (Mol. Syst. Biol. 2020)

For most biological processes, organisms must respond to extrinsic cues, while maintaining essential gene expression programmes. Although studied extensively in single cells, it is still unclear how variation is controlled in multicellular organisms. Here, we used a machine‐learning approach to identify genomic features that are predictive of genes with high versus low variation in their expression across individuals, using bulk data to remove stochastic cell‐to‐cell variation. Using embryonic gene expression across 75 Drosophila isogenic lines, we identify features predictive of expression variation (controlling for expression level), many of which are promoter‐related. Genes with low variation fall into two classes reflecting different mechanisms to maintain robust expression, while genes with high variation seem to lack both types of stabilizing mechanisms. Applying this framework to humans revealed similar predictive features, indicating that promoter architecture is an ancient mechanism to control expression variation. Remarkably, expression variation features could also partially predict differential expression after diverse perturbations in both Drosophila and humans. Differential gene expression signatures may therefore be partially explained by genetically encoded gene‐specific features, unrelated to the studied treatment.

Code is available here

Gene-specific feature table for Drosophila (01_master_table_final_fly_EV02.csv)
Collection of all features per gene in Drosophila. Feature names are explained in Feature details Drosophila (02_features_info_fly_EV01.csv).

  • gene_name (gene symbol)
  • gene_id (Flybase ID v6.13)
  • {Feature name} (before feature selection, see 02_features_info_fly_EV01.csv)
  • median (median expression level)
  • residents_cv (expression variation at 10-12h)
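As a minimal sketch of how this table could be used, the snippet below separates the per-gene feature columns from the two predicted variables. It assumes the column names listed above and that every other column is a feature. The toy rows, gene IDs and the feature names feat_x and feat_y are made up for illustration and are not taken from the real file.

```python
import pandas as pd

ID_COLS = ["gene_name", "gene_id"]
TARGET_COLS = ["median", "residents_cv"]  # expression level and variation


def split_master_table(table: pd.DataFrame):
    """Split the master table into a feature matrix and the two target columns."""
    feature_cols = [c for c in table.columns if c not in ID_COLS + TARGET_COLS]
    indexed = table.set_index("gene_id")
    return indexed[feature_cols], indexed[TARGET_COLS]


# Toy rows with made-up values and hypothetical feature names. With the real
# file one would instead start from:
#   table = pd.read_csv("01_master_table_final_fly_EV02.csv")
toy = pd.DataFrame({
    "gene_name": ["geneA", "geneB", "geneC"],
    "gene_id": ["FBgn_toy_1", "FBgn_toy_2", "FBgn_toy_3"],
    "feat_x": [1, 0, 1],
    "feat_y": [0.2, 0.5, 0.9],
    "median": [5.1, 7.3, 2.4],
    "residents_cv": [-0.3, 0.1, 0.8],
})

features, targets = split_master_table(toy)
print(features.shape, targets.shape)
```

Indexing both frames by gene_id keeps features and targets aligned when rows are filtered or reordered later.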

Feature details Drosophila (02_features_info_fly_EV01.csv). Features used to predict expression level and variation in Drosophila.

Important features Drosophila (03_important_features_fly_EV04.csv). Feature importance scores (from Boruta) and correlations with predicted variables. Only features important in at least one prediction are included. NA indicates a feature that was non-significant in the corresponding prediction. Columns:

  • Feature name (feature name as used in the master table)
  • Full name (full feature name)
  • Class (feature class)
  • med_imp_var (median feature importance for predicting expression variation)
  • med_imp_med (median feature importance for predicting median expression level)
  • cor_var (feature correlation with expression variation)
  • cor_med (feature correlation with median expression level)
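Because NA marks a feature that was not significant in a given prediction, the table can be filtered for features that matter for expression variation but not for expression level. The sketch below assumes the column names listed above are the literal CSV headers. The toy values and feature names are made up.

```python
import numpy as np
import pandas as pd


def variation_specific(imp: pd.DataFrame) -> pd.DataFrame:
    """Features important for variation but non-significant (NA) for level."""
    mask = imp["med_imp_var"].notna() & imp["med_imp_med"].isna()
    return imp.loc[mask].sort_values("med_imp_var", ascending=False)


# Toy table with made-up importance scores and hypothetical feature names.
# With the real file: imp = pd.read_csv("03_important_features_fly_EV04.csv")
toy_imp = pd.DataFrame({
    "Feature name": ["feat_x", "feat_y", "feat_z", "feat_w"],
    "med_imp_var": [4.0, 9.5, np.nan, 12.0],
    "med_imp_med": [np.nan, 3.2, 6.1, np.nan],
})

var_only = variation_specific(toy_imp)
print(var_only["Feature name"].tolist())
```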

Gene-specific feature table human (05_aggregted_expression_human_EV17.csv). Collection of gene-specific features for human genes (gene_id) as a comma-separated file. Feature details are explained in 06_Feature_details_human_EV11.csv.

Tissue-specific expression level and variation human genes (04_all_variations_human_EV10.csv). Gene- and tissue-specific expression data and several gene annotations for human genes. Column names for tissue-specific expression contain the tissue name. NA indicates that a gene is not expressed (or did not pass filtering criteria) in the corresponding tissue.

  • Gene_name (gene symbol)
  • Gene_id (Ensembl gene ID)
  • {Tissue_name}_mean (mean gene expression level (log-transformed) across individuals in the corresponding tissue)
  • {Tissue_name}_median (median gene expression level (log-transformed) across individuals)
  • {Tissue_name}_cv (coefficient of variation across individuals)
  • {Tissue_name}_recid_cv (expression variation across individuals (final measure of variation, adjusted for median dependence))
  • DE_Prior_Rank (Differential expression prior from (Crow et al. 2019))
  • GWAS_Upstream_gene_id (EBI GWAS catalog genes upstream of GWAS hits)
  • GWAS_Downstream_gene_id (EBI GWAS catalog genes downstream of GWAS hits)
  • CEGv2_subset (essential genes)
  • Drug_targets_nelson and FDA_approved_drug_targets (drug targets)
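Since each tissue contributes four columns, it can be convenient to reshape the table to one row per gene and tissue. The sketch below assumes the {Tissue_name}_suffix naming described above and that no annotation column ends in one of the four suffixes. The tissue names, gene IDs and values in the toy table are made up.

```python
import pandas as pd

# Longest suffix first, because "_recid_cv" also ends in "_cv".
SUFFIXES = ["_recid_cv", "_median", "_mean", "_cv"]


def split_column(col):
    """Return (tissue, measure) for a tissue-specific column, else None."""
    for suffix in SUFFIXES:
        if col.endswith(suffix) and len(col) > len(suffix):
            return col[: -len(suffix)], suffix[1:]
    return None


def to_long(table: pd.DataFrame) -> pd.DataFrame:
    """Reshape tissue-specific columns to one row per gene and tissue."""
    parsed = {c: split_column(c) for c in table.columns}
    tissue_cols = [c for c, p in parsed.items() if p is not None]
    long = table.melt(id_vars=["Gene_id"], value_vars=tissue_cols,
                      var_name="column", value_name="value")
    long["tissue"] = long["column"].map(lambda c: parsed[c][0])
    long["measure"] = long["column"].map(lambda c: parsed[c][1])
    wide = long.pivot(index=["Gene_id", "tissue"], columns="measure",
                      values="value")
    # Drop gene/tissue pairs where every measure is NA (gene not expressed
    # or filtered out in that tissue).
    return wide.dropna(how="all").reset_index()


# Toy table with made-up values; tissue names and gene IDs are hypothetical.
toy_human = pd.DataFrame({
    "Gene_name": ["toyA", "toyB"],
    "Gene_id": ["ENSG_toy_1", "ENSG_toy_2"],
    "Lung_mean": [3.0, None],
    "Lung_median": [2.9, None],
    "Lung_cv": [0.4, None],
    "Lung_recid_cv": [0.1, None],
    "Adipose_Tissue_mean": [1.0, 2.0],
    "Adipose_Tissue_median": [1.1, 2.1],
    "Adipose_Tissue_cv": [0.3, 0.6],
    "Adipose_Tissue_recid_cv": [-0.2, 0.5],
    "DE_Prior_Rank": [0.5, 0.7],
})

long_table = to_long(toy_human)
print(long_table)
```

Matching on the suffix rather than splitting on the first underscore keeps tissue names that themselves contain underscores intact.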

Feature details human (06_Feature_details_human_EV11.csv). Explanation of all features used to predict expression level and variation in human.

  • Feature name (feature name as used in the master table) 
  • Feature class (e.g. transcription factors or chromatin states)

Important features human (07_important_features_human_EV18.csv). Median feature importance scores (from Boruta) and correlations with predicted variables (median_importance and correlation_with_responce columns). Feature importance and correlations are reported for aggregated expression level and variation (response column).
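A minimal sketch of ranking features separately for each predicted variable is shown below. Only the median_importance and response column names come from the description above. The feature column name, the response labels "level" and "variation", and all values are assumptions made for illustration.

```python
import pandas as pd


def top_features(imp: pd.DataFrame, n: int = 2) -> pd.DataFrame:
    """Top-n rows by median_importance within each value of the response column."""
    ranked = imp.sort_values("median_importance", ascending=False)
    return ranked.groupby("response").head(n)


# Toy table: the "feature" column name, the response labels and the scores
# are hypothetical. With the real file:
#   imp = pd.read_csv("07_important_features_human_EV18.csv")
toy_human_imp = pd.DataFrame({
    "feature": ["a", "b", "c", "d", "e"],
    "response": ["level", "level", "level", "variation", "variation"],
    "median_importance": [5.0, 8.0, 1.0, 3.0, 9.0],
})

top = top_features(toy_human_imp, n=2)
print(top)
```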
