This package enables the interpretation and analysis of results from a gene set enrichment analysis using network-based and text-mining approaches. Most enrichment analyses result in large lists of significant gene sets that are difficult to interpret. Tools in this package help build a similarity-based network of significant gene sets from a gene set enrichment analysis that can then be investigated for their biological function using text-mining approaches.
This package implements the vissE algorithm to summarise results of gene-set analyses. Usually, the results of a gene-set enrichment analysis (e.g. using limma::fry, singscore or GSEA) consist of a long list of gene-sets. Biologists then have to search through these lists to determine emerging themes that explain the altered biological processes. This task can be labour intensive, so we need solutions that summarise large sets of results from such analyses.
vissE summarises the results of gene-set enrichment analyses. It exploits the relatedness between gene-sets and the inherent hierarchical structure that may exist in pathway databases and gene ontologies to cluster results. For each cluster of gene-sets it identifies, vissE performs text-mining to automate the characterisation of the biological functions and processes represented by the cluster.
An additional power of vissE is to perform a novel type of gene-set enrichment analysis based on the network of similarity between gene-sets. Given a list of genes (e.g. from a DE analysis), vissE can characterise said list by first identifying all other gene-sets that are similar to it, following up with clustering the resulting gene-sets and finally performing text-mining to reveal emerging themes.
In addition to these analyses, it provides visualisations to assist users in understanding the results of their experiment. This document will demonstrate these functions across the two use-cases. The vissE package can be installed from Bioconductor using BiocManager::install("vissE").
Often, the result of a gene-set enrichment analysis (be it an over-representation analysis or a functional class scoring approach) is a list of gene-sets accompanied by their statistics and p-values or false discovery rates (FDR). Biologists mostly scan through these results and extract themes relevant to the experiment of interest. The approach here, vissE, allows automated extraction of themes.
The example below can be used with the results of any enrichment analysis. The data below is simulated to demonstrate the workflow.
library(msigdb)
library(GSEABase)

#load the MSigDB from the msigdb package
msigdb_hs = getMsigdb()
#append KEGG gene-sets - uncomment to run
# msigdb_hs = appendKEGG(msigdb_hs)
#select h, c2, and c5 collections (recommended)
msigdb_hs = subsetCollection(msigdb_hs, c('h', 'c2', 'c5'))

#randomly sample gene-sets to simulate the results of an enrichment analysis
set.seed(360)
geneset_res = sample(sapply(msigdb_hs, setName), 2500)

#create a GeneSetCollection using the gene-set analysis results
geneset_gsc = msigdb_hs[geneset_res]
geneset_gsc
#> GeneSetCollection
#>   names: GOMF_GLUCOSE_6-PHOSPHATE:INORGANIC_PHOSPHATE_ANTIPORTER_ACTIVITY, GOCC_MULTIVESICULAR_BODY,_INTERNAL_VESICLE, ..., VALK_AML_WITH_11Q23_REARRANGED (2500 total)
#>   unique identifiers: SLC37A4, SLC37A1, ..., AC138649.1 (19216 total)
#>   types in collection:
#>     geneIdType: SymbolIdentifier (1 total)
#>     collectionType: BroadCollection (1 total)
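A vissE analysis involves 3 steps: computing gene-set overlaps and the resulting overlap network, identifying clusters of related gene-sets, and characterising each cluster (for example, using text-mining).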
The default approach to computing overlaps is using the Jaccard index. Overlap is computed based on the gene overlap between gene-sets. Alternatively, the overlap coefficient can be used. The latter can be used to highlight hierarchical overlaps (such as those present in the gene ontology).
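To make the difference between the two measures concrete, the sketch below computes both on two toy gene-sets in base R. The helper names `jaccard` and `overlap_coef` and the example gene lists are illustrative assumptions, not functions or data from vissE.

```r
# Two toy gene-sets (gene symbols chosen only for illustration);
# gs_b is fully contained in gs_a, mimicking a parent/child ontology term
gs_a <- c("TP53", "MDM2", "CDKN1A", "ATM")
gs_b <- c("TP53", "MDM2")

# Jaccard index: shared genes divided by genes present in either set
jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Overlap coefficient: shared genes divided by the size of the smaller set
overlap_coef <- function(a, b) {
  length(intersect(a, b)) / min(length(unique(a)), length(unique(b)))
}

jaccard(gs_a, gs_b)      # 2 shared / 4 in the union = 0.5
overlap_coef(gs_a, gs_b) # 2 shared / 2 in the smaller set = 1
```

Because a fully nested gene-set scores 1 on the overlap coefficient but only 0.5 on the Jaccard index here, the overlap coefficient tends to emphasise parent/child relationships of the kind found in the gene ontology.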
library(vissE)

#compute gene-set overlap
gs_ovlap = computeMsigOverlap(geneset_gsc, thresh = 0.25)
#create an overlap network
gs_ovnet = computeMsigNetwork(gs_ovlap, msigdb_hs)
#plot the network
set.seed(36) #set seed for reproducible layout
plotMsigNetwork(gs_ovnet)
The overlap network plot above is annotated using the MSigDB category. If gene-set statistics are available, they can be projected onto the network too. Gene-set statistics are passed to the plotting function as a named vector, with gene-set names as the vector names.
#simulate gene-set statistics
geneset_stats = rnorm(2500)
names(geneset_stats) = geneset_res
head(geneset_stats)
#> GOMF_GLUCOSE_6-PHOSPHATE:INORGANIC_PHOSPHATE_ANTIPORTER_ACTIVITY
#>                                                       -0.5191669
#>                       GOCC_MULTIVESICULAR_BODY,_INTERNAL_VESICLE
#>                                                        2.3537577
#>          HP_APLASIA_HYPOPLASIA_OF_THE_ABDOMINAL_WALL_MUSCULATURE
#>                                                        1.4520166
#>              GOBP_REGULATION_OF_ANTIBACTERIAL_PEPTIDE_PRODUCTION
#>                                                        0.5466009
#>       GOBP_POSITIVE_REGULATION_OF_NITRIC-OXIDE_SYNTHASE_ACTIVITY
#>                                                       -0.2060802
#>                                            GOCC_PCNA-P21_COMPLEX
#>                                                       -1.5210543

#plot the network and overlay gene-set statistics
set.seed(36) #set seed for reproducible layout
plotMsigNetwork(gs_ovnet, genesetStat = geneset_stats)
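With real rather than simulated results, the named vector would usually be built from the results table of the enrichment analysis. The sketch below assumes a hypothetical data frame `enrich_res` with columns `geneset`, `statistic` and `fdr`; the actual column names depend on the enrichment method used.

```r
# Hypothetical results table; the column names and values are illustrative only
enrich_res <- data.frame(
  geneset = c("HALLMARK_HYPOXIA", "HALLMARK_APOPTOSIS", "KEGG_CELL_CYCLE"),
  statistic = c(2.1, -0.4, 1.3),
  fdr = c(0.001, 0.60, 0.03),
  stringsAsFactors = FALSE
)

# Keep the significant gene-sets (FDR below 5%)
sig_res <- enrich_res[enrich_res$fdr < 0.05, ]

# Named vector: values are the statistics, names are the gene-set names
real_stats <- setNames(sig_res$statistic, sig_res$geneset)
real_stats
```

The names of the resulting vector should match the gene-set names in the collection used to build the overlap network.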
Related gene-sets likely represent related processes. The next step
is to identify clusters of gene-sets so that they can be assessed for
biological themes. The specific clustering approach can be selected by
the user, though we recommend graph clustering approaches that use the
information provided in the overlap graph. We recommend using the
igraph::cluster_walktrap() algorithm as it works well with
dense graphs. Many other algorithms are implemented in the igraph
package and these can be used instead of the walktrap algorithm.
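As a small illustration of graph clustering with igraph, the sketch below runs the walktrap algorithm on a toy graph and then swaps in a different algorithm. The toy graph is an assumption made for demonstration; it stands in for the gene-set overlap network.

```r
library(igraph)

# Toy graph: two fully connected groups of five nodes joined by a single edge
edges <- rbind(t(combn(1:5, 2)), t(combn(6:10, 2)), c(5, 6))
g <- graph_from_edgelist(edges, directed = FALSE)

# Walktrap clustering, as recommended in the text
cl_walktrap <- cluster_walktrap(g)
membership(cl_walktrap) # cluster label for each node
sizes(cl_walktrap)      # number of nodes in each cluster

# Another igraph community detection function can be used the same way
cl_louvain <- cluster_louvain(g)
sizes(cl_louvain)
```

Both functions take a graph and return a communities object, which is what makes them interchangeable as the clustering function.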
library(igraph)

#identify clusters - order based on cluster size and avg gene-set stats
grps = findMsigClusters(
  gs_ovnet,
  genesetStat = geneset_stats,
  alg = cluster_walktrap,
  minSize = 5
)
#plot the top 6 clusters
set.seed(36) #set seed for reproducible layout
plotMsigNetwork(gs_ovnet, markGroups = grps[1:6], genesetStat = geneset_stats)
Instead of exploring the full network of gene-sets, the subgraph of nodes that form part of the groups can be plotted. This allows for a more focused investigation into the relatedness of clusters identified using vissE.
set.seed(36) #set seed for reproducible layout
plotMsigNetwork(
  gs_ovnet,
  markGroups = grps[1:6],
  genesetStat = geneset_stats,
  rmUnmarkedGroups = TRUE
)
Gene-set clusters can be assessed for their biological similarities using text-mining approaches. Here, we perform a word frequency analysis, weighted by the inverse document frequency, on the gene-set names or their short descriptions to identify recurring biological themes in each cluster. These results are then presented as word clouds.
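The idea behind weighting word frequencies by the inverse document frequency can be sketched in base R as follows. This is a simplified illustration under stated assumptions (toy gene-set names, splitting on underscores, dropping the database prefix) and not vissE's internal implementation; `tokenise` and `count_words` are hypothetical helpers.

```r
# Toy gene-set names: one cluster plus a background collection (all made up)
cluster_names <- c("GOBP_CELL_CYCLE_ARREST", "REACTOME_CELL_CYCLE_MITOTIC",
                   "KEGG_CELL_CYCLE")
all_names <- c(cluster_names, "GOBP_IMMUNE_RESPONSE", "GOBP_LIPID_TRANSPORT",
               "KEGG_APOPTOSIS", "REACTOME_IMMUNE_SYSTEM")

# Split each name into lower-case words and drop the database prefix
tokenise <- function(x) {
  words <- strsplit(tolower(x), "_", fixed = TRUE)
  lapply(words, function(w) unique(w[-1]))
}

# Count how many gene-sets mention each word
count_words <- function(x) {
  tab <- table(unlist(tokenise(x)))
  setNames(as.vector(tab), names(tab))
}

tf <- count_words(cluster_names)     # frequency within the cluster
df <- count_words(all_names)         # frequency across the whole collection
idf <- log(length(all_names) / df)   # rarer words receive a larger weight

# Weighted score per word; high scores suggest a theme for the cluster
tfidf <- sort(tf * idf[names(tf)], decreasing = TRUE)
tfidf
```

In this toy example "cell" and "cycle" score highest because they recur in every gene-set of the cluster; words that are common across the whole collection would be down-weighted by the inverse document frequency.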
#compute and plot the results of text-mining
#using gene-set Names
plotMsigWordcloud(msigdb_hs, grps[1:6], type = 'Name')
#using gene-set Short descriptions
plotMsigWordcloud(msigdb_hs, grps[1:6], type = 'Short')
Gene-level statistics for each gene-set cluster can be visualised to better understand the genes contributing to the significance of gene-sets. Gene-level statistics are passed to the plotting function as a named vector, with gene identifiers as the vector names. A jitter is applied on the x-axis (due to its discrete nature).
library(ggplot2)

#simulate gene statistics
set.seed(36)
genes = unique(unlist(geneIds(geneset_gsc)))
gene_stats = rnorm(length(genes))
names(gene_stats) = genes
head(gene_stats)
#>    SLC37A4    SLC37A1    SLC37A2       CD63       EGFR    LAPTM4B
#>  0.3117314  0.8498291  0.7055331  1.6999284 -1.3455710 -0.5698134

#plot the gene-level statistics
plotGeneStats(gene_stats, msigdb_hs, grps[1:6]) +
  geom_hline(yintercept = 0, colour = 2, lty = 2)
An alternative line of evidence for a common functional role of genes is the set of protein-protein interactions (PPIs) between them. Genes involved in a biological process are likely to interact with each other to achieve the desired function. We can therefore investigate protein-protein interactions within each cluster and thus assess evidence of a common process. In vissE, this can be done by inducing the protein-protein interaction network of all genes in a gene-set cluster. Furthermore, the individual nodes in the network can be mapped onto properties such as the gene-level statistic. Networks can then be filtered based on the gene-level statistic, the confidence value of each interaction and the frequency of each gene in the cluster (i.e., how many gene-sets it belongs to).
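The three filters described above can be sketched on a toy edge list in base R. Everything here is an assumption made for illustration: the data, the thresholds, the use of the absolute value of the statistic, and the object names. It is not how vissE implements the filtering internally.

```r
# Toy PPI edge list; gene names and confidence values are made up
ppi_edges <- data.frame(
  gene_a = c("A", "A", "B", "C"),
  gene_b = c("B", "C", "D", "D"),
  confidence = c(0.9, 0.1, 0.5, 0.3),
  stringsAsFactors = FALSE
)

# Toy gene-level statistics
gene_stat <- c(A = 1.2, B = -0.8, C = 0.05, D = 0.6)

# Toy cluster of gene-sets, used to count how many gene-sets contain each gene
cluster_sets <- list(set1 = c("A", "B", "C"),
                     set2 = c("A", "B", "D"),
                     set3 = c("B", "D"))
freq_tab <- table(unlist(cluster_sets))
gene_freq <- setNames(as.vector(freq_tab), names(freq_tab))

# Keep genes with a large absolute statistic that occur in at least 2 gene-sets
pass_stat <- abs(gene_stat) >= 0.2
pass_freq <- gene_freq[names(gene_stat)] >= 2
keep_genes <- names(gene_stat)[pass_stat & pass_freq]

# Keep confident interactions whose two endpoints both pass the gene filters
ppi_sub <- ppi_edges[
  ppi_edges$confidence >= 0.2 &
    ppi_edges$gene_a %in% keep_genes &
    ppi_edges$gene_b %in% keep_genes,
]
ppi_sub
```

Gene C is removed by both gene-level filters, and the A-C interaction also fails the confidence threshold, so only the A-B and B-D interactions remain.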
We will retrieve the PPI from the msigdb
package. Setting inferred to TRUE allows PPIs inferred across
organisms to be used in the analysis.
#load PPI from the msigdb package
ppi = getIMEX('hs', inferred = TRUE)
#create the PPI plot
set.seed(36)
plotMsigPPI(
  ppi,
  msigdb_hs,
  grps[1:6],
  geneStat = gene_stats,
  threshStatistic = 0.2,
  threshConfidence = 0.2
)
Results of a vissE analysis are best presented and interpreted as paneled plots that combine all of the above plots. This allows for collective interpretation of the gene-set clusters.
library(patchwork)

#create independent plots
set.seed(36) #set seed for reproducible layout
p1 = plotMsigWordcloud(msigdb_hs, grps[1:6], type = 'Name')
p2 = plotMsigNetwork(gs_ovnet, markGroups = grps[1:6], genesetStat = geneset_stats)
p3 = plotGeneStats(gene_stats, msigdb_hs, grps[1:6]) +
  geom_hline(yintercept = 0, colour = 2, lty = 2)
p4 = plotMsigPPI(
  ppi,
  msigdb_hs,
  grps[1:6],
  geneStat = gene_stats,
  threshStatistic = 0.2,
  threshConfidence = 0.2
)

#combine using functions from patchwork (2 columns, 2 rows)
p1 + p2 + p3 + p4 + plot_layout(2, 2)
sessionInfo()
#> R Under development (unstable) (2022-03-10 r81874)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so
#>
#> locale:
#>  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#>  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#>  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#>  LC_PAPER=en_US.UTF-8 LC_NAME=C
#>  LC_ADDRESS=C LC_TELEPHONE=C
#>  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#>  stats4 stats graphics grDevices utils datasets methods
#>  base
#>
#> other attached packages:
#>  patchwork_1.1.1 ggplot2_3.3.5 igraph_1.2.11
#>  vissE_1.3.13 GSEABase_1.57.0 graph_1.73.0
#>  annotate_1.73.0 XML_3.99-0.9 AnnotationDbi_1.57.1
#>  IRanges_2.29.1 S4Vectors_0.33.11 Biobase_2.55.0
#>  BiocGenerics_0.41.2 msigdb_1.3.1
#>
#> loaded via a namespace (and not attached):
#>  AnnotationHub_3.3.9 BiocFileCache_2.3.4
#>  systemfonts_1.0.4 plyr_1.8.6
#>  GenomeInfoDb_1.31.5 digest_0.6.29
#>  htmltools_0.5.2 viridis_0.6.2
#>  fansi_1.0.2 magrittr_2.0.2
#>  memoise_2.0.1 tm_0.7-8
#>  Biostrings_2.63.2 graphlayouts_0.8.0
#>  textshape_1.7.3 pkgdown_22.214.171.12400
#>  colorspace_2.0-3 blob_1.2.2
#>  rappdirs_0.3.3 ggrepel_0.9.1
#>  sylly_0.1-6 textshaping_0.3.6
#>  xfun_0.30 dplyr_1.0.8
#>  crayon_1.5.0 RCurl_1.98-1.6
#>  jsonlite_1.8.0 glue_1.6.2
#>  polyclip_1.10-0 gtable_0.3.0
#>  zlibbioc_1.41.0 XVector_0.35.0
#>  scico_1.3.0 scales_1.1.1
#>  DBI_1.1.2 qdapRegex_0.7.2
#>  Rcpp_126.96.36.199 viridisLite_0.4.0
#>  xtable_1.8-4 bit_4.0.4
#>  textclean_0.9.3 httr_1.4.2
#>  ggwordcloud_0.5.0 RColorBrewer_1.1-2
#>  ellipsis_0.3.2 pkgconfig_2.0.3
#>  farver_2.1.0 sass_0.4.0
#>  dbplyr_2.1.1 utf8_1.2.2
#>  tidyselect_1.1.2 labeling_0.4.2
#>  rlang_1.0.2 reshape2_1.4.4
#>  later_1.3.0 munsell_0.5.0
#>  BiocVersion_3.15.0 tools_4.2.0
#>  cachem_1.0.6 cli_3.2.0
#>  generics_0.1.2 RSQLite_2.2.10
#>  ExperimentHub_2.3.5 evaluate_0.15
#>  stringr_1.4.0 fastmap_1.1.0
#>  yaml_2.3.5 ragg_1.2.2
#>  textstem_0.1.4 org.Hs.eg.db_3.14.0
#>  knitr_1.37 bit64_4.0.5
#>  fs_1.5.2 tidygraph_1.2.0
#>  purrr_0.3.4 KEGGREST_1.35.0
#>  ggraph_2.0.5 koRpus_0.13-8
#>  mime_0.12 slam_0.1-50
#>  xml2_1.3.3 BiocStyle_2.23.1
#>  compiler_4.2.0 filelock_1.0.2
#>  curl_4.3.2 png_0.1-7
#>  interactiveDisplayBase_1.33.0 koRpus.lang.en_0.1-4
#>  syuzhet_1.0.6 tibble_3.1.6
#>  tweenr_1.0.2 bslib_0.3.1
#>  stringi_1.7.6 highr_0.9
#>  desc_1.4.1 lattice_0.20-45
#>  Matrix_1.4-0 vctrs_0.3.8
#>  pillar_1.7.0 lifecycle_1.0.1
#>  BiocManager_1.30.16 jquerylib_0.1.4
#>  data.table_1.14.2 bitops_1.0-7
#>  httpuv_1.6.5 sylly.en_0.1-3
#>  R6_2.5.1 promises_188.8.131.52
#>  gridExtra_2.3 lexicon_1.2.1
#>  MASS_7.3-55 assertthat_0.2.1
#>  rprojroot_2.0.2 withr_2.5.0
#>  GenomeInfoDbData_1.2.7 parallel_4.2.0
#>  grid_4.2.0 prettydoc_0.4.1
#>  tidyr_1.2.0 rmarkdown_2.13
#>  ggforce_0.3.3 NLP_0.2-1
#>  shiny_1.7.1