vignettes/mastR_customized_design.Rmd
mastR_customized_design.RmdSimplified Customized Design For Differential Expression Analysis
mastR provides a simplified customized contrast design
for differential expression analysis, which can help users handle the
complex experimental design and data structure in one simple function
call.
The function process_data() in mastR allows
users to pass a customized contrast matrix to the function, which can
give users more flexibility.
mastR R package can be installed from Bioconductor or GitHub.
The most updated version of mastR is hosted on GitHub
and can be installed using devtools::install_github()
function provided by devtools.
# if (!requireNamespace("devtools", quietly = TRUE)) {
# install.packages("devtools")
# }
# if (!requireNamespace("mastR", quietly = TRUE)) {
# devtools::install_github("DavisLaboratory/mastR")
# }
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("mastR", quietly = TRUE)) {
BiocManager::install("mastR")
}
packages <- c(
"BiocStyle",
"clusterProfiler",
"ComplexHeatmap",
"depmap",
"enrichplot",
"ggrepel",
"Glimma",
"gridExtra",
"jsonlite",
"knitr",
"rmarkdown",
"RobustRankAggreg",
"rvest",
"singscore",
"UpSetR"
)
for (i in packages) {
if (!requireNamespace(i, quietly = TRUE)) {
install.packages(i)
}
if (!requireNamespace(i, quietly = TRUE)) {
BiocManager::install(i)
}
}Here we use the example data im_data_6 from GSE60424
(Download using GEOquery::getGEO()), consisting of immune
cells from healthy individuals.
im_data_6 is a eSet object, containing
RNA-seq TMM normalized counts data of 6 sorted immune cell types each
with 4 samples. More details in ?mastR::im_data_6.
data("im_data_6")
im_data_6
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 50045 features, 24 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> sampleNames: GSM1479438 GSM1479439 ... GSM1479525 (24 total)
#> varLabels: title geo_accession ... years since diagnosis:ch1 (66
#> total)
#> varMetadata: labelDescription
#> featureData: none
#> experimentData: use 'experimentData(object)'
#> pubMedIds: 25314013
#> Annotation: GPL15456The customized contrast matrix can be created using
makeContrasts() function from limma
package.
Users can create the customized contrast matrix manually by specifying the contrast names and the levels of the groups.
## DE of NK vs B and B vs T
con_mat <- makeContrasts(
'NK-CD4' = 'NK - CD4',
'NK-T' = 'NK - (CD4 + CD8)/2',
levels = levels(factor(make.names(im_data_6$`celltype:ch1`)))
)
con_mat
#> Contrasts
#> Levels NK-CD4 NK-T
#> B.cells 0 0.0
#> CD4 -1 -0.5
#> CD8 0 -0.5
#> Monocytes 0 0.0
#> Neutrophils 0 0.0
#> NK 1 1.0However, it is important to note that the levels of the groups should be consistent with the levels of the groups in the expression matrix. Otherwise, the contrast matrix will not be correct and the analysis will stop with an error.
So it is recommended and safer to create the customized contrast
matrix from the design matrix generated from process_data()
function.
What we need to do is to first process the data using
process_data() function with random target group, then
extract the design matrix from the proc_data object.
Next, we can create the customized contrast matrix from the design matrix.
proc_data <- mastR::process_data(
im_data_6,
group_col = 'celltype:ch1',
target_group = 'NK',
summary = FALSE,
gene_id = "ENSEMBL" ## rownames of im_data_6 is ENSEMBL ID
)
con_mat2 <- makeContrasts(
'NK-CD4' = 'NK - CD4',
'NK-T' = 'NK - (CD4 + CD8)/2',
levels = proc_data$vfit$design
)
con_mat2
#> Contrasts
#> Levels NK-CD4 NK-T
#> B.cells 0 0.0
#> CD4 -1 -0.5
#> CD8 0 -0.5
#> Monocytes 0 0.0
#> Neutrophils 0 0.0
#> NK 1 1.0
identical(con_mat, con_mat2)
#> [1] TRUEThen, we can use the process_data() function to obtain
the DE results with the customized contrast design.
At this point, the DE analysis is performed based on the customized contrast design, regardless of the target group.
proc_data <- mastR::process_data(
im_data_6,
group_col = 'celltype:ch1',
target_group = 'NK',
contrast_mat = con_mat, ## specify contrast of NK vs B and B vs T
summary = TRUE,
gene_id = "ENSEMBL" ## rownames of im_data_6 is ENSEMBL ID
)
#> NK-CD4 NK-T
#> Down 2694 2678
#> NotSig 4985 4935
#> Up 2732 2798
## plot mean-var
mastR::plot_mean_var(proc_data)
The DE results are stored in the proc_data object and
can be easily accessed via proc_data$tfit.
## contrast names
colnames(proc_data$tfit)
#> [1] "NK-CD4" "NK-T"
## DE results for 'NK-B' contrast
na.omit(limma::topTreat(
proc_data$tfit,
coef = 1, # or 'NK-B' for the first contrast
number = Inf # get all DE results
)) |> head()
#> logFC AveExpr t P.Value adj.P.Val
#> ENSG00000229164 -5.277165 3.816770 -27.57282 1.573629e-18 9.709102e-15
#> ENSG00000167286 -5.685866 2.309187 -26.96013 2.542590e-18 9.709102e-15
#> ENSG00000137078 -5.862098 1.745241 -26.77409 2.947367e-18 9.709102e-15
#> ENSG00000172673 -6.209083 1.783905 -26.29011 4.348395e-18 9.709102e-15
#> ENSG00000160185 -6.268015 1.707258 -26.20408 4.662906e-18 9.709102e-15
#> ENSG00000065357 -3.492239 7.339261 -25.51861 8.197948e-18 1.422481e-14Of course, users can also use the get_de_table()
function to easily get all DE result tables for all contrasts on a
single call.
## DE results for all contrasts
DE_table <- mastR::get_de_table(
im_data_6,
group_col = 'celltype:ch1',
target_group = 'NK',
contrast_mat = con_mat, ## specify contrast of NK vs B and B vs T
summary = TRUE,
gene_id = "ENSEMBL" ## rownames of im_data_6 is ENSEMBL ID
)
#> NK-CD4 NK-T
#> Down 2694 2678
#> NotSig 4985 4935
#> Up 2732 2798
names(DE_table)
#> [1] "NK-CD4" "NK-T"
head(DE_table[[1]])
#> logFC AveExpr t P.Value adj.P.Val
#> ENSG00000229164 -5.277165 3.816770 -27.57282 1.573629e-18 9.709102e-15
#> ENSG00000167286 -5.685866 2.309187 -26.96013 2.542590e-18 9.709102e-15
#> ENSG00000137078 -5.862098 1.745241 -26.77409 2.947367e-18 9.709102e-15
#> ENSG00000172673 -6.209083 1.783905 -26.29011 4.348395e-18 9.709102e-15
#> ENSG00000160185 -6.268015 1.707258 -26.20408 4.662906e-18 9.709102e-15
#> ENSG00000065357 -3.492239 7.339261 -25.51861 8.197948e-18 1.422481e-14Users can use the glimmaMDS(), glimmaMA(),
and glimmaVolcano() functions from Glimma
package to visualize the data and DE results interactively.
## MDS plot
Glimma::glimmaMDS(proc_data)
## MA plot
Glimma::glimmaMA(proc_data$tfit, dge = proc_data)
## volcano plot
Glimma::glimmaVolcano(proc_data$tfit, dge = proc_data)
sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] GSEABase_1.64.0 graph_1.80.0 annotate_1.80.0
#> [4] XML_3.99-0.16.1 AnnotationDbi_1.64.1 IRanges_2.36.0
#> [7] S4Vectors_0.40.2 Biobase_2.62.0 BiocGenerics_0.48.1
#> [10] ggplot2_3.5.0 edgeR_4.0.16 limma_3.58.1
#> [13] mastR_1.11.1 BiocStyle_2.30.0
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.2 bitops_1.0-7
#> [3] rlang_1.1.3 magrittr_2.0.3
#> [5] matrixStats_1.3.0 compiler_4.3.3
#> [7] RSQLite_2.3.6 png_0.1-8
#> [9] systemfonts_1.0.6 vctrs_0.6.5
#> [11] pkgconfig_2.0.3 crayon_1.5.2
#> [13] fastmap_1.1.1 backports_1.4.1
#> [15] XVector_0.42.0 labeling_0.4.3
#> [17] utf8_1.2.4 rmarkdown_2.26
#> [19] ragg_1.3.0 purrr_1.0.2
#> [21] bit_4.0.5 xfun_0.43
#> [23] zlibbioc_1.48.2 cachem_1.0.8
#> [25] GenomeInfoDb_1.38.8 jsonlite_1.8.8
#> [27] blob_1.2.4 highr_0.10
#> [29] DelayedArray_0.28.0 broom_1.0.5
#> [31] parallel_4.3.3 R6_2.5.1
#> [33] bslib_0.7.0 parallelly_1.37.1
#> [35] car_3.1-2 GenomicRanges_1.54.1
#> [37] jquerylib_0.1.4 SummarizedExperiment_1.32.0
#> [39] Rcpp_1.0.12 bookdown_0.39
#> [41] knitr_1.46 future.apply_1.11.2
#> [43] Matrix_1.6-5 tidyselect_1.2.1
#> [45] abind_1.4-5 yaml_2.3.8
#> [47] codetools_0.2-20 listenv_0.9.1
#> [49] lattice_0.22-6 tibble_3.2.1
#> [51] withr_3.0.0 KEGGREST_1.42.0
#> [53] evaluate_0.23 future_1.33.2
#> [55] desc_1.4.3 Biostrings_2.70.3
#> [57] pillar_1.9.0 BiocManager_1.30.22
#> [59] ggpubr_0.6.0 MatrixGenerics_1.14.0
#> [61] carData_3.0-5 generics_0.1.3
#> [63] sp_2.1-3 RCurl_1.98-1.14
#> [65] munsell_0.5.1 scales_1.3.0
#> [67] globals_0.16.3 xtable_1.8-4
#> [69] glue_1.7.0 tools_4.3.3
#> [71] locfit_1.5-9.9 ggsignif_0.6.4
#> [73] fs_1.6.3 grid_4.3.3
#> [75] tidyr_1.3.1 SingleCellExperiment_1.24.0
#> [77] colorspace_2.1-0 GenomeInfoDbData_1.2.11
#> [79] patchwork_1.2.0 msigdb_1.10.0
#> [81] cli_3.6.2 textshaping_0.3.7
#> [83] fansi_1.0.6 S4Arrays_1.2.1
#> [85] dplyr_1.1.4 gtable_0.3.5
#> [87] rstatix_0.7.2 sass_0.4.9
#> [89] digest_0.6.35 progressr_0.14.0
#> [91] SparseArray_1.2.4 farver_2.1.1
#> [93] org.Hs.eg.db_3.18.0 htmlwidgets_1.6.4
#> [95] SeuratObject_4.1.4 memoise_2.0.1
#> [97] htmltools_0.5.8.1 pkgdown_2.0.9
#> [99] lifecycle_1.0.4 httr_1.4.7
#> [101] statmod_1.5.0 bit64_4.0.5