Last updated: 2023-06-13
Checks: 6 1
Knit directory: SpatialLibrarySizePaper/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The R Markdown is untracked by Git. To know which version of the R
Markdown file created these results, you’ll want to first commit it to
the Git repo. If you’re still working on the analysis, you can ignore
this warning. When you’re finished, you can run
wflow_publish
to commit the R Markdown file and build the
HTML.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20230612)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ce1a620. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rproj.user/
Ignored: analysis/figure/
Untracked files:
Untracked: analysis/libsize_cosmx.Rmd
Untracked: analysis/libsize_stomics.Rmd
Untracked: analysis/libsize_xenium.Rmd
Untracked: data/ABA_V3_ontology.csv
Unstaged changes:
Modified: analysis/_site.yml
Modified: analysis/about.Rmd
Modified: analysis/index.Rmd
Modified: analysis/license.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
There are no past versions. Publish this analysis with
wflow_publish()
to start tracking its development.
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
# load packages required for analysis
library(tidyverse)
library(patchwork)
library(SpatialExperiment)
library(ExperimentHub)
library(SubcellularSpatialData)
library(car)
All transcript detection level data is hosted on the ExperimentHub as
part of the SubcellularSpatialData
package. Data on the
ExperimentHub can be queried as below. The
SubcellularSpatialData
hosts region annotated data from 3
sub-cellular resolution spatial transcriptomics platforms (NanoString
CosMx, 10x Xenium, and BGI STOmics).
#identify datasets present in the data package
eh = ExperimentHub()
query(eh, 'SubcellularSpatialData')
ExperimentHub with 3 records
# snapshotDate(): 2023-06-01
# $dataprovider: NanoString, BGI, 10x
# $species: Mus musculus, Homo sapiens
# $rdataclass: data.frame
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["EH8230"]]'
title
EH8230 | xenium_mm_brain
EH8231 | stomics_mm_brain
EH8232 | cosmx_hs_nsclc
Transcript detection data from the package is loaded as follows and contains the x and y coordinates of each detection/spot, gene name, gene count (1 for CosMx and Xenium representing each detection), genetype, region annotations, and other technology specific information.
tx_detections = eh[['EH8230']]
head(tx_detections)
sample_id cell gene genetype x
1 Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs 67490 Bhlhe40 Gene 4843.046
2 Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs 67957 Parm1 Gene 4844.633
3 Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs 67539 Bhlhe40 Gene 4842.943
4 Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs 68003 Lyz2 Gene 4843.941
5 Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs 67344 Dkk3 Gene 4843.163
6 Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs 67957 Parm1 Gene 4845.258
y counts region technology level Level0 Level1 Level2 Level3 Level4
1 6427.730 1 RSPd2/3 Xenium Level11 root grey CH CTX CTXpl
2 6223.183 1 RSPv5 Xenium Level11 root grey CH CTX CTXpl
3 6478.310 1 RSPd2/3 Xenium Level11 root grey CH CTX CTXpl
4 6344.550 1 RSPd5 Xenium Level11 root grey CH CTX CTXpl
5 6632.112 1 RSPd1 Xenium Level11 root grey CH CTX CTXpl
6 6227.622 1 RSPd5 Xenium Level11 root grey CH CTX CTXpl
Level5 Level6 Level7 Level8 Level9 Level10 Level11 transcript_id
1 Isocortex <NA> <NA> RSP RSPd <NA> RSPd2/3 2.81475e+14
2 Isocortex <NA> <NA> RSP RSPv <NA> RSPv5 2.81475e+14
3 Isocortex <NA> <NA> RSP RSPd <NA> RSPd2/3 2.81475e+14
4 Isocortex <NA> <NA> RSP RSPd <NA> RSPd5 2.81475e+14
5 Isocortex <NA> <NA> RSP RSPd <NA> RSPd1 2.81475e+14
6 Isocortex <NA> <NA> RSP RSPd <NA> RSPd5 2.81475e+14
overlaps_nucleus z_location qv
1 1 19.06887 40.00000
2 0 18.52016 40.00000
3 1 18.50011 13.81358
4 0 15.01615 19.47188
5 0 15.39468 40.00000
6 0 16.95615 40.00000
#total detections per sample
tx_detections |>
group_by(sample_id) |>
summarise(TotalDetections = sum(counts))
# A tibble: 3 × 2
sample_id TotalDetections
<chr> <dbl>
1 Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs 62744602
2 Xenium_V1_FF_Mouse_Brain_MultiSection_2_outs 58716497
3 Xenium_V1_FF_Mouse_Brain_MultiSection_3_outs 59424126
This workflow analyses the 10x Xenium data. The colour map used by the Allen Brain Atlas (ABA) is used to visualise annotated data points. This is stored in the ABA_V3_ontology.csv file.
#Load ABA v3 ontology colour map
ont_df = read.csv('data/ABA_V3_ontology.csv')
colmap = ont_df$color_hex_triplet
names(colmap) = ont_df$acronym
The Allen Brain Atlas annotates regions of the brain at multiple levels (Level 0 to 11). Each data point is assigned an annotation for each level. The overall annotation of each datapoint is the lowest non-missing level. For example, if a detection has annotations for levels 1-5 and 8-9 but missing annotations for levels 6-7 and 10-11, the detection’s annotation would be the level 9 annotation. The region column stores the lowest non-mission annotation while the level column stores the corresponding annotation level. Sub-structures of the CA region, specifically the CA1, CA2 and CA3, pyramidal layers (CA1sp, CA2sp, and CA3sp respectively) that were missing in the ABA were manually annotated.
#region summaries
tx_detections |>
filter(genetype %in% 'Gene') |>
group_by(sample_id, region) |>
summarise(TotalDetections = sum(counts)) |>
mutate(PropDetections = TotalDetections / sum(TotalDetections)) |>
arrange(desc(PropDetections)) |>
ggplot(
aes(
fct_reorder(region, PropDetections, median),
PropDetections,
fill = region,
group = paste(sample_id, region)
)
) +
geom_bar(stat = 'identity', position = 'dodge') +
scale_fill_manual(values = colmap) +
labs(x = 'Region', y = 'Proportion of detections') +
vissE::bhuvad_theme() +
theme(legend.position = 'none', axis.text.x = element_blank())
#level summaries
tx_detections |>
filter(genetype %in% 'Gene') |>
group_by(sample_id, level) |>
summarise(TotalDetections = sum(counts)) |>
mutate(PropDetections = TotalDetections / sum(TotalDetections)) |>
arrange(desc(PropDetections)) |>
ggplot(
aes(
fct_reorder(level, PropDetections, median),
PropDetections,
fill = level,
group = paste(sample_id, level)
)
) +
geom_bar(stat = 'identity', position = 'dodge') +
scale_fill_viridis_d(option = 'G') +
labs(x = 'Level', y = 'Proportion of detections') +
vissE::bhuvad_theme() +
theme(legend.position = 'none')
Since cell boundary detection is still a work in progress, we bin
individual detections in to a hexagonal grid with 200 bins along each
axis. The results are saved into a SpatialExperiment
object
where each column represents a bin and each row a gene.
spe = tx2spe(tx_detections, bin = 'hex', nbins = 200)
spe$ntranscripts = colSums(counts(spe)[rowData(spe)$genetype %in% 'Gene', ])
spe
class: SpatialExperiment
dim: 541 119844
metadata(0):
assays(1): counts
rownames(541): Bhlhe40 Parm1 ... BLANK_0433 BLANK_0110
rowData names(2): gene genetype
colnames(119844): Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs_1
Xenium_V1_FF_Mouse_Brain_MultiSection_1_outs_2 ...
Xenium_V1_FF_Mouse_Brain_MultiSection_3_outs_37370
Xenium_V1_FF_Mouse_Brain_MultiSection_3_outs_37371
colData names(24): sample_id bin_id ... Level11 ntranscripts
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
spatialCoords names(2) : x y
imgData names(1): sample_id
Visualising transcript density using hexbin plots where detections are binned in to a hexagonal grid reveals tissue structure.
hex_counts = as.data.frame(colData(spe)) |>
cbind(as.data.frame(spatialCoords(spe)))
#library size
p1 = hex_counts |>
ggplot(aes(x, y, colour = ntranscripts / 1000)) +
geom_point(size = 0.1) +
scale_colour_viridis_c(option = 'F') +
facet_wrap(~ sample_id, scales = 'free') +
labs(x = '', y = '', colour = 'Library size (1000s)') +
vissE::bhuvad_theme() +
theme(
axis.text = element_blank(),
panel.background = element_rect(fill = '#444444')
)
#region
p2 = hex_counts |>
ggplot(aes(x, y, colour = region)) +
geom_point(size = 0.1) +
scale_colour_manual(values = colmap, guide = guide_none()) +
facet_wrap(~ sample_id, scales = 'free') +
labs(x = '', y = '', colour = '') +
vissE::bhuvad_theme() +
theme(
axis.text = element_blank(),
panel.background = element_rect(fill = '#444444')
)
p1 / p2 + plot_layout(guides = 'collect') & theme(legend.position = 'bottom')
We can assess the average transcripts detected per cell (library size) across regions without assigning transcripts to cells.
\(AvgLibrarySize_{region} = \frac{\#transcripts_{region}}{\#cells_{region}}\)
tx_detections |>
filter(genetype %in% 'Gene') |>
group_by(sample_id, region) |>
summarise(LibrarySize = sum(counts) / length(unique(cell))) |>
ggplot(
aes(
LibrarySize / 1000,
fct_reorder(region, LibrarySize, mean),
fill = region
),
) +
geom_bar(stat = 'identity') +
scale_fill_manual(values = colmap, guide = guide_none()) +
labs(x = 'Region', y = 'Library size per cell (1000s)') +
facet_wrap(~ sample_id, scales = 'free_x', ncol = 3) +
vissE::bhuvad_theme()
We can determine the factors driving transcript detection across space by assessing their effects using binned data. We expect library sizes to be higher in spatial regions with many cells. Based on the observations in the previous sections, we would also suspect that different tissue regions will produce different numbers of transcripts, despite having similar cellular desity. Visualising this effect using the plot below shows how the primary contributor of transcript detections is the number of cells in the binned region. We also see that there is a region specific effect where the library size shifts up or down depending on the region being assessed.
Grey regions representing fibre tracts of the brain have larger library sizes where more cells are present, however, these library sizes are lower than most other regions of the brain on average. In contrast, the cortex (dark green regions), has much larger library sizes that most other regions and the magnitude of relationship between the number of cells and library sizes (the slope) is much larger (steeper slope).
hex_counts |>
ggplot(aes(ncells, ntranscripts / 1000, colour = region)) +
geom_jitter(width = 0.4, size = 0.3) +
scale_colour_manual(values = colmap, guide = guide_none()) +
labs(x = 'Cells per bin', y = 'Detections per bin (1000s)', caption = '*jitter applied for visualisation purposes') +
facet_wrap(~ sample_id, ncol = 3, scales = 'free') +
vissE::bhuvad_theme()
Point pattern spatial data can be modelled using an inhomogeneous Poisson point process. A simple approach to model library sizes across space is to bin detections and then model binned counts using a generalised linear models (GLMs) with a Poisson link function. Counts per bin can be modelled as a function of the number of cells, the tissue region and other technology specific covariates such as the field of view. Additionally, if other spatial effects are present, functions of the coordinates can be used to model them. We model the former knowledge driven spatial trends. However, as is evident from some of these analyses, there may be spatial effects around the boundaries of tissues and these can be modelled using the x,y coordinates and their interactions.
#fit models
libsize_models = hex_counts |>
filter(!is.na(region), ncells > 0 | ntranscripts > 0) |>
group_by(sample_id) |>
summarise(model = list(glm(ntranscripts ~ 0 + ncells * region, family = poisson)),
across(everything(), list)) |>
mutate(model = unlist(model, recursive = FALSE))
We can visualise the Pearson and standard residuals to assess the model fit.
#visualise residuals
residuals_df = libsize_models |>
mutate(
Predicted = lapply(model, predict),
Residuals = lapply(model, residuals),
ResidualsPearson = lapply(model, residuals, type = 'pearson'),
) |>
unnest(!c(sample_id, model))
#plot residuals ves fitted values
residuals_df |>
ggplot() +
geom_point(aes(x, y, colour = Residuals), size = 0.1) +
facet_wrap(~ sample_id, scales = 'free') +
scico::scale_color_scico(palette = 'bam',
oob = scales::squish,
limits = c(-50, 50) * 1) +
labs(title = 'Residuals by location', x = '', y = '') +
vissE::bhuvad_theme() +
theme(axis.text = element_blank(), legend.position = 'bottom')
#plot residuals ves fitted values
residuals_df |>
filter(Predicted > 4) |>
ggplot(aes(Predicted, Residuals)) +
geom_point(aes(colour = region), size = 0.4) +
geom_hline(yintercept = 0, col = 1, lwd = 0.2) +
geom_smooth(se = FALSE, col = 2, lwd = 0.5, method = 'loess') +
facet_wrap(~ sample_id, ncol = 1) +
scale_colour_manual(values = colmap, guide = guide_none()) +
labs(
x = 'log(Predicted library size)',
y = 'Residuals',
title = 'Residuals vs fitted plot',
caption = '*3 outliers removed for visualisation purposes'
) +
vissE::bhuvad_theme()
#plot residuals ves fitted values
residuals_df |>
filter(Predicted > 4) |>
ggplot(aes(Predicted, sqrt(abs(ResidualsPearson)))) +
geom_point(aes(colour = region), size = 0.4) +
geom_smooth(se = FALSE, col = 2, lwd = 0.5, method = 'loess') +
facet_wrap(~ sample_id, ncol = 1) +
scale_colour_manual(values = colmap, guide = guide_none()) +
labs(
x = 'log(Predicted library size)',
y = expression(sqrt('|Std. Pearson resid.|')),
title = 'Scale-location plot',
caption = '*3 outliers removed for visualisation purposes'
) +
vissE::bhuvad_theme()
#plot residuals ves fitted values
residuals_df |>
ggplot() +
geom_boxplot(aes(reorder(region, Residuals, FUN = var), Residuals, fill = region)) +
geom_hline(yintercept = 0, col = 2) +
facet_wrap(~ sample_id, ncol = 1) +
scale_fill_manual(values = colmap, guide = guide_none()) +
labs(title = 'Residuals by region') +
vissE::bhuvad_theme() +
labs(x = 'Region') +
theme(axis.text.x = element_blank())
Analysing the residuals shows that:
Having fit the model, we can analyse the effect size of different regions. The modelling we perform allows each region to have its own linear trend between the number of cells and the total trascripts detected. We can therefore analyse the slope and intercepts of these relationships across regions.
This analysis shows that the slope (rate of increase in transcript density with respect to cell density) is generally higher in the cortex (dark green) than in the fibre tracts of the brain (grays).
coef_df = libsize_models |>
mutate(Coefficient = lapply(model, coef),
Region = lapply(Coefficient, names)) |>
unnest(c(Coefficient, Region)) |>
select(sample_id, Coefficient, Region) |>
group_by(sample_id) |>
mutate(NCell = Coefficient[Region == 'ncells']) |>
ungroup() |>
mutate(
Term = case_when(
grepl('^region[[:alnum:]/ ]+$', Region) ~ 'CoefRegion',
grepl('^ncells:region[[:alnum:]/ ]+$', Region) ~ 'CoefInteraction',
.default = NA_character_
),
Region = gsub('^(ncells:)?region', '', Region)
) |>
drop_na(Term) |>
pivot_wider(names_from = Term, values_from = Coefficient, values_fill = 0) |>
mutate(slope = NCell + CoefInteraction, intercept = CoefRegion)
coef_df |>
ggplot() +
geom_jitter(aes(exp(slope), exp(intercept) / 1e3, colour = Region)) +
scale_colour_manual(values = colmap, guide = guide_none()) +
facet_wrap(~ sample_id) +
labs(
x = 'Fold-change per cell',
y = 'Average library size (1000s)',
caption = '* 7 outlier removed for visualisation purposes'
) +
lims(x = c(0.95, 1.17), y = c(0, 3)) +
vissE::bhuvad_theme()
We can perform an analysis of variance (ANOVA) analysis on the model
to determine the importance of region and the cell density in predicting
transcript density. We use a Type II ANOVA test from the
car
package to do so.
aov_results = libsize_models |>
select(sample_id, model) |>
mutate(ANOVA = lapply(model, car::Anova, test.statistic = 'F')) |>
mutate(ANOVASummary = lapply(ANOVA, function(x) as.data.frame(x))) |>
mutate(Covariate = lapply(ANOVA, function(x) {
as.matrix(x) |> rownames() |> trimws()
})) |>
unnest(c(ANOVASummary, Covariate)) |>
filter(!Covariate %in% 'Residuals') |>
select(!c(model, ANOVA))
aov_results |>
DT::datatable(filter = 'top') |>
DT::formatSignif(2:5, digits = 4)
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /stornext/System/data/apps/R/R-4.3.0/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-4.3.0/lib64/R/lib/libRlapack.so; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] car_3.1-2 carData_3.0-5
[3] SubcellularSpatialData_0.99.4 ExperimentHub_2.8.0
[5] AnnotationHub_3.9.1 BiocFileCache_2.9.0
[7] dbplyr_2.3.2 SpatialExperiment_1.11.0
[9] SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1
[11] Biobase_2.61.0 GenomicRanges_1.53.1
[13] GenomeInfoDb_1.37.1 IRanges_2.35.1
[15] S4Vectors_0.39.1 BiocGenerics_0.47.0
[17] MatrixGenerics_1.13.0 matrixStats_1.0.0
[19] patchwork_1.1.2 lubridate_1.9.2
[21] forcats_1.0.0 stringr_1.5.0
[23] dplyr_1.1.2 purrr_1.0.1
[25] readr_2.1.4 tidyr_1.3.0
[27] tibble_3.2.1 ggplot2_3.4.2
[29] tidyverse_2.0.0 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.14 jsonlite_1.8.5
[3] magrittr_2.0.3 magick_2.7.4
[5] farver_2.1.1 rmarkdown_2.22
[7] fs_1.6.2 zlibbioc_1.47.0
[9] vctrs_0.6.2 memoise_2.0.1
[11] DelayedMatrixStats_1.23.0 RCurl_1.98-1.12
[13] htmltools_0.5.5 S4Arrays_1.1.4
[15] curl_5.0.1 Rhdf5lib_1.23.0
[17] SparseArray_1.1.9 rhdf5_2.45.0
[19] sass_0.4.6 bslib_0.5.0
[21] htmlwidgets_1.6.2 cachem_1.0.8
[23] igraph_1.4.3 whisker_0.4.1
[25] mime_0.12 lifecycle_1.0.3
[27] pkgconfig_2.0.3 Matrix_1.5-4.1
[29] R6_2.5.1 fastmap_1.1.1
[31] GenomeInfoDbData_1.2.10 shiny_1.7.4
[33] digest_0.6.31 colorspace_2.1-0
[35] AnnotationDbi_1.63.1 ps_1.7.5
[37] rprojroot_2.0.3 dqrng_0.3.0
[39] crosstalk_1.2.0 RSQLite_2.3.1
[41] beachmat_2.17.6 labeling_0.4.2
[43] filelock_1.0.2 fansi_1.0.4
[45] timechange_0.2.0 mgcv_1.8-42
[47] abind_1.4-5 httr_1.4.6
[49] compiler_4.3.0 vissE_1.9.0
[51] bit64_4.0.5 withr_2.5.0
[53] BiocParallel_1.35.2 DBI_1.1.3
[55] hexbin_1.28.3 highr_0.10
[57] HDF5Array_1.29.3 R.utils_2.12.2
[59] rappdirs_0.3.3 DelayedArray_0.27.4
[61] rjson_0.2.21 tools_4.3.0
[63] interactiveDisplayBase_1.39.0 httpuv_1.6.11
[65] R.oo_1.25.0 glue_1.6.2
[67] callr_3.7.3 nlme_3.1-162
[69] rhdf5filters_1.13.2 promises_1.2.0.1
[71] grid_4.3.0 getPass_0.2-2
[73] generics_0.1.3 gtable_0.3.3
[75] tzdb_0.4.0 R.methodsS3_1.8.2
[77] hms_1.1.3 utf8_1.2.3
[79] XVector_0.41.1 BiocVersion_3.18.0
[81] pillar_1.9.0 limma_3.57.2
[83] later_1.3.1 splines_4.3.0
[85] lattice_0.21-8 bit_4.0.5
[87] tidyselect_1.2.0 locfit_1.5-9.8
[89] Biostrings_2.69.1 scuttle_1.11.0
[91] knitr_1.43 git2r_0.32.0
[93] edgeR_3.43.3 xfun_0.39
[95] DropletUtils_1.21.0 DT_0.28
[97] stringi_1.7.12 yaml_2.3.7
[99] evaluate_0.21 codetools_0.2-19
[101] BiocManager_1.30.21 cli_3.6.1
[103] xtable_1.8-4 munsell_0.5.0
[105] processx_3.8.1 jquerylib_0.1.4
[107] Rcpp_1.0.10 png_0.1-8
[109] parallel_4.3.0 ellipsis_0.3.2
[111] blob_1.2.4 sparseMatrixStats_1.13.0
[113] bitops_1.0-7 viridisLite_0.4.2
[115] scales_1.2.1 crayon_1.5.2
[117] scico_1.4.0 rlang_1.1.1
[119] KEGGREST_1.41.0