The CELLxGENE Census from the Chan Zuckerberg Initiative provides APIs for R and Python.
We’ll look at the R package: browsing the metadata, then pulling down a dataset restricted to selected cells.
Start by opening a connection. Remember to close it when done.
library(cellxgene.census)
cens = open_soma()
## The stable Census release is currently 2023-12-15. Specify census_version = "2023-12-15" in future calls to open_soma() to ensure data consistency.
cens
## <SOMACollection>
## uri: s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/
## groups: census_data*, census_info*
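Since the connection has to be closed explicitly, one option is to wrap the work in a small helper that closes the handle even when an error occurs. The following is a minimal sketch, not part of cellxgene.census: `with_census()` and its `opener` argument are hypothetical names introduced here. It assumes only that `open_soma()` accepts `census_version` (as the message above states) and that the returned object has a `close()` method.

```r
# Hypothetical helper (not part of cellxgene.census): open the census,
# run `f` on the handle, and always close the handle afterwards.
with_census <- function(f, census_version = "2023-12-15",
                        opener = cellxgene.census::open_soma) {
  # Pinning census_version keeps results consistent across sessions
  cens <- opener(census_version = census_version)
  # on.exit() runs when the function exits, even if f() signals an error
  on.exit(cens$close(), add = TRUE)
  f(cens)
}
```

A call such as `with_census(function(cens) class(cens))` would open the pinned release, evaluate the function on the handle, and close the connection on the way out.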
Details of working with the census are in the package documentation. Let’s extract information specific to humans. The data are organized in nested layers, so we step down through them one at a time.
cd = cens$get("census_data")
cd
## <SOMACollection>
## uri: s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data
## groups: homo_sapiens*, mus_musculus*
cdh = cd$get("homo_sapiens")
cdh
## <SOMAExperiment>
## uri: s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens
## arrays: obs*
## groups: ms*
class(cdh)
## [1] "SOMAExperiment" "SOMACollectionBase" "TileDBGroup"
## [4] "TileDBObject" "R6"
The documentation from help(SOMAExperiment) is informative.
The variables available in the cell metadata (obs) are:
cdh$obs$colnames()
## [1] "soma_joinid"
## [2] "dataset_id"
## [3] "assay"
## [4] "assay_ontology_term_id"
## [5] "cell_type"
## [6] "cell_type_ontology_term_id"
## [7] "development_stage"
## [8] "development_stage_ontology_term_id"
## [9] "disease"
## [10] "disease_ontology_term_id"
## [11] "donor_id"
## [12] "is_primary_data"
## [13] "self_reported_ethnicity"
## [14] "self_reported_ethnicity_ontology_term_id"
## [15] "sex"
## [16] "sex_ontology_term_id"
## [17] "suspension_type"
## [18] "tissue"
## [19] "tissue_ontology_term_id"
## [20] "tissue_general"
## [21] "tissue_general_ontology_term_id"
## [22] "raw_sum"
## [23] "nnz"
## [24] "raw_mean_nnz"
## [25] "raw_variance_nnz"
## [26] "n_measured_vars"
We can retrieve information concerning cells derived from lung samples as follows.
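# Read three metadata columns for cells whose tissue is lung
obs_df <- cdh$obs$read(
  column_names = c("tissue", "development_stage", "cell_type"),
  value_filter = "tissue == 'lung'"
)
# Concatenate the result chunks and convert to a data.frame
obs_df <- as.data.frame(obs_df$concat())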
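Filters passed as `value_filter` are plain strings, so when several values are of interest it can help to assemble the string programmatically. The sketch below is an illustration only: `in_filter()` is a hypothetical helper, not a function of cellxgene.census, and it does no escaping, so it assumes the values contain no single quotes.

```r
# Hypothetical helper: build a "<column> %in% c('a', 'b')" filter string
# from a character vector of values (no escaping of quotes is attempted).
in_filter <- function(column, values) {
  quoted <- paste0("'", values, "'", collapse = ", ")
  # "%%" in sprintf() produces a literal "%"
  sprintf("%s %%in%% c(%s)", column, quoted)
}

tissue_filter <- in_filter("tissue", c("lung", "blood"))
tissue_filter
## [1] "tissue %in% c('lung', 'blood')"
```

The resulting string could then be supplied wherever a value filter is expected, in place of a hand-written one.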
That can be slow, so we’ve saved the result to simplify surveying the stage and cell type data.
lungmeta = read.csv(system.file("csv",
"lungmeta.csv.gz", package="CDNMscrna"),
row.names=1)
tail(sort(table(lungmeta$development_stage)))
##
## 67-year-old human stage
## 109935
## 15th week post-fertilization human stage
## 110816
## 74-year-old human stage
## 121582
## 61-year-old human stage
## 130927
## 64-year-old human stage
## 153940
## unknown
## 1477039
tail(sort(table(lungmeta$cell_type)))
##
## type II pneumocyte macrophage
## 255425 266333
## CD4-positive, alpha-beta T cell CD8-positive, alpha-beta T cell
## 323610 323985
## alveolar macrophage native cell
## 526859 562038
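The tail(sort(table(...))) idiom used for these summaries can be tried on a small example without touching the census. The data frame below is a made-up stand-in for lungmeta; the values are invented purely for illustration.

```r
# Toy stand-in for lungmeta; the values are made up for illustration
toy <- data.frame(
  cell_type = c("macrophage", "macrophage", "native cell",
                "macrophage", "native cell", "type II pneumocyte")
)

# table() counts each distinct value, sort() orders the counts ascending,
# and tail(, 2) keeps the two most frequent categories
top2 <- tail(sort(table(toy$cell_type)), 2)
top2  # native cell: 2, macrophage: 3
```

Because sort() is ascending by default, the most common category appears last, which matches the ordering of the summaries above.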
Now that we’ve seen how to filter the metadata, retrieving expression data for selected genes and cells with get_seurat() should be straightforward, though possibly slow.
# Restrict to two genes (by Ensembl ID) and one cell type
gene_filter <- "feature_id %in% c('ENSG00000107317', 'ENSG00000106034')"
cell_filter <- "cell_type == 'sympathetic neuron'"
cell_columns <- c("assay", "cell_type", "tissue",
                  "tissue_general", "suspension_type",
                  "disease")
seurat_obj <- get_seurat(
  census = cens,              # connection opened above with open_soma()
  organism = "Homo sapiens",
  var_value_filter = gene_filter,
  obs_value_filter = cell_filter,
  obs_column_names = cell_columns
)
cens$close()                  # close the connection when done