Introduction

The CZ CELLxGENE Discover Census, from the Chan Zuckerberg Initiative, provides APIs for R and Python.

We’ll look at the R package: first browsing metadata, then pulling down a dataset of selected cells.

Metadata

Start with a connection. Remember to close it when done.

library(cellxgene.census)
cens = open_soma()
## The stable Census release is currently 2023-12-15. Specify census_version = "2023-12-15" in future calls to open_soma() to ensure data consistency.
cens
## <SOMACollection>
##   uri: s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/ 
##   groups: census_data*, census_info*
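
One way to honor the reminder about closing is to wrap the open and close steps in a small helper. The sketch below is an illustration only: `with_census` is a hypothetical name, not a function in cellxgene.census, and it assumes the object returned by `open_soma()` has a `close()` method, as SOMA collections do. The release is pinned to the version named in the message above.

```r
# Hypothetical helper (not part of cellxgene.census): open a pinned Census
# release, run `f` on the connection, and close it even if `f` fails.
with_census <- function(f, census_version = "2023-12-15") {
  cens <- cellxgene.census::open_soma(census_version = census_version)
  on.exit(cens$close(), add = TRUE)  # runs on normal exit and on error
  f(cens)
}

# Usage (needs network access), listing the human obs column names:
# with_census(function(cens) {
#   cens$get("census_data")$get("homo_sapiens")$obs$colnames()
# })
```

Because the connection is closed as soon as `f` returns, `f` should return fully materialized results (for example a data frame) rather than a handle that still needs the open connection. The rest of this walkthrough instead keeps `cens` open across several steps.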

Details of working with the census are in the package documentation. Let’s extract information specific to humans. The census is a nested collection, so we descend through it one level at a time.

cd  = cens$get("census_data")
cd
## <SOMACollection>
##   uri: s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data 
##   groups: homo_sapiens*, mus_musculus*
cdh = cd$get("homo_sapiens")
cdh
## <SOMAExperiment>
##   uri: s3://cellxgene-census-public-us-west-2/cell-census/2023-12-15/soma/census_data/homo_sapiens 
##   arrays: obs* 
##   groups: ms*
class(cdh)
## [1] "SOMAExperiment"     "SOMACollectionBase" "TileDBGroup"       
## [4] "TileDBObject"       "R6"

Documentation from help(SOMAExperiment) is informative.

The cell-level variables available in obs are:

cdh$obs$colnames()
##  [1] "soma_joinid"                             
##  [2] "dataset_id"                              
##  [3] "assay"                                   
##  [4] "assay_ontology_term_id"                  
##  [5] "cell_type"                               
##  [6] "cell_type_ontology_term_id"              
##  [7] "development_stage"                       
##  [8] "development_stage_ontology_term_id"      
##  [9] "disease"                                 
## [10] "disease_ontology_term_id"                
## [11] "donor_id"                                
## [12] "is_primary_data"                         
## [13] "self_reported_ethnicity"                 
## [14] "self_reported_ethnicity_ontology_term_id"
## [15] "sex"                                     
## [16] "sex_ontology_term_id"                    
## [17] "suspension_type"                         
## [18] "tissue"                                  
## [19] "tissue_ontology_term_id"                 
## [20] "tissue_general"                          
## [21] "tissue_general_ontology_term_id"         
## [22] "raw_sum"                                 
## [23] "nnz"                                     
## [24] "raw_mean_nnz"                            
## [25] "raw_variance_nnz"                        
## [26] "n_measured_vars"

We can retrieve information concerning cells derived from lung samples as follows.

obs_df <- cdh$obs$read(column_names = 
     c("tissue", "development_stage", "cell_type"),
     value_filter = "tissue == 'lung'"
     )
obs_df <- as.data.frame(obs_df$concat())

That can be slow, so we’ve saved the result to simplify surveying the stage and cell type data.

lungmeta = read.csv(system.file("csv", 
    "lungmeta.csv.gz", package="CDNMscrna"),
    row.names=1)
tail(sort(table(lungmeta$development_stage)))
## 
##                  67-year-old human stage 
##                                   109935 
## 15th week post-fertilization human stage 
##                                   110816 
##                  74-year-old human stage 
##                                   121582 
##                  61-year-old human stage 
##                                   130927 
##                  64-year-old human stage 
##                                   153940 
##                                  unknown 
##                                  1477039
tail(sort(table(lungmeta$cell_type)))
## 
##              type II pneumocyte                      macrophage 
##                          255425                          266333 
## CD4-positive, alpha-beta T cell CD8-positive, alpha-beta T cell 
##                          323610                          323985 
##             alveolar macrophage                     native cell 
##                          526859                          562038
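
The `tail(sort(table(...)))` idiom used above lists the most frequent categories in ascending order of count. As a minimal sketch of how it behaves, here it is on a small made-up data frame; the values in `toy` are invented for illustration and are not census data.

```r
# Toy metadata: invented values, standing in for a column like cell_type.
toy <- data.frame(
  cell_type = c("macrophage", "macrophage", "macrophage",
                "T cell", "T cell", "native cell"),
  stringsAsFactors = FALSE
)

# Count each category, then sort the counts from smallest to largest.
counts <- sort(table(toy$cell_type))

# tail() keeps the last entries, i.e. the most frequent categories.
top2 <- tail(counts, 2)
print(top2)

# The same categories with the largest first, if that reads better.
top2_desc <- head(sort(table(toy$cell_type), decreasing = TRUE), 2)
print(top2_desc)
```

On the real `lungmeta` table the same calls apply unchanged; only the column name differs.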

Expression data

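Now that we’ve seen how to filter the metadata, retrieving expression data should be straightforward, though possibly slow. The query below selects two genes by Ensembl ID, restricts to cells annotated as sympathetic neurons, and returns the result as a Seurat object.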

gene_filter <- "feature_id %in% c('ENSG00000107317', 
    'ENSG00000106034')"
cell_filter <-  "cell_type == 'sympathetic neuron'"
cell_columns <- c("assay", "cell_type", "tissue", 
     "tissue_general", "suspension_type", 
     "disease")

# Use the connection opened earlier (cens) and name the organism explicitly.
seurat_obj <- get_seurat(
   census = cens,
   organism = "Homo sapiens",
   var_value_filter = gene_filter,
   obs_value_filter = cell_filter,
   obs_column_names = cell_columns
)
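
Filter strings like `gene_filter` are tedious to type by hand when the gene list is long. A minimal sketch of building one from a character vector follows; `in_filter` is a hypothetical helper written for this example, not a function from cellxgene.census, and it assumes the values contain no single quotes.

```r
# Hypothetical helper: build a "<column> %in% c('a', 'b', ...)" filter string.
# Assumes `values` contain no single quotes (no escaping is attempted).
in_filter <- function(column, values) {
  quoted <- paste0("'", values, "'", collapse = ", ")
  sprintf("%s %%in%% c(%s)", column, quoted)  # %% yields a literal %
}

genes <- c("ENSG00000107317", "ENSG00000106034")
gene_filter <- in_filter("feature_id", genes)
print(gene_filter)
```

The resulting string can be passed as `var_value_filter` in the same way as the hand-written one, and the same helper works for obs columns such as `cell_type`.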