vehicle for summarizing developments on scalable array processing
The purpose of this gh-pages site is to collect links and comments concerning array processing methods relevant to Bioconductor.
A common objective is to hide the complexities of working with very large data while permitting great flexibility in the use of available computing resources. There are many tradeoffs to navigate, and a comprehensive account of the issues will take substantial work. A brief sketch follows.
“On-disk” management of array data can take various forms and is handled in various packages.
Sparse matrix representations are also relevant.
The ALTREP (alternative representation) framework, introduced in base R 3.5.0, is also relevant.
Lazy computation. The idea is to defer computations for as long as possible, so that any filtering applied in the meantime reduces the amount of work that is actually performed.
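The lazy-computation idea can be sketched in a few lines. The example below is a toy written in Python purely for illustration; `LazyMatrix`, `map`, `select_rows`, and `realize` are hypothetical names and are not the API of DelayedArray or any Bioconductor package. It records an elementwise transformation and a row filter, and only does arithmetic when `realize()` is called, so the transformation is applied to the filtered rows only.

```python
import math


class LazyMatrix:
    """Toy delayed matrix: records operations, computes only on realize()."""

    def __init__(self, rows, row_index=None, funcs=()):
        self._rows = rows  # backing data (list of lists), never modified
        self._row_index = (list(range(len(rows)))
                           if row_index is None else list(row_index))
        self._funcs = tuple(funcs)  # pending elementwise functions

    def map(self, func):
        # Record an elementwise transformation; nothing is computed yet.
        return LazyMatrix(self._rows, self._row_index, self._funcs + (func,))

    def select_rows(self, keep):
        # Record a row filter; positions are relative to the current view.
        new_index = [self._row_index[i] for i in keep]
        return LazyMatrix(self._rows, new_index, self._funcs)

    def realize(self):
        # Apply pending functions only to rows that survived filtering.
        out = []
        for i in self._row_index:
            row = self._rows[i]
            for func in self._funcs:
                row = [func(x) for x in row]
            out.append(row)
        return out


calls = {"n": 0}  # counts how many elementwise evaluations actually happen


def log1p_counted(x):
    calls["n"] += 1
    return math.log1p(x)


counts = [[0, 3, 1], [5, 0, 0], [2, 2, 8], [0, 0, 1]]  # made-up toy data

# Transform first, filter second: both steps are only recorded.
lazy = LazyMatrix(counts).map(log1p_counted).select_rows([0, 2])
calls_before_realize = calls["n"]

result = lazy.realize()
calls_after_realize = calls["n"]
```

Although the transformation was requested before the filter, it is evaluated for the two retained rows only (6 evaluations instead of 12). An eager implementation would have transformed all four rows and then discarded half of the work.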
Aaron Lun’s simpleSingleCell workflow addresses aspects of storage and approximate matrix decompositions.
The Orchestrating Single Cell Analysis (OSCA) project includes an Rmarkdown file with an overview of big-data methods.
Peter Hickey’s Bioc 2019 workshop on DelayedArray provides many details.
> assay(tenx)
<27998 x 1306127> DelayedMatrix object of type "integer":
         AAACCTGAGATAGGAG-1 ... TTTGTCATCTGAAAGA-133
    [1,]                  0   .                    0
    [2,]                  0   .                    0
    [3,]                  0   .                    0
    [4,]                  0   .                    0
    [5,]                  0   .                    0
     ...                  .   .                    .
[27994,]                  0   .                    0
[27995,]                  1   .                    0
[27996,]                  0   .                    0
[27997,]                  0   .                    0
[27998,]                  0   .                    0
Additionally, the vignette presents issues involved with a sparse representation, managed in this case in HDF5:
> h5ls(fname)
  group       name       otype  dclass        dim
0     /       mm10   H5I_GROUP
1 /mm10   barcodes H5I_DATASET  STRING    1306127
2 /mm10       data H5I_DATASET INTEGER 2624828308
3 /mm10 gene_names H5I_DATASET  STRING      27998
4 /mm10      genes H5I_DATASET  STRING      27998
5 /mm10    indices H5I_DATASET INTEGER 2624828308
6 /mm10     indptr H5I_DATASET INTEGER    1306128
7 /mm10      shape H5I_DATASET INTEGER          2
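The dimensions in this listing are consistent with a compressed sparse column (CSC) layout: `data` and `indices` have one entry per stored value, and `indptr` has 1306128 entries, one more than the number of barcodes. Assuming that interpretation, the sketch below shows how the three vectors encode a matrix, using a made-up 4 x 3 example in plain Python. The helper `csc_column` is hypothetical and written only for illustration. It is not part of rhdf5 or any Bioconductor package.

```python
def csc_column(data, indices, indptr, n_rows, j):
    """Return column j of a CSC-encoded matrix as a dense list."""
    column = [0] * n_rows
    # Stored values of column j occupy positions indptr[j] .. indptr[j + 1] - 1.
    start, end = indptr[j], indptr[j + 1]
    for value, row in zip(data[start:end], indices[start:end]):
        column[row] = value  # indices holds 0-based row numbers
    return column


# Toy 4 x 3 matrix (rows = genes, columns = cells), invented for this sketch:
#   [[0, 2, 0],
#    [1, 0, 0],
#    [0, 0, 0],
#    [0, 3, 4]]
data = [1, 2, 3, 4]      # stored (nonzero) values, column by column
indices = [1, 0, 3, 3]   # row of each stored value
indptr = [0, 1, 3, 4]    # column boundaries: length = n_cols + 1
shape = (4, 3)

dense_columns = [csc_column(data, indices, indptr, shape[0], j)
                 for j in range(shape[1])]

# Fraction of entries stored explicitly, using the sizes from the listing above.
n_rows, n_cols, n_stored = 27998, 1306127, 2624828308
density = n_stored / (n_rows * n_cols)
```

With the sizes reported by `h5ls`, `density` comes out at roughly 7%, so the sparse layout stores only a small fraction of the 27998 x 1306127 entries explicitly. Because values are grouped by column, extracting all values for one cell is a contiguous read, whereas extracting one gene across all cells requires scanning every column.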
Have a look at the scalability channel at community-bioc.slack.com