Authors: Vince Carey2,
Abstract: Multilingual data science strategies can increase efficiency of discovery by taking advantage of diverse data management and analysis strategies.
In this workshop we will examine interplay between R, Python, and Apache Spark in genetic and single-cell applications. CITE-seq studies simultaneously quantify surface protein and mRNA abundance in single cells. We will use scviR to compare interpretations based on deep learning and sequential component-specific methods.
The UK Biobank is the foundation of thousands of genome-wide association studies. The Telomere-to-Telomere project produced the first gapless human reference genome. Both of these resources will be explored using BiocHail. Workshop attendees will acquire an understanding of Aaron Lun’s basilisk package and its use in isolating specific collections of python modules, the anndata representations and scvi-tools analyses of CITE-seq data, and the hail.is approach to structuring and analyzing massive genetics data resources using Spark Resilient Distributed Data. All programming will be carried out in R; quarto documents that mix R and python will also be illustrated.
It will be helpful to have an acquaintance with
This is a 90 minute workshop that will cover
Activity | Time |
---|---|
Verify setup | 5m |
Motivations/discussion | 5m |
basilisk and basilisk.utils | 10m |
(concerns with bloat) | |
OSCA advanced: CITE-seq | 15m |
break | 10 m |
scviR – AnnData and tutorial VAE | 20m |
Hail: exploring 1KG and UKBB | 20m |
Review | 5m |
There are three basic phases of working with basilisk to interoperate with python in an R package.
First, define the BasiliskEnvironment
, in which python dependencies are explicitly declared. Here’s an example from scviR, an interface to scvi-tools of the scverse. All explicit dependencies have versions pinned.
bsklenv <- basilisk::BasiliskEnvironment(
envname = "bsklenv",
pkgname = "scviR",
packages = c("numpy==1.23.1", "pandas==1.4.4"),
pip = c(
"scvi-tools==0.20.0", "scanpy==1.9.1", # "scikit-misc==0.1.4",
"matplotlib==3.6.3"
)
)
The BasiliskEnvironment
instance is an unexported symbol in the client package namespace. basiliskStart
. In general, client functions use basiliskStart to fork the R process and use reticulate to call python methods defined in the BasiliskEnvironment
instance. basiliskRun
is then used with the environment to work with python modules via reticulate.
scviR <- function() {
proc <- basilisk::basiliskStart(bsklenv)
on.exit(basilisk::basiliskStop(proc))
basilisk::basiliskRun(proc, function() {
reticulate::import("scvi")
})
}
The approach used above has worked for some time, but it should be modified to have the parameter persist
set to TRUE in the basiliskRun
calls.
If basiliskStart cannot load the python packages specified in the BasiliskEnvironment instance, it will call basilisk.utils::installConda()
. This
downloads and runs a Miniconda installer to
create a dedicated Conda instance that is managed by 'basilisk',
separate from other instances that might be available on the
system.
Every version of basilisk and every version of any client package will, by default, have its own Miniconda installation. Tooling for removing conda environments for obsolete versions is available in basilisk.utils
.
Compare the results of
system(paste("du -sh", basilisk.utils::getCondaDir()))
system(paste("du -sh", R.home()))
Note that the python infrastructure installed here supports anndata, scvi-tools, and hail and all their python dependencies.
Use the commands
library(scviR)
library(reticulate)
scvi = scviR()
names(scvi)
Now try
py_help(scvi)
py_help(scvi$module)
This is a good way to learn about the capabilities of the imported python infrastructure.
Using the two client packages scviR
and BiocHail
in a single R session is currently inadvisable. In the following, I had previously run BiocHail::hail_init()
> scvi = scviR()
Global seed set to 0
> hl
Module(hail)
> scvi
<pointer: 0x0>
It is possible that proper use of the persist
parameter with basiliskRun would prevent this. In this workshop we will restart R when moving from one package to another.
Harvard Medical School↩