tut_01.Rmd
This vignette works through the material in the “Adding data” tutorial document. Our objective is to use R and Python together, with basilisk managing the Python infrastructure.
We will simulate records with schematized information about 10 proteins and then use write_nodes to write them out as delimited text files.
loadBiocypher connects the BioCypher modules to R via basilisk and reticulate. A completely isolated miniconda environment, currently using Python 3.9, hosts all the Python code.
library(biocBiocypher)
bcobj = loadBiocypher()
## + /home/vincent/.cache/R/basilisk/1.16.0/0/bin/conda create --yes --prefix /home/vincent/.cache/R/basilisk/1.16.0/biocBiocypher/0.0.2/bsklenv 'python=3.9' --quiet -c conda-forge
## + /home/vincent/.cache/R/basilisk/1.16.0/0/bin/conda install --yes --prefix /home/vincent/.cache/R/basilisk/1.16.0/biocBiocypher/0.0.2/bsklenv 'python=3.9' -c conda-forge
## + /home/vincent/.cache/R/basilisk/1.16.0/0/bin/conda install --yes --prefix /home/vincent/.cache/R/basilisk/1.16.0/biocBiocypher/0.0.2/bsklenv -c conda-forge 'python=3.9' 'python=3.9' 'numpy=1.23.1' 'pandas=1.4.4'
bcobj
## biocypher_refs produced with basilisk.
## use $biocypher_ref for modules, $generator_ref for simulator
gen = bcobj$generator_ref
names(gen)
## [1] "BioCypher" "Complex"
## [3] "EntrezProtein" "Interaction"
## [5] "InteractionGenerator" "Node"
## [7] "node_generator" "Protein"
## [9] "ProteinProteinInteraction" "r"
## [11] "random" "RandomPropertyProtein"
## [13] "RandomPropertyProteinIsoform" "string"
The following R code generates records on 10 proteins:
## [1] "get_id" "get_label" "get_properties" "id"
## [5] "label" "properties"
prots[[1]]$properties
## $sequence
## [1] "GMFDKPDEKETCPNDSILKKIHHYMDVSRVDHNDILVDADGGEWSNCLACAKQIDTAWMYKTYF"
##
## $description
## [1] "y k m s n u s c i y"
##
## $taxon
## [1] "9606"
The R list prots is not known to the Python main module (__main__), however. We need to use py_run_string to define a corresponding list, proteins, there:
reticulate::py_run_string("proteins = [Protein() for _ in range(10)]")
names(reticulate::py) # symbols known to main
## [1] "BioCypher" "Complex"
## [3] "EntrezProtein" "Interaction"
## [5] "InteractionGenerator" "Node"
## [7] "node_generator" "Protein"
## [9] "ProteinProteinInteraction" "proteins"
## [11] "r" "random"
## [13] "RandomPropertyProtein" "RandomPropertyProteinIsoform"
## [15] "string"
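For readers less familiar with the Python side, the following is a minimal sketch of what the py_run_string call sets up. The Protein class here is a hypothetical stand-in, not the simulator's actual class: it only mimics the get_id, get_label and get_properties methods and the sequence, description and taxon properties seen in the output above. The ID format, the label value and the length range of the random sequence are assumptions made for illustration.

```python
import random
import string


class Protein:
    """Hypothetical stand-in for the simulator's Protein class."""

    def __init__(self):
        # Assumption: a UniProt-like ID made of one letter and five digits.
        self.id = random.choice(string.ascii_uppercase) + "".join(
            random.choices(string.digits, k=5)
        )
        # Assumption: the label equals the schema's input_label.
        self.label = "uniprot_protein"
        self.properties = {
            # Random amino-acid string; the length range is an assumption.
            "sequence": "".join(
                random.choices("ACDEFGHIKLMNPQRSTVWY", k=random.randint(50, 250))
            ),
            # Ten random lowercase letters separated by spaces.
            "description": " ".join(random.choices(string.ascii_lowercase, k=10)),
            "taxon": "9606",
        }

    def get_id(self):
        return self.id

    def get_label(self):
        return self.label

    def get_properties(self):
        return self.properties


# The same statement that py_run_string evaluates in the main module.
proteins = [Protein() for _ in range(10)]
print(len(proteins), sorted(proteins[0].get_properties()))
```

Because the list is created by a top-level assignment, it becomes a module-level (global) name, which is why it then shows up among the symbols listed by names(reticulate::py).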
Several configuration files are defined for this specific tutorial.
bc_config_path = system.file("tutorial_0.5.11",
"01_biocypher_config.yaml", package="biocBiocypher")
schema_config_path = system.file("tutorial_0.5.11",
"01_schema_config.yaml", package="biocBiocypher")
readLines(schema_config_path)
## [1] "protein:" " represented_as: node"
## [3] " preferred_id: uniprot" " input_label: uniprot_protein"
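To make the role of each key concrete, the schema can be written as a Python dictionary and used to look up which schema entity an incoming record belongs to. This is a simplified illustration of the idea only, not BioCypher's actual implementation; the entity_for_label helper and the "entrez_protein" label are hypothetical.

```python
# The schema from 01_schema_config.yaml, written out as a plain dictionary.
schema = {
    "protein": {
        "represented_as": "node",
        "preferred_id": "uniprot",
        "input_label": "uniprot_protein",
    }
}


def entity_for_label(schema, label):
    """Return the schema entity whose input_label matches, or None."""
    for entity, settings in schema.items():
        if settings.get("input_label") == label:
            return entity
    return None


# A record labelled "uniprot_protein" maps to the "protein" entity.
matched = entity_for_label(schema, "uniprot_protein")
# A label that the schema does not mention has no matching entity.
unmatched = entity_for_label(schema, "entrez_protein")
print(matched, unmatched)
```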
These configurations are loaded into the main interface. We “update” the YAML in bc_config_path so that the output folder is user-selectable. The default output folder is a temporary folder.
bc = bcobj$biocypher_ref
bc_configd = bc$BioCypher(
biocypher_config_path=update_bc_config(bc_config_path),
schema_config_path=schema_config_path
)
The node_generator was written to use a globally defined variable proteins, which was defined above with py_run_string.
We write out the nodes:
bc_configd$write_nodes(gen$node_generator())
## [1] TRUE
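The dependence of a generator on a global variable can be illustrated with a short Python sketch. It assumes that each node is emitted as an (id, label, properties) tuple; the SimpleProtein class and its values are hypothetical placeholders rather than the simulator's own code.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class SimpleProtein:
    """Hypothetical placeholder record with the three fields a node needs."""

    id: str
    label: str = "uniprot_protein"
    properties: Dict[str, str] = field(default_factory=lambda: {"taxon": "9606"})


# Module-level (global) list, analogous to the one made with py_run_string.
proteins: List[SimpleProtein] = [SimpleProtein(id=f"P{i:05d}") for i in range(10)]


def node_generator() -> Iterator[Tuple[str, str, Dict[str, str]]]:
    # `proteins` is not a parameter: it is looked up in the global
    # namespace when the generator runs, so it must exist by then.
    for protein in proteins:
        yield (protein.id, protein.label, protein.properties)


nodes = list(node_generator())
print(len(nodes), nodes[0])
```

If the global list were missing, iterating over the generator would raise a NameError, which is why the list has to be created in the main module before write_nodes is called.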
We can retrieve the configured output directory from bc_configd. In this case the files are a ‘header’ file and a semicolon-delimited data file. Below we read both, attach the header fields as column names, and then create a searchable HTML table.
od = bc_configd$base_config$output_directory
fi = dir(od, full.names=TRUE, pattern="part")
he = strsplit(readLines(dir(od, full.names=TRUE, pattern="head"), warn=FALSE), ";")[[1]]
dat = read.delim(fi, sep=";", header=FALSE)
names(dat) = he
library(DT)
datatable(dat)
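For comparison, the same header-plus-data assembly can be sketched in Python using only the standard library. The file names, column names and rows below are made up for illustration and written to a temporary directory; the real output directory and file contents come from bc_configd. The read_nodes helper is hypothetical, and it assumes fields contain no embedded semicolons or special quoting.

```python
import csv
import tempfile
from pathlib import Path


def read_nodes(output_dir):
    """Combine a one-line header file with semicolon-delimited part files."""
    out = Path(output_dir)
    # The header file holds a single line of semicolon-separated field names.
    header = next(out.glob("*head*")).read_text().strip().split(";")
    records = []
    for part in sorted(out.glob("*part*")):
        with part.open(newline="") as handle:
            for row in csv.reader(handle, delimiter=";"):
                # Pair each value with the field name at the same position.
                records.append(dict(zip(header, row)))
    return records


with tempfile.TemporaryDirectory() as tmp:
    # Illustrative files only; names and contents are assumptions.
    Path(tmp, "Protein-header.csv").write_text("id;sequence;taxon\n")
    Path(tmp, "Protein-part000.csv").write_text("P00001;MKT;9606\nP00002;GAV;9606\n")
    records = read_nodes(tmp)

print(records)
```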
cat(reticulate::py_capture_output(bc_configd$summary()))
## Showing ontology structure based on https://github.com/biolink/biolink-model/raw/v3.2.1/biolink-model.owl.ttl
## entity
## └── named thing
## └── biological entity
## └── polypeptide
## └── protein
The main BioCypher processes we have examined thus far include:
- specifying the schema configuration in YAML:
    protein:                        # mapping
      represented_as: node          # schema configuration
      preferred_id: uniprot         # uniqueness
      input_label: uniprot_protein  # connection to input stream
- configuring the “back end”, in this case ‘offline’, also in YAML
- transforming a stream of protein identifier data into graph nodes:
    bc_configd$write_nodes(gen$node_generator())
The node information was serialized to a tabular format.
In the next tutorial we will combine information from different annotation types.
## R version 4.4.0 Patched (2024-04-29 r86495)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/vincent/R-4-4-dist2/lib/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.33 biocBiocypher_0.0.2 basilisk_1.16.0
## [4] reticulate_1.38.0 dplyr_1.1.4 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-0 jsonlite_1.8.8 compiler_4.4.0
## [4] BiocManager_1.30.23 filelock_1.0.3 Rcpp_1.0.13
## [7] tidyselect_1.2.1 parallel_4.4.0 jquerylib_0.1.4
## [10] png_0.1-8 systemfonts_1.1.0 textshaping_0.4.0
## [13] yaml_2.3.10 fastmap_1.2.0 lattice_0.22-6
## [16] R6_2.5.1 generics_0.1.3 knitr_1.48
## [19] htmlwidgets_1.6.4 tibble_3.2.1 bookdown_0.40
## [22] desc_1.4.3 bslib_0.8.0 pillar_1.9.0
## [25] rlang_1.1.4 utf8_1.2.4 dir.expiry_1.12.0
## [28] cachem_1.1.0 xfun_0.46 fs_1.6.4
## [31] sass_0.4.9 cli_3.6.3 withr_3.0.0
## [34] pkgdown_2.1.0 magrittr_2.0.3 crosstalk_1.2.1
## [37] digest_0.6.36 grid_4.4.0 lifecycle_1.0.4
## [40] vctrs_0.6.5 evaluate_0.24.0 glue_1.7.0
## [43] ragg_1.3.2 fansi_1.0.6 rmarkdown_2.27
## [46] basilisk.utils_1.16.0 tools_4.4.0 pkgconfig_2.0.3
## [49] htmltools_0.5.8.1