A1: initial tutorial for biocBiocypher

Introduction

This vignette works through the material in the BioCypher “Adding data” tutorial. Our objective is to use R and Python together, with basilisk managing the Python infrastructure.

We will simulate records with schematized information about 10 proteins and then use write_nodes to write them out as CSV-style delimited text.

The primary interface

loadBiocypher connects the BioCypher modules to R via basilisk and reticulate. All of the Python code runs in a completely isolated miniconda environment, currently using Python 3.9.

library(biocBiocypher)
bcobj = loadBiocypher()

## + /home/vincent/.cache/R/basilisk/1.16.0/0/bin/conda create --yes --prefix /home/vincent/.cache/R/basilisk/1.16.0/biocBiocypher/0.0.2/bsklenv 'python=3.9' --quiet -c conda-forge

## + /home/vincent/.cache/R/basilisk/1.16.0/0/bin/conda install --yes --prefix /home/vincent/.cache/R/basilisk/1.16.0/biocBiocypher/0.0.2/bsklenv 'python=3.9' -c conda-forge

## + /home/vincent/.cache/R/basilisk/1.16.0/0/bin/conda install --yes --prefix /home/vincent/.cache/R/basilisk/1.16.0/biocBiocypher/0.0.2/bsklenv -c conda-forge 'python=3.9' 'python=3.9' 'numpy=1.23.1' 'pandas=1.4.4'

bcobj

## biocypher_refs produced with basilisk.
##  use $biocypher_ref for modules, $generator_ref for simulator

The data generator

gen = bcobj$generator_ref
names(gen)

##  [1] "BioCypher"                    "Complex"                     
##  [3] "EntrezProtein"                "Interaction"                 
##  [5] "InteractionGenerator"         "Node"                        
##  [7] "node_generator"               "Protein"                     
##  [9] "ProteinProteinInteraction"    "r"                           
## [11] "random"                       "RandomPropertyProtein"       
## [13] "RandomPropertyProteinIsoform" "string"

The following R code generates records on 10 proteins:

prots = lapply(1:10, function(x) gen$Protein())
names(prots[[1]])

## [1] "get_id"         "get_label"      "get_properties" "id"            
## [5] "label"          "properties"

prots[[1]]$properties

## $sequence
## [1] "GMFDKPDEKETCPNDSILKKIHHYMDVSRVDHNDILVDADGGEWSNCLACAKQIDTAWMYKTYF"
## 
## $description
## [1] "y k m s n u s c i y"
## 
## $taxon
## [1] "9606"

However, this list exists only in R; the Python main module (__main__) does not know about it. To define the proteins there, we run the Python code directly:

reticulate::py_run_string("proteins = [Protein() for _ in range(10)]")
names(reticulate::py)  # symbols known to main

##  [1] "BioCypher"                    "Complex"                     
##  [3] "EntrezProtein"                "Interaction"                 
##  [5] "InteractionGenerator"         "Node"                        
##  [7] "node_generator"               "Protein"                     
##  [9] "ProteinProteinInteraction"    "proteins"                    
## [11] "r"                            "random"                      
## [13] "RandomPropertyProtein"        "RandomPropertyProteinIsoform"
## [15] "string"

Producing the graph nodes

Several configuration files are defined for this specific tutorial.

bc_config_path = system.file("tutorial_0.5.11", 
     "01_biocypher_config.yaml", package="biocBiocypher")
schema_config_path = system.file("tutorial_0.5.11", 
     "01_schema_config.yaml", package="biocBiocypher")
readLines(schema_config_path)

## [1] "protein:"                         "    represented_as: node"        
## [3] "    preferred_id: uniprot"        "    input_label: uniprot_protein"
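
To illustrate how input_label ties incoming records to a schema entry, here is a small Python sketch. The dict mirrors the YAML shown above. The function schema_entry_for is hypothetical and is not part of BioCypher, which performs this mapping internally in its own way.

```python
# The schema YAML above, written as the nested dict a YAML parser would produce.
schema = {
    "protein": {
        "represented_as": "node",
        "preferred_id": "uniprot",
        "input_label": "uniprot_protein",
    }
}

def schema_entry_for(input_label, schema):
    """Return the schema entry name whose input_label matches, or None."""
    for name, entry in schema.items():
        if entry.get("input_label") == input_label:
            return name
    return None

print(schema_entry_for("uniprot_protein", schema))  # protein
print(schema_entry_for("unknown_label", schema))    # None
```

Records whose label has no matching input_label have no schema entry to map to, which is why the label used by the data generator must agree with the configuration.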

These configurations are loaded into the main interface. We “update” the YAML in bc_config_path so that the output folder is user-selectable. The default output folder is a temporary folder.

bc = bcobj$biocypher_ref
bc_configd = bc$BioCypher(
    biocypher_config_path=update_bc_config(bc_config_path),
    schema_config_path=schema_config_path
)

The node_generator was written to use a globally defined variable proteins. That was defined above with py_run_string.
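
A generator of this kind can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's code: the FakeProtein class and its values are hypothetical, and the sketch assumes that each node is passed on as an (id, label, properties) tuple.

```python
class FakeProtein:
    """Hypothetical record exposing the accessors listed by names(prots[[1]])."""

    def __init__(self, id, label, properties):
        self.id = id
        self.label = label
        self.properties = properties

    def get_id(self):
        return self.id

    def get_label(self):
        return self.label

    def get_properties(self):
        return self.properties

# Module-level (global) list; node_generator looks this name up when it runs.
proteins = [
    FakeProtein(f"P{i:05d}", "uniprot_protein", {"taxon": "9606"})
    for i in range(3)
]

def node_generator():
    """Yield one (id, label, properties) tuple per record in the global list."""
    for protein in proteins:
        yield (protein.get_id(), protein.get_label(), protein.get_properties())

nodes = list(node_generator())
print(nodes[0])  # ('P00000', 'uniprot_protein', {'taxon': '9606'})
```

Because the function body refers to proteins without receiving it as an argument, the name must exist as a global before the generator is consumed.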

We write out the nodes:

bc_configd$write_nodes(gen$node_generator())

## [1] TRUE

We can retrieve the configured output directory from bc_configd. In this case the output consists of a ‘header’ file and a semicolon-delimited data file. Below we parse the two files, combine them into one data frame, and display it as a searchable HTML table.

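# Output directory recorded in the BioCypher configuration
od = bc_configd$base_config$output_directory
# Data ("part") file and header file found in that directory
fi = dir(od, full.names=TRUE, pattern="part")
hf = dir(od, full.names=TRUE, pattern="head")
# The header file holds the column names on a single semicolon-delimited line
he = strsplit(readLines(hf, warn=FALSE), ";")[[1]]
dat = read.delim(fi, sep=";", header=FALSE)
names(dat) = he
library(DT)
datatable(dat)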
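
The same header-plus-data pattern can be handled on the Python side with the standard library. The sketch below is self-contained: it writes two small stand-in files first, and their names and contents are invented for illustration rather than copied from real BioCypher output. The helper read_with_header is likewise hypothetical.

```python
import csv
import tempfile
from pathlib import Path

# Stand-in files so the example runs on its own (invented names and contents).
out_dir = Path(tempfile.mkdtemp())
(out_dir / "Protein-header.csv").write_text("id;sequence;taxon\n")
(out_dir / "Protein-part000.csv").write_text("P00001;MKT;9606\nP00002;GAV;9606\n")

def read_with_header(directory):
    """Combine a one-line header file with a semicolon-delimited data file."""
    header_file = next(directory.glob("*head*"))
    part_file = next(directory.glob("*part*"))
    with header_file.open(newline="") as fh:
        columns = next(csv.reader(fh, delimiter=";"))
    with part_file.open(newline="") as fh:
        return [dict(zip(columns, row)) for row in csv.reader(fh, delimiter=";")]

rows = read_with_header(out_dir)
print(rows[0])  # {'id': 'P00001', 'sequence': 'MKT', 'taxon': '9606'}
```

Like the R code, this assumes a single header file and a single data file in the output directory.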

The summary method prints the ontology structure implied by the schema configuration:

cat(reticulate::py_capture_output(bc_configd$summary()))

## Showing ontology structure based on https://github.com/biolink/biolink-model/raw/v3.2.1/biolink-model.owl.ttl
## entity
## └── named thing
##     └── biological entity
##         └── polypeptide
##             └── protein

Summary

The main BioCypher processes we have examined thus far are:

configuring metadata about protein annotations to be handled, in YAML

protein:                            # mapping
  represented_as: node              # schema configuration
  preferred_id: uniprot             # uniqueness
  input_label: uniprot_protein      # connection to input stream

configuring the “back end”, in this case ‘offline’, also in YAML

transforming a stream of protein identifier data into graph nodes

bc_configd$write_nodes(gen$node_generator())

The node information was serialized to a tabular format.

In the next tutorial we will combine information from different annotation types.

Session information

sessionInfo()

## R version 4.4.0 Patched (2024-04-29 r86495)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/vincent/R-4-4-dist2/lib/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] DT_0.33             biocBiocypher_0.0.2 basilisk_1.16.0    
## [4] reticulate_1.38.0   dplyr_1.1.4         BiocStyle_2.32.1   
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-0          jsonlite_1.8.8        compiler_4.4.0       
##  [4] BiocManager_1.30.23   filelock_1.0.3        Rcpp_1.0.13          
##  [7] tidyselect_1.2.1      parallel_4.4.0        jquerylib_0.1.4      
## [10] png_0.1-8             systemfonts_1.1.0     textshaping_0.4.0    
## [13] yaml_2.3.10           fastmap_1.2.0         lattice_0.22-6       
## [16] R6_2.5.1              generics_0.1.3        knitr_1.48           
## [19] htmlwidgets_1.6.4     tibble_3.2.1          bookdown_0.40        
## [22] desc_1.4.3            bslib_0.8.0           pillar_1.9.0         
## [25] rlang_1.1.4           utf8_1.2.4            dir.expiry_1.12.0    
## [28] cachem_1.1.0          xfun_0.46             fs_1.6.4             
## [31] sass_0.4.9            cli_3.6.3             withr_3.0.0          
## [34] pkgdown_2.1.0         magrittr_2.0.3        crosstalk_1.2.1      
## [37] digest_0.6.36         grid_4.4.0            lifecycle_1.0.4      
## [40] vctrs_0.6.5           evaluate_0.24.0       glue_1.7.0           
## [43] ragg_1.3.2            fansi_1.0.6           rmarkdown_2.27       
## [46] basilisk.utils_1.16.0 tools_4.4.0           pkgconfig_2.0.3      
## [49] htmltools_0.5.8.1