Introduction

This vignette builds on the A1 vignette, which imported information on simulated proteins identified only by UniProt identifiers.

Here we use two identifier types: UniProt and NCBI Entrez.

Setup

library(biocBiocypher)
bcobj = loadBiocypher()
bcobj
## biocypher_refs produced with basilisk.
##  use $biocypher_ref for modules, $generator_ref for simulator

The data generator

gen = bcobj$generator_ref
names(gen)
##  [1] "BioCypher"                    "Complex"                     
##  [3] "EntrezProtein"                "Interaction"                 
##  [5] "InteractionGenerator"         "Node"                        
##  [7] "node_generator"               "Protein"                     
##  [9] "ProteinProteinInteraction"    "r"                           
## [11] "random"                       "RandomPropertyProtein"       
## [13] "RandomPropertyProteinIsoform" "string"

In the previous vignette, we used Protein(). Now we will use EntrezProtein().

prots = lapply(1:10, function(x) gen$EntrezProtein())
names(prots[[1]])
## [1] "get_id"         "get_label"      "get_properties" "id"            
## [5] "label"          "properties"
prots[[1]]$properties
## $sequence
## [1] "FTQGRWSQCMHIRVKKEMYEHGFLDGPMIHSRPPDQLQHNAEAPNNDEGDPLNRFLWFDIDADAEFIASLRMIGLCAAEGHMQGNLSVQLLEWMSWIPYECLPPFDIGKETTW"
## 
## $description
## [1] "m l f r v w j v o x"
## 
## $taxon
## [1] "9606"

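The Python code that generates UniProt and Entrez proteins together interleaves the two kinds in a single list named proteins. We store the code in an R string and write it to a temporary file: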

twop = "proteins = [
    p for sublist in zip(
        [Protein() for _ in range(10)],
        [EntrezProtein() for _ in range(10)],
    ) for p in sublist
]"
tt = tempfile()
writeLines(twop, tt)
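
To see what the comprehension stored in twop does, here is a minimal standalone sketch. The two classes below are stand-ins written for illustration only (they are not the tutorial's generator classes), and their label values are assumed to match the input labels named in the schema configuration.

```python
# Stand-in classes for illustration; the real Protein and EntrezProtein
# come from the tutorial's data generator and carry ids and properties.
class Protein:
    label = "uniprot_protein"  # assumed label


class EntrezProtein:
    label = "entrez_protein"  # assumed label


# zip pairs the i-th Protein with the i-th EntrezProtein; the outer
# comprehension flattens each pair, so the two kinds alternate.
proteins = [
    p for sublist in zip(
        [Protein() for _ in range(3)],
        [EntrezProtein() for _ in range(3)],
    ) for p in sublist
]

labels = [p.label for p in proteins]
print(labels)
```

With 10 objects of each kind, as in the vignette, the same pattern gives a list of 20 alternating entries.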

We run the file with reticulate. This defines proteins, which holds the simulated protein metadata and sequences, in Python's main module.

reticulate::py_run_file(tt)

names(reticulate::py)  # symbols known to main
##  [1] "BioCypher"                    "Complex"                     
##  [3] "EntrezProtein"                "Interaction"                 
##  [5] "InteractionGenerator"         "Node"                        
##  [7] "node_generator"               "Protein"                     
##  [9] "ProteinProteinInteraction"    "proteins"                    
## [11] "r"                            "random"                      
## [13] "RandomPropertyProtein"        "RandomPropertyProteinIsoform"
## [15] "string"

Producing the graph nodes

As before, we use the configuration files defined for this specific tutorial.

bc_config_path = system.file("tutorial_0.5.11", 
     "02_biocypher_config.yaml", package="biocBiocypher")
schema_config_path = system.file("tutorial_0.5.11", 
     "02_schema_config.yaml", package="biocBiocypher")
readLines(schema_config_path)
## [1] "protein:"                                          
## [2] "    represented_as: node"                          
## [3] "    preferred_id: uniprot"                         
## [4] "    input_label: [uniprot_protein, entrez_protein]"
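
The effect of listing two input labels under one schema entry can be sketched as a lookup from input label to node type. This is an illustration of the idea only, not BioCypher's implementation; the dictionary mirrors the YAML above, and build_label_lookup is a hypothetical helper.

```python
# Dictionary mirroring the schema YAML shown above.
schema = {
    "protein": {
        "represented_as": "node",
        "preferred_id": "uniprot",
        "input_label": ["uniprot_protein", "entrez_protein"],
    }
}


def build_label_lookup(schema):
    """Map every input label to the schema entry that declares it (hypothetical helper)."""
    lookup = {}
    for node_type, spec in schema.items():
        labels = spec["input_label"]
        if isinstance(labels, str):  # a single label may be given as a plain string
            labels = [labels]
        for label in labels:
            lookup[label] = node_type
    return lookup


lookup = build_label_lookup(schema)
print(lookup)
```

Both input labels resolve to the same protein entry, which is why a single node type appears in the output.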

These configurations are loaded into the main interface. We “update” the YAML in bc_config_path so that the output folder is user-selectable. The default output folder is a temporary folder.

bc = bcobj$biocypher_ref
bc_configd = bc$BioCypher(
    biocypher_config_path=update_bc_config(bc_config_path),
    schema_config_path=schema_config_path
)

The node_generator was written to use a globally defined Python variable, proteins. That variable was defined above, when py_run_file ran the temporary script.

We write out the nodes:

bc_configd$write_nodes(gen$node_generator())
## [1] TRUE
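
The sketch below shows one way a generator can depend on a global list. It is an assumption about the general shape, not the package's actual node_generator: StubProtein is a hypothetical stand-in that exposes the get_id, get_label and get_properties methods listed earlier, and the example ids are invented.

```python
class StubProtein:
    """Hypothetical stand-in exposing the accessor methods shown earlier."""

    def __init__(self, id, label, properties):
        self.id = id
        self.label = label
        self.properties = properties

    def get_id(self):
        return self.id

    def get_label(self):
        return self.label

    def get_properties(self):
        return self.properties


# Module-level (global) list, analogous to the one created by the script.
proteins = [
    StubProtein("P00001", "uniprot_protein", {"taxon": "9606"}),
    StubProtein("12345", "entrez_protein", {"taxon": "9606"}),
]


def node_generator():
    # Reads the global `proteins`; it must exist before iteration starts.
    for protein in proteins:
        yield (protein.get_id(), protein.get_label(), protein.get_properties())


nodes = list(node_generator())
print(nodes)
```

Because the generator looks up proteins only when it is iterated, the list has to exist in the main module before write_nodes consumes the generator.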

We retrieve the configured output directory from bc_configd. The files are again a ‘header’ and a semicolon-delimited data file. We parse and put them together in the following, then create a searchable HTML table.

od = bc_configd$base_config$output_directory
fi = dir(od, full.names=TRUE, pattern="part")
he = strsplit(readLines(dir(od, full.names=TRUE, pattern="head"), warn=FALSE), ";")[[1]]
dat = read.delim(fi, sep=";", header=FALSE)
names(dat) = he
library(DT)
datatable(dat)
cat(reticulate::py_capture_output(bc_configd$summary()))
## Showing ontology structure based on https://github.com/biolink/biolink-model/raw/v3.2.1/biolink-model.owl.ttl
## entity
## └── named thing
##     └── biological entity
##         └── polypeptide
##             └── protein
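
The header-plus-data parsing step above can also be sketched in Python with the standard library. The file names, column names and rows below are invented for illustration and do not reproduce the actual BioCypher output.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as od:
    out = Path(od)
    # Hypothetical files: a one-line header and a semicolon-delimited data part.
    (out / "Protein-header.csv").write_text("id;taxon;label\n")
    (out / "Protein-part000.csv").write_text(
        "P00001;9606;protein\n12345;9606;protein\n"
    )

    # Locate the files by name fragment, as the R code does with dir().
    header_file = next(out.glob("*head*"))
    part_file = next(out.glob("*part*"))

    # The header holds the column names; the part file holds the records.
    header = header_file.read_text().strip().split(";")
    with part_file.open(newline="") as fh:
        rows = [dict(zip(header, rec)) for rec in csv.reader(fh, delimiter=";")]

print(rows)
```

Keeping column names in a separate header file means the data file contains records only, so the names have to be attached after reading.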

Summary

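Data generation in this vignette uses two identifier namespaces, UniProt and NCBI Entrez, to produce graph nodes. The schema configuration maps both input labels to a single protein node type.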

Session information

## R version 4.4.0 Patched (2024-04-29 r86495)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/vincent/R-4-4-dist2/lib/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] DT_0.33             biocBiocypher_0.0.2 basilisk_1.16.0    
## [4] reticulate_1.38.0   dplyr_1.1.4         BiocStyle_2.32.1   
## 
## loaded via a namespace (and not attached):
##  [1] Matrix_1.7-0          jsonlite_1.8.8        compiler_4.4.0       
##  [4] BiocManager_1.30.23   filelock_1.0.3        Rcpp_1.0.13          
##  [7] tidyselect_1.2.1      parallel_4.4.0        jquerylib_0.1.4      
## [10] png_0.1-8             systemfonts_1.1.0     textshaping_0.4.0    
## [13] yaml_2.3.10           fastmap_1.2.0         lattice_0.22-6       
## [16] R6_2.5.1              generics_0.1.3        knitr_1.48           
## [19] htmlwidgets_1.6.4     tibble_3.2.1          bookdown_0.40        
## [22] desc_1.4.3            bslib_0.8.0           pillar_1.9.0         
## [25] rlang_1.1.4           utf8_1.2.4            dir.expiry_1.12.0    
## [28] cachem_1.1.0          xfun_0.46             fs_1.6.4             
## [31] sass_0.4.9            cli_3.6.3             withr_3.0.0          
## [34] pkgdown_2.1.0         magrittr_2.0.3        crosstalk_1.2.1      
## [37] digest_0.6.36         grid_4.4.0            lifecycle_1.0.4      
## [40] vctrs_0.6.5           evaluate_0.24.0       glue_1.7.0           
## [43] ragg_1.3.2            fansi_1.0.6           rmarkdown_2.27       
## [46] basilisk.utils_1.16.0 tools_4.4.0           pkgconfig_2.0.3      
## [49] htmltools_0.5.8.1