Introduction to oc2bioc

The oc2bioc package (currently at github.com/vjcitn/oc2bioc) incorporates the open-cravat python modules. This workshop takes advantage of R functions and interfaces defined in that package to illustrate the programmatic use of OpenCRAVAT in R/Bioconductor.

Case studies of the use of the OpenCRAVAT GUI are published in a Journal of Clinical Oncology article (Pagel et al. (2020)).

Displaying and searching the resources

We acquire the metadata for all OpenCRAVAT ‘modules’ in real time using populate_module_set.

library(oc2bioc)
modset = populate_module_set()
modset
#> OpenCRAVATModuleSet, created  Fri Sep  6 20:21:16 2024 
#> There are 392 modules.
#> Module types/counts:
#>      aggregator       annotator          common       converter           group 
#>               1             236               2               8               7 
#>          mapper         package  postaggregator        reporter          webapp 
#>               2               5               4               8               1 
#> webviewerwidget 
#>             118

We transformed the module set object to a searchable HTML table:

Single-variant queries

The queryOC function uses the OpenCRAVAT REST API to acquire information on a single position in GRCh38 coordinates. This function requires that a registered username and password be supplied. Visit run.opencravat.org to set a username and password.

An example API response is available in var_in_tx in the oc2bioc package. This was returned by the call queryOC(chr="chr7", pos="140753336", annotators=c("pubmed", "segway_breast", "chasmplus_BRCA"))

var_in_tx
#> Response [https://run.opencravat.org/submit/annotate?chrom=chr7&pos=140753336&ref_base=A&alt_base=T&annotators=chasmplus_BRCA,pubmed,segway_breast]
#>   Date: 2020-09-07 10:33
#>   Status: 200
#>   Content-Type: application/json; charset=utf-8
#>   Size: 1.31 kB

We use httr::content to explore this result.

vcon = httr::content(var_in_tx)
names(vcon)
#> [1] "pubmed"         "segway_breast"  "chasmplus_BRCA" "crx"

A peek at the result:

names(vcon)
#> [1] "pubmed"         "segway_breast"  "chasmplus_BRCA" "crx"
str(head(vcon,3))
#> List of 3
#>  $ pubmed        :List of 2
#>   ..$ n   : int 3194
#>   ..$ term: chr "http://www.ncbi.nlm.nih.gov/pubmed?term=BRAF[TIAB] AND cancer[MH]"
#>  $ segway_breast :List of 2
#>   ..$ breast_myoepithelial_cells: chr "Transcribed"
#>   ..$ breast_vhmec              : chr "Transcribed"
#>  $ chasmplus_BRCA:List of 4
#>   ..$ score     : num 0.159
#>   ..$ transcript: chr "NM_004333.4"
#>   ..$ pval      : num 0.0307
#>   ..$ results   : chr "*NM_004333.4:(0.159:0.0307)"

Click here for the PubMed references related to the gene harboring the query variant.

The crx component of the response provides information on variant impacts at the transcript level.

nl = rjson::fromJSON(vcon$crx$all_mappings)
DT::datatable(data.frame(do.call(rbind, unlist(nl, recursive=FALSE))))

Viewing and assembling collections of variants for annotation

We’ll use TCGA as a source of realistic variant sets. The curatedTCGAData package will be used.

Acquiring variants and filtering to SNVs

library(curatedTCGAData)
suppressMessages({
acc = curatedTCGAData("ACC", assays="Mutation", dry.run=FALSE,
   cache=BiocFileCache::bfccache(BiocFileCache::BiocFileCache(ask=FALSE)),
   version="2.0.1")
})
eacc = experiments(acc)[[1]]
eacc
#> class: RaggedExperiment 
#> dim: 20166 90 
#> assays(47): Hugo_Symbol Entrez_Gene_Id ... Trna_alt1 Trna_alt2
#> rownames: NULL
#> colnames(90): TCGA-OR-A5J1-01A-11D-A29I-10 TCGA-OR-A5J2-01A-11D-A29I-10
#>   ... TCGA-PK-A5HB-01A-11D-A29I-10 TCGA-PK-A5HC-01A-11D-A30A-10
#> colData names(0):
muts = as(eacc, "GRangesList")
sum(elementNROWS(muts))
#> [1] 20166
DT::datatable(as.data.frame(head(muts[[1]][,1:4])))

Notice that there is a variant with width 24. We will focus on single-nucleotide variants (SNVs).

snvs = unlist(muts[width(muts)==1])
sum(elementNROWS(snvs))
#> [1] 18787

Visualizing variants in the context of gene models

We will use a function based on the TnT package to visualize variant locations in the context of gene-like features.

Here’s a 1Mb slice:

TnTdemo(acc, viewstart=6.4e7, viewend=6.5e7)

Now we drill down a bit, to a 0.1Mb region near MAP4K2:

TnTdemo(acc, viewstart=6.455e7, viewend=6.465e7)

Annotating a large collection of variants

We use the function make_oc_POSTable to transform a GRanges instance into a data.frame that can be saved to a file and submitted to the OpenCRAVAT API.

mdf = make_oc_POSTable(snvs)
head(mdf)
#>   chr       pos ref alt                         samp var
#> 1   1  11561526   G   A TCGA-OR-A5J1-01A-11D-A29I-10 v_1
#> 2   1  12309384   T   G TCGA-OR-A5J1-01A-11D-A29I-10 v_2
#> 3   1  33820015   C   T TCGA-OR-A5J1-01A-11D-A29I-10 v_3
#> 4   1 152800122   C   T TCGA-OR-A5J1-01A-11D-A29I-10 v_4
#> 5   1 152800131   C   T TCGA-OR-A5J1-01A-11D-A29I-10 v_5
#> 6   1 173499091   C   A TCGA-OR-A5J1-01A-11D-A29I-10 v_6

The data.frame must be formatted for ingestion by OpenCRAVAT. The following call to write.table will accomplish this.

tf = tempfile()
write.table(mdf, file=tf, sep="\t", col.names=FALSE, 
   row.names=FALSE, quote=FALSE)
head(read.delim(tf, h=FALSE, sep="\t"))
#>   V1        V2 V3 V4                           V5  V6
#> 1  1  11561526  G  A TCGA-OR-A5J1-01A-11D-A29I-10 v_1
#> 2  1  12309384  T  G TCGA-OR-A5J1-01A-11D-A29I-10 v_2
#> 3  1  33820015  C  T TCGA-OR-A5J1-01A-11D-A29I-10 v_3
#> 4  1 152800122  C  T TCGA-OR-A5J1-01A-11D-A29I-10 v_4
#> 5  1 152800131  C  T TCGA-OR-A5J1-01A-11D-A29I-10 v_5
#> 6  1 173499091  C  A TCGA-OR-A5J1-01A-11D-A29I-10 v_6

References

Pagel, Kymberleigh A., Rick Kim, Kyle Moad, Ben Busby, Lily Zheng, Collin Tokheim, Michael Ryan, and Rachel Karchin. 2020. Integrated Informatics Analysis of Cancer-Related Variants.” JCO Clinical Cancer Informatics, no. 4: 310–17. https://doi.org/10.1200/cci.19.00132.