RNCBIGene: NCBI Gene annotation in parquet

Introduction

Bioconductor’s annotation system has functioned for well over a decade as a pivotal resource for analysis and tool development. See the AnnotationDbi documentation for details. AnnotationDbi and packages based on it use SQLite to manage organism-specific vocabularies and to resolve queries about mappings between identifier sets. For example, learn the Gene Ontology annotations for gene ORMDL3 in Homo sapiens:

library(org.Hs.eg.db)
go_orm = AnnotationDbi::select(org.Hs.eg.db, key="ORMDL3", keytype="SYMBOL",
    columns=c("GO", "ENTREZID"))
head(go_orm)

##   SYMBOL         GO EVIDENCE ONTOLOGY ENTREZID
## 1 ORMDL3 GO:0002903      IMP       BP    94103
## 2 ORMDL3 GO:0005515      IPI       MF    94103
## 3 ORMDL3 GO:0005783      IDA       CC    94103
## 4 ORMDL3 GO:0005789      IEA       CC    94103
## 5 ORMDL3 GO:0005886      TAS       CC    94103
## 6 ORMDL3 GO:0006672      IBA       BP    94103

With RNCBIGene, all NCBI’s GO annotations for all organisms are in a single resource. The geneFromCache package will retrieve an Apache Parquet representation of the gene2go.gz text resource distributed at NCBI’s FTP site. For this query, we use the knowledge the ORMDL3 has Gene ID 94103.

library(RNCBIGene)
library(dplyr)
go_ds = arrow::open_dataset(geneFromCache("gene2go.parquet"))
lkor = go_ds |> filter(`#tax_id`==9606, GeneID==94103) |> collect() |> as.data.frame()
DT::datatable(lkor)

It is not the purpose of this package to provide pin-compatible replacements for the AnnotationDbi packages. That would be a huge effort, and material support for such an effort does not exist at this time. This package is created to explore the opportunities for simplification and performance enhancement arising from the adoption of Parquet and Arrow for annotation representation and interrogation.

Scope

There are seven parquet files that represent the NCBI Gene content available on February 22 2025.

RNCBIGene:::available_gene_parquet

## [1] "gene2go.parquet"                     
## [2] "gene2pubmed.parquet"                 
## [3] "gene_info.parquet"                   
## [4] "gene2accession.parquet"              
## [5] "gene2refseq.parquet"                 
## [6] "gene_orthologs.parquet"              
## [7] "gene_refseq_uniprotkb_collab.parquet"

Once we have cached them using geneFromCache, we can work with all of them “simultaneously”.

Because the retrieval process is laborious, the following chunk is not evaluated. Were it to be evaluated, it would be apparent that over 720 million records of annotation are available for interrogation.

paths = lapply(RNCBIGene:::available_gene_parquet, geneFromCache)
pop = lapply(paths, arrow::open_dataset)
names(pop) = RNCBIGene:::available_gene_parquet
sapply(pop, nrow)

For illustration, the content corresponding to four human genes on chromosome 17 has been saved in the package.

litp = dir(system.file("litparquet", package="RNCBIGene"), full.names=TRUE)
litpop = lapply(litp,
    arrow::open_dataset)
ns = basename(litp)
names(litpop) = ns
sapply(litpop, nrow)

##                    lit.gene_info.parquet 
##                                        4 
##               lit.gene_orthologs.parquet 
##                                     1546 
## lit.gene_refseq_uniprotkb_collab.parquet 
##                                      986 
##               lit.gene2accession.parquet 
##                                     3265 
##                      lit.gene2go.parquet 
##                                      364 
##                  lit.gene2pubmed.parquet 
##                                    15253 
##                  lit.gene2refseq.parquet 
##                                      844

Example: find the UniProt identifiers associated with gene ORMDL3 in humans.

pro = litpop[["lit.gene2accession.parquet"]] |> 
   filter(`#tax_id` == 9606, Symbol == "ORMDL3") |> 
   select(protein_accession.version) |> collect() |> pull() |> setdiff("-")
litpop[["lit.gene_refseq_uniprotkb_collab.parquet"]] |> filter(NCBI_tax_id == 9606) |> 
   filter(`#NCBI_protein_accession` %in% pro)  |> collect()

## # A tibble: 36 × 5
##    `#NCBI_protein_accession` UniProtKB_protein_ac…¹ NCBI_tax_id UniProtKB_tax_id
##    <chr>                     <chr>                        <int>            <int>
##  1 NP_001307730.1            B3KS83                        9606             9606
##  2 NP_001307730.1            J3QRM9                        9606             9606
##  3 NP_001307730.1            Q6UY83                        9606             9606
##  4 NP_001307730.1            Q8N138                        9606             9606
##  5 NP_001307731.1            B3KS83                        9606             9606
##  6 NP_001307731.1            J3QRM9                        9606             9606
##  7 NP_001307731.1            Q6UY83                        9606             9606
##  8 NP_001307731.1            Q8N138                        9606             9606
##  9 NP_001307732.1            B3KS83                        9606             9606
## 10 NP_001307732.1            J3QRM9                        9606             9606
## # ℹ 26 more rows
## # ℹ abbreviated name: ¹UniProtKB_protein_accession
## # ℹ 1 more variable: method <chr>

Applications

Tidyverse-oriented annotation mapping

Typically one would use org.Hs.eg.db to remap identifiers from a SummarizedExperiment. We want to be able to do this for any identifier used in NCBI annotation, and for any organism. Let’s start with a dataset for human airway transcriptomics. The features are annotated with Ensembl gene identifiers, and we will add map locations and MIM codes using mapIdsNG, a variant of AnnotationDbi’s mapIds functionality for NCBI Gene.

data(airway, package="airway")
requireNamespace("tidySummarizedExperiment")

## Loading required namespace: tidySummarizedExperiment

tse = as(airway, "tidySummarizedExperiment")
tse = tse |> dplyr::mutate(map_location=mapIdsNG(keys=.feature, keytype="Ensembl", column="map_location"))
tse = tse |> dplyr::mutate(MIM=mapIdsNG(keys=.feature, keytype="Ensembl", column="MIM"))
head(table(SummarizedExperiment::rowData(tse)$map_location))

## 
##               -         10p11.1        10p11.21  10p11.21-p11.1        10p11.22 
##              12               8              29               1              24 
## 10p11.22-p11.21 
##               1

Note that the identifiers that are conveniently available for mapIdsNG are currently limited to those in gene_info:

colnames(litpop[["lit.gene_info.parquet"]])

##  [1] "#tax_id"                              
##  [2] "GeneID"                               
##  [3] "Symbol"                               
##  [4] "LocusTag"                             
##  [5] "Synonyms"                             
##  [6] "dbXrefs"                              
##  [7] "chromosome"                           
##  [8] "map_location"                         
##  [9] "description"                          
## [10] "type_of_gene"                         
## [11] "Symbol_from_nomenclature_authority"   
## [12] "Full_name_from_nomenclature_authority"
## [13] "Nomenclature_status"                  
## [14] "Other_designations"                   
## [15] "Modification_date"                    
## [16] "Feature_type"

Session information

sessionInfo()

## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] dplyr_1.1.4          RNCBIGene_0.0.4      org.Hs.eg.db_3.21.0 
##  [4] AnnotationDbi_1.71.0 IRanges_2.43.0       S4Vectors_0.47.0    
##  [7] Biobase_2.69.0       BiocGenerics_0.55.0  generics_0.1.4      
## [10] BiocStyle_2.37.0    
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1                viridisLite_0.4.2              
##  [3] farver_2.1.2                    blob_1.2.4                     
##  [5] filelock_1.0.3                  arrow_20.0.0                   
##  [7] Biostrings_2.77.1               lazyeval_0.2.2                 
##  [9] fastmap_1.2.0                   BiocFileCache_2.99.4           
## [11] duckdb_1.2.2                    digest_0.6.37                  
## [13] lifecycle_1.0.4                 ellipsis_0.3.2                 
## [15] tidySummarizedExperiment_1.19.0 KEGGREST_1.49.0                
## [17] RSQLite_2.3.11                  magrittr_2.0.3                 
## [19] compiler_4.5.0                  rlang_1.1.6                    
## [21] sass_0.4.10                     tools_4.5.0                    
## [23] utf8_1.2.5                      yaml_2.3.10                    
## [25] data.table_1.17.2               knitr_1.50                     
## [27] S4Arrays_1.9.0                  htmlwidgets_1.6.4              
## [29] bit_4.6.0                       curl_6.2.2                     
## [31] DelayedArray_0.35.1             RColorBrewer_1.1-3             
## [33] abind_1.4-8                     withr_3.0.2                    
## [35] purrr_1.0.4                     desc_1.4.3                     
## [37] grid_4.5.0                      fansi_1.0.6                    
## [39] ggplot2_3.5.2                   scales_1.4.0                   
## [41] dichromat_2.0-0.1               SummarizedExperiment_1.39.0    
## [43] cli_3.6.5                       rmarkdown_2.29                 
## [45] crayon_1.5.3                    ragg_1.4.0                     
## [47] httr_1.4.7                      DBI_1.2.3                      
## [49] cachem_1.1.0                    stringr_1.5.1                  
## [51] assertthat_0.2.1                BiocManager_1.30.25            
## [53] XVector_0.49.0                  matrixStats_1.5.0              
## [55] vctrs_0.6.5                     Matrix_1.7-3                   
## [57] ttservice_0.4.1                 jsonlite_2.0.0                 
## [59] bookdown_0.43                   bit64_4.6.0-1                  
## [61] systemfonts_1.2.3               crosstalk_1.2.1                
## [63] plotly_4.10.4                   tidyr_1.3.1                    
## [65] jquerylib_0.1.4                 glue_1.8.0                     
## [67] pkgdown_2.1.2                   DT_0.33                        
## [69] stringi_1.8.7                   gtable_0.3.6                   
## [71] GenomeInfoDb_1.45.3             GenomicRanges_1.61.0           
## [73] UCSC.utils_1.5.0                tibble_3.2.1                   
## [75] pillar_1.10.2                   rappdirs_0.3.3                 
## [77] htmltools_0.5.8.1               R6_2.6.1                       
## [79] dbplyr_2.5.0                    httr2_1.1.2                    
## [81] textshaping_1.0.1               evaluate_1.0.3                 
## [83] lattice_0.22-7                  png_0.1-8                      
## [85] memoise_2.0.1                   bslib_0.9.0                    
## [87] SparseArray_1.9.0               xfun_0.52                      
## [89] fs_1.6.6                        MatrixGenerics_1.21.0          
## [91] pkgconfig_2.0.3

Vincent J. Carey, stvjc at channing.harvard.edu

May 20, 2025

Introduction

Scope

Applications

Tidyverse-oriented annotation mapping

Session information