RNCBIGene: NCBI Gene annotation in parquet
Vincent J. Carey, stvjc at channing.harvard.edu
May 20, 2025
Source:vignettes/RNCBIGene.Rmd
RNCBIGene.RmdIntroduction
Bioconductor’s annotation system has functioned for well over a decade as a pivotal resource for analysis and tool development. See the AnnotationDbi documentation for details. AnnotationDbi and packages based on it use SQLite to manage organism-specific vocabularies and to resolve queries about mappings between identifier sets. For example, learn the Gene Ontology annotations for gene ORMDL3 in Homo sapiens:
library(org.Hs.eg.db)
go_orm = AnnotationDbi::select(org.Hs.eg.db, key="ORMDL3", keytype="SYMBOL",
columns=c("GO", "ENTREZID"))
head(go_orm)## SYMBOL GO EVIDENCE ONTOLOGY ENTREZID
## 1 ORMDL3 GO:0002903 IMP BP 94103
## 2 ORMDL3 GO:0005515 IPI MF 94103
## 3 ORMDL3 GO:0005783 IDA CC 94103
## 4 ORMDL3 GO:0005789 IEA CC 94103
## 5 ORMDL3 GO:0005886 TAS CC 94103
## 6 ORMDL3 GO:0006672 IBA BP 94103
With RNCBIGene, all NCBI’s GO annotations for all organisms are in a single resource. The geneFromCache package will retrieve an Apache Parquet representation of the gene2go.gz text resource distributed at NCBI’s FTP site. For this query, we use the knowledge the ORMDL3 has Gene ID 94103.
library(RNCBIGene)
library(dplyr)
go_ds = arrow::open_dataset(geneFromCache("gene2go.parquet"))
lkor = go_ds |> filter(`#tax_id`==9606, GeneID==94103) |> collect() |> as.data.frame()
DT::datatable(lkor)It is not the purpose of this package to provide pin-compatible replacements for the AnnotationDbi packages. That would be a huge effort, and material support for such an effort does not exist at this time. This package is created to explore the opportunities for simplification and performance enhancement arising from the adoption of Parquet and Arrow for annotation representation and interrogation.
Scope
There are seven parquet files that represent the NCBI Gene content available on February 22 2025.
RNCBIGene:::available_gene_parquet## [1] "gene2go.parquet"
## [2] "gene2pubmed.parquet"
## [3] "gene_info.parquet"
## [4] "gene2accession.parquet"
## [5] "gene2refseq.parquet"
## [6] "gene_orthologs.parquet"
## [7] "gene_refseq_uniprotkb_collab.parquet"
Once we have cached them using geneFromCache, we can work with all of them “simultaneously”.
Because the retrieval process is laborious, the following chunk is not evaluated. Were it to be evaluated, it would be apparent that over 720 million records of annotation are available for interrogation.
paths = lapply(RNCBIGene:::available_gene_parquet, geneFromCache)
pop = lapply(paths, arrow::open_dataset)
names(pop) = RNCBIGene:::available_gene_parquet
sapply(pop, nrow)For illustration, the content corresponding to four human genes on chromosome 17 has been saved in the package.
litp = dir(system.file("litparquet", package="RNCBIGene"), full.names=TRUE)
litpop = lapply(litp,
arrow::open_dataset)
ns = basename(litp)
names(litpop) = ns
sapply(litpop, nrow)## lit.gene_info.parquet
## 4
## lit.gene_orthologs.parquet
## 1546
## lit.gene_refseq_uniprotkb_collab.parquet
## 986
## lit.gene2accession.parquet
## 3265
## lit.gene2go.parquet
## 364
## lit.gene2pubmed.parquet
## 15253
## lit.gene2refseq.parquet
## 844
Example: find the UniProt identifiers associated with gene ORMDL3 in humans.
pro = litpop[["lit.gene2accession.parquet"]] |>
filter(`#tax_id` == 9606, Symbol == "ORMDL3") |>
select(protein_accession.version) |> collect() |> pull() |> setdiff("-")
litpop[["lit.gene_refseq_uniprotkb_collab.parquet"]] |> filter(NCBI_tax_id == 9606) |>
filter(`#NCBI_protein_accession` %in% pro) |> collect()## # A tibble: 36 × 5
## `#NCBI_protein_accession` UniProtKB_protein_ac…¹ NCBI_tax_id UniProtKB_tax_id
## <chr> <chr> <int> <int>
## 1 NP_001307730.1 B3KS83 9606 9606
## 2 NP_001307730.1 J3QRM9 9606 9606
## 3 NP_001307730.1 Q6UY83 9606 9606
## 4 NP_001307730.1 Q8N138 9606 9606
## 5 NP_001307731.1 B3KS83 9606 9606
## 6 NP_001307731.1 J3QRM9 9606 9606
## 7 NP_001307731.1 Q6UY83 9606 9606
## 8 NP_001307731.1 Q8N138 9606 9606
## 9 NP_001307732.1 B3KS83 9606 9606
## 10 NP_001307732.1 J3QRM9 9606 9606
## # ℹ 26 more rows
## # ℹ abbreviated name: ¹UniProtKB_protein_accession
## # ℹ 1 more variable: method <chr>
Applications
Tidyverse-oriented annotation mapping
Typically one would use org.Hs.eg.db to remap identifiers from a SummarizedExperiment. We want to be able to do this for any identifier used in NCBI annotation, and for any organism. Let’s start with a dataset for human airway transcriptomics. The features are annotated with Ensembl gene identifiers, and we will add map locations and MIM codes using mapIdsNG, a variant of AnnotationDbi’s mapIds functionality for NCBI Gene.
data(airway, package="airway")
requireNamespace("tidySummarizedExperiment")## Loading required namespace: tidySummarizedExperiment
tse = as(airway, "tidySummarizedExperiment")
tse = tse |> dplyr::mutate(map_location=mapIdsNG(keys=.feature, keytype="Ensembl", column="map_location"))
tse = tse |> dplyr::mutate(MIM=mapIdsNG(keys=.feature, keytype="Ensembl", column="MIM"))
head(table(SummarizedExperiment::rowData(tse)$map_location))##
## - 10p11.1 10p11.21 10p11.21-p11.1 10p11.22
## 12 8 29 1 24
## 10p11.22-p11.21
## 1
Note that the identifiers that are conveniently available for mapIdsNG are currently limited to those in gene_info:
colnames(litpop[["lit.gene_info.parquet"]])## [1] "#tax_id"
## [2] "GeneID"
## [3] "Symbol"
## [4] "LocusTag"
## [5] "Synonyms"
## [6] "dbXrefs"
## [7] "chromosome"
## [8] "map_location"
## [9] "description"
## [10] "type_of_gene"
## [11] "Symbol_from_nomenclature_authority"
## [12] "Full_name_from_nomenclature_authority"
## [13] "Nomenclature_status"
## [14] "Other_designations"
## [15] "Modification_date"
## [16] "Feature_type"
Session information
## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_1.1.4 RNCBIGene_0.0.4 org.Hs.eg.db_3.21.0
## [4] AnnotationDbi_1.71.0 IRanges_2.43.0 S4Vectors_0.47.0
## [7] Biobase_2.69.0 BiocGenerics_0.55.0 generics_0.1.4
## [10] BiocStyle_2.37.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 viridisLite_0.4.2
## [3] farver_2.1.2 blob_1.2.4
## [5] filelock_1.0.3 arrow_20.0.0
## [7] Biostrings_2.77.1 lazyeval_0.2.2
## [9] fastmap_1.2.0 BiocFileCache_2.99.4
## [11] duckdb_1.2.2 digest_0.6.37
## [13] lifecycle_1.0.4 ellipsis_0.3.2
## [15] tidySummarizedExperiment_1.19.0 KEGGREST_1.49.0
## [17] RSQLite_2.3.11 magrittr_2.0.3
## [19] compiler_4.5.0 rlang_1.1.6
## [21] sass_0.4.10 tools_4.5.0
## [23] utf8_1.2.5 yaml_2.3.10
## [25] data.table_1.17.2 knitr_1.50
## [27] S4Arrays_1.9.0 htmlwidgets_1.6.4
## [29] bit_4.6.0 curl_6.2.2
## [31] DelayedArray_0.35.1 RColorBrewer_1.1-3
## [33] abind_1.4-8 withr_3.0.2
## [35] purrr_1.0.4 desc_1.4.3
## [37] grid_4.5.0 fansi_1.0.6
## [39] ggplot2_3.5.2 scales_1.4.0
## [41] dichromat_2.0-0.1 SummarizedExperiment_1.39.0
## [43] cli_3.6.5 rmarkdown_2.29
## [45] crayon_1.5.3 ragg_1.4.0
## [47] httr_1.4.7 DBI_1.2.3
## [49] cachem_1.1.0 stringr_1.5.1
## [51] assertthat_0.2.1 BiocManager_1.30.25
## [53] XVector_0.49.0 matrixStats_1.5.0
## [55] vctrs_0.6.5 Matrix_1.7-3
## [57] ttservice_0.4.1 jsonlite_2.0.0
## [59] bookdown_0.43 bit64_4.6.0-1
## [61] systemfonts_1.2.3 crosstalk_1.2.1
## [63] plotly_4.10.4 tidyr_1.3.1
## [65] jquerylib_0.1.4 glue_1.8.0
## [67] pkgdown_2.1.2 DT_0.33
## [69] stringi_1.8.7 gtable_0.3.6
## [71] GenomeInfoDb_1.45.3 GenomicRanges_1.61.0
## [73] UCSC.utils_1.5.0 tibble_3.2.1
## [75] pillar_1.10.2 rappdirs_0.3.3
## [77] htmltools_0.5.8.1 R6_2.6.1
## [79] dbplyr_2.5.0 httr2_1.1.2
## [81] textshaping_1.0.1 evaluate_1.0.3
## [83] lattice_0.22-7 png_0.1-8
## [85] memoise_2.0.1 bslib_0.9.0
## [87] SparseArray_1.9.0 xfun_0.52
## [89] fs_1.6.6 MatrixGenerics_1.21.0
## [91] pkgconfig_2.0.3