BiocAnnotationLab: developing new approaches to genomic annotation curation
Vincent J. Carey, stvjc at channing.harvard.edu
February 19, 2026
Source:vignettes/BiocAnnotationLab.Rmd
BiocAnnotationLab.RmdMotivation
Bioconductor’s annotation resources are extensive and can be challenging to maintain for various reasons. This package collects information relevant to
- reducing complexity of annotation management
- taking advantage of new approaches to data representation
- supporting “self-service” solutions for those needing rapid revision of annotation resources
Basic facts:
-
GO.db- Package is downloaded by over 20000 distinct IPs per month
- It is powered by a 70MB SQLite database with 14 tables following a bespoke schema
- The production of the GO.db package is mixed with production of numerous other annotation packages in the BioconductorAnnotationPipeline system.
- A parquet-based representation with equivalent content can fit in 7MB of disk. See GO.db3 and note the OBO-parquet transformation resource at BiocGOprep.
- Yet another approach to Gene Ontology (and indeed all ontologies at OBOFoundry) leverages semantic SQL – see ontoProc2, in development.
-
org.*.*.db- All the org.*.eg.db are based on curation of resources from NCBI
- These do not need to be carved up for different organisms; see the RNCBIgene package in development.
- Project-specific resources like SGD, TAIR, PLASMO had special casing in the pipeline, and this needs a fresh, maintainable approach
-
TxDb.*.*...- The architect is still with the project
- Tooling to move from SQLite to parquet is available and produces 7x reduction in footprint
- Upstream tasks are well-documented, but should be reviewed regularly
Gene Ontology: approaches to curation
GO.db
Based on download counts, GO.db is a very popular resource in Bioconductor. Here’s a basic use case, taken from the sources of goana.R in limma.
TERM <- suppressMessages(AnnotationDbi::select(GO.db::GO.db,keys=GOID,columns="TERM"))
Try it out:
GOID = c("GO:0001960", "GO:0001961", "GO:0010803", "GO:0060334", "GO:0060338",
"GO:0070099", "GO:0070103", "GO:0070107", "GO:0070758", "GO:1900234",
"GO:1902205", "GO:1902211", "GO:1902214", "GO:1902226", "GO:1903881",
"GO:2000446", "GO:2000492", "GO:2000659")
library(GO.db)
TERM <- suppressMessages(AnnotationDbi::select(GO.db::GO.db,keys=GOID,columns="TERM"))
head(TERM)## GOID TERM
## 1 GO:0001960 negative regulation of cytokine-mediated signaling pathway
## 2 GO:0001961 positive regulation of cytokine-mediated signaling pathway
## 3 GO:0010803 regulation of tumor necrosis factor-mediated signaling pathway
## 4 GO:0060334 regulation of type II interferon-mediated signaling pathway
## 5 GO:0060338 regulation of type I interferon-mediated signaling pathway
## 6 GO:0070099 regulation of chemokine-mediated signaling pathway
GO.db is self-descriptive.
GO.db## GODb object:
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2025-07-22
## | Db type: GODb
## | package: AnnotationDbi
## | DBSCHEMA: GO_DB
## | GOEGSOURCEDATE: 2025-Sep24
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | DBSCHEMAVERSION: 2.1
##
## Please see: help('select') for usage information
We can learn about the underlying database.
## [1] 73560064
DBI::dbListTables(GO_dbconn())## [1] "go_bp_offspring" "go_bp_parents" "go_cc_offspring" "go_cc_parents"
## [5] "go_mf_offspring" "go_mf_parents" "go_obsolete" "go_ontology"
## [9] "go_synonym" "go_term" "map_counts" "map_metadata"
## [13] "metadata" "sqlite_stat1"
GO.db presents environments with traversals of the ontological hierarchy.
get("GO:0001959", GO.db::GOBPCHILDREN)## isa isa isa isa isa isa
## "GO:0001960" "GO:0001961" "GO:0010803" "GO:0060334" "GO:0060338" "GO:0070099"
## isa isa isa isa isa isa
## "GO:0070103" "GO:0070107" "GO:0070758" "GO:1900234" "GO:1902205" "GO:1902211"
## isa isa isa isa isa isa
## "GO:1902214" "GO:1902226" "GO:1903881" "GO:2000446" "GO:2000492" "GO:2000659"
GO.db3: a parquet-based approach
In pursuit of a proof of concept, packages GO.db2 (SQLite-based) and GO.db3 have been produced. GO.db2 uses a simplified schema for the tables; GO.db3 uses parquet representations of GO.db’s tables.
To avoid conflict with dplyr’s select, GO.db3 defines select3 to emulate AnnotationDbi::select.
library(GO.db3)
chk = select3("GO.db3",keys=GOID, columns=c("GOID", "TERM"), keytype="GOID")
head(chk)## GOID TERM
## 1 GO:0001960 negative regulation of cytokine-mediated signaling pathway
## 2 GO:0001961 positive regulation of cytokine-mediated signaling pathway
## 3 GO:0010803 regulation of tumor necrosis factor-mediated signaling pathway
## 4 GO:0060334 regulation of type II interferon-mediated signaling pathway
## 5 GO:0060338 regulation of type I interferon-mediated signaling pathway
## 6 GO:0070099 regulation of chemokine-mediated signaling pathway
We can learn about the underlying resources, in extdata/go323 and observe a sharp reduction in size of data footprint.
pfns = dir(system.file("extdata", "go323", package="GO.db3"), full=TRUE)
head(basename(pfns))## [1] "go_bp_offspring.parquet" "go_bp_parents.parquet"
## [3] "go_cc_offspring.parquet" "go_cc_parents.parquet"
## [5] "go_mf_offspring.parquet" "go_mf_parents.parquet"
## [1] 7329583
We wrap the hierarchy-oriented environments in functions for now. The environments are computed “on the fly”; only a few have been produced.
get("GO:0001960", GO.db3::GOBPPARENTS())## is_a is_a is_a
## "GO:0001959" "GO:0009968" "GO:0060761"
## negatively_regulates
## "GO:0019221"
ontoProc2 for GO: using Semantic SQL
The INCAtools semantic SQL project converts ontologies in OWL or OBO formats to SQLite. The README remarks that
SQLite provides many advantages
- files can be downloaded and subsequently queried without network latency
- compared to querying a static rdf, owl, or obo file, there is no startup/parse delay
- robust and performant
- excellent support in many languages
Although the focus is on SQLite, this library can also be used for other
DBMSs like PostgreSQL, MySQL, Oracle, etc
Here’s how we can use a Semantic SQL representation of GO with ontoProc2. The GOID vector was defined above.
library(ontoProc2)
gosem = retrieve_semsql_conn("go")
chk1 = dplyr::tbl(gosem, "statements") |>
dplyr::filter(subject %in% GOID, predicate=="rdfs:label") |>
dplyr::select(subject, value)
head(chk1)## # Source: SQL [?? x 2]
## # Database: sqlite 3.51.2 [/Users/vincentcarey/Library/Caches/org.R-project.R/R/BiocFileCache/40e293b372b_go.db]
## subject value
## <chr> <chr>
## 1 GO:0001960 negative regulation of cytokine-mediated signaling pathway
## 2 GO:0001961 positive regulation of cytokine-mediated signaling pathway
## 3 GO:0010803 regulation of tumor necrosis factor-mediated signaling pathway
## 4 GO:0060334 regulation of type II interferon-mediated signaling pathway
## 5 GO:0060338 regulation of type I interferon-mediated signaling pathway
## 6 GO:0070099 regulation of chemokine-mediated signaling pathway
Note that the full representation of all facts and relationships in GO entails a large footprint.
## [1] 1694437376
Upshots
- AnnotationDbi’s production of GO.db involves a complex pipeline with processes intermingled to address multiple objectives.
- AnnotationDbi’s “select” method is in conceptual conflict with the widely used select method of dplyr.
- GO.db3 is a near-complete functionally compatible replacement for GO.db.
- Production of GO.db3 is independent of all other annotation production processes; parquet production from OBO is documented in BiocGOprep.
-
GO.db3::select3is provided to emulateAnnotationDbi::select, but other interrogation interfaces based on the parquet representation should be devised. BiocGOprep produces parquet files following GO.db’s schema; this is not essential but was adopted to simplify producing compatibility with legacy approaches. - Both GO.db and GO.db3 present only a very limited view of the relationships encoded in Gene Ontology; a complete view is made available for SQL-based interrogation via ontoProc2.
- ontoProc2 helps provide full programmatic access to any ontology managed in the Semantic SQL project.
- It will be useful to work out exercises based on rols, to help understand the added value of specific curation through packaging.
org.*.eg.db: curating NCBI gene-oriented annotation
The org packages are built using a complex pipeline that has a number of shortcomings. We have examined its use of NCBI’s gene-oriented annotation and propose a different approach.
See RNCBIGene which is in development. Briefly, it is possible to
- create parquet representations of all gzipped text at NCBI’s ftp site,
- lodge these in egress-free cloud storage so that remote queries are supported,
- provide retrieval and caching support to support efficient local usage,
- avoid favoring any model organism for curated annotation support in Bioconductor.
Next steps
- Community members are invited to comment on implications of strategic choices on annotation curation for their work.
- GO.db3 should be brought to verifiable functional equivalence with GO.db.
- RNCBIGene should be enhanced to emulate those
org.*.dbfunctions that are in use in the ecosystem. - Other annotation elements should be looked at for opporunities to use more efficient representations and interfaces.
- It is of interest to note that there are no “first-class” representations of EFO or CL in Bioconductor. ontoProc2 and ontoProc address these ontologies, but are these sufficient?
Session information
## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.3
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ontoProc2_0.0.6 GO.db3_0.0.1 arrow_22.0.0.1
## [4] dplyr_1.2.0 GO.db_3.22.0 AnnotationDbi_1.72.0
## [7] IRanges_2.44.0 S4Vectors_0.48.0 Biobase_2.70.0
## [10] BiocGenerics_0.56.0 generics_0.1.4 BiocStyle_2.38.0
##
## loaded via a namespace (and not attached):
## [1] KEGGREST_1.50.0 xfun_0.56 bslib_0.10.0
## [4] httr2_1.2.2 htmlwidgets_1.6.4 ontologyPlot_1.7
## [7] vctrs_0.7.1 tools_4.5.2 curl_7.0.0
## [10] tibble_3.3.1 RSQLite_2.4.6 blob_1.3.0
## [13] R.oo_1.27.1 pkgconfig_2.0.3 dbplyr_2.5.1
## [16] desc_1.4.3 graph_1.88.1 assertthat_0.2.1
## [19] lifecycle_1.0.5 compiler_4.5.2 textshaping_1.0.4
## [22] Biostrings_2.78.0 Seqinfo_1.0.0 htmltools_0.5.9
## [25] sass_0.4.10 yaml_2.3.12 pkgdown_2.2.0
## [28] pillar_1.11.1 crayon_1.5.3 jquerylib_0.1.4
## [31] R.utils_2.13.0 cachem_1.1.0 tidyselect_1.2.1
## [34] digest_0.6.39 purrr_1.2.1 bookdown_0.46
## [37] paintmap_1.0 grid_4.5.2 fastmap_1.2.0
## [40] cli_3.6.5 magrittr_2.0.4 utf8_1.2.6
## [43] withr_3.0.2 filelock_1.0.3 rappdirs_0.3.4
## [46] bit64_4.6.0-1 rmarkdown_2.30 XVector_0.50.0
## [49] httr_1.4.7 bit_4.6.0 otel_0.2.0
## [52] R.methodsS3_1.8.2 ragg_1.5.0 png_0.1-8
## [55] memoise_2.0.1 evaluate_1.0.5 knitr_1.51
## [58] BiocFileCache_3.0.0 rlang_1.1.7 ontologyIndex_2.12
## [61] glue_1.8.0 DBI_1.2.3 Rgraphviz_2.55.001
## [64] BiocManager_1.30.27 jsonlite_2.0.0 R6_2.6.1
## [67] systemfonts_1.3.1 fs_1.6.6