Performant and standard representations of gene annotation for all organisms cataloged by NCBI
Purposes
Simplified, unified annotation for all organisms addressed by NCBI
The org.*.*.db packages are powerful and reliable but have a complex stack of schemata and scripts for generating organism-specific packages.
This package works on the basis of a one-line transformation of compressed text from NCBI to parquet. The parquet files are stored in an NSF Open Storage Network bucket.
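The text-to-parquet step can be illustrated with a minimal sketch. This is not the package's actual conversion script (see inst/scripts for that); it assumes the arrow package is installed and uses a tiny toy file in place of a real NCBI download. The object names `txt`, `toy`, `pq` and `back` are illustrative only.

```r
library(arrow)

# Toy stand-in for a compressed, tab-delimited NCBI table (three columns only).
txt <- tempfile(fileext = ".gz")
con <- gzfile(txt, "w")
writeLines(c("#tax_id\tGeneID\tSymbol",
             "9606\t1\tA1BG",
             "9606\t2\tA2M"), con)
close(con)

# Read the gzipped text; check.names = FALSE keeps the "#tax_id" header intact.
toy <- read.delim(gzfile(txt), check.names = FALSE, quote = "",
                  stringsAsFactors = FALSE)

# Write the table as parquet, then read it back to confirm the round trip.
pq <- tempfile(fileext = ".parquet")
write_parquet(toy, pq)
back <- read_parquet(pq)
names(back)
```

For the full NCBI files a streaming reader would likely be preferable to loading the whole table into memory first; the sketch only shows the shape of the transformation.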
The geneFromCache function retrieves and caches the parquet files.
> example(geneFromCache)
gnFrmC> gi = geneFromCache("gene_info.parquet")
gnFrmC> arrow::open_dataset(gi) |> dplyr::filter(`#tax_id`==9606) |> head() |> dplyr::collect()
# A tibble: 6 × 16
`#tax_id` GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location
<int> <int> <chr> <chr> <chr> <chr> <chr> <chr>
1 9606 1 A1BG - A1B|ABG|GAB|… MIM:13… 19 19q13.43
2 9606 2 A2M - A2MD|CPAMD5|… MIM:10… 12 12p13.31
3 9606 3 A2MP1 - A2MP HGNC:H… 12 12p13.31
4 9606 9 NAT1 - AAC1|MNAT|NA… MIM:10… 8 8p22
5 9606 10 NAT2 - AAC2|NAT-2|P… MIM:61… 8 8p22
6 9606 11 NATP - AACP|NATP1 HGNC:H… 8 8p22
# ℹ 8 more variables: description <chr>, type_of_gene <chr>,
# Symbol_from_nomenclature_authority <chr>,
# Full_name_from_nomenclature_authority <chr>, Nomenclature_status <chr>,
# Other_designations <chr>, Modification_date <int>, Feature_type <chr>
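The query pattern used in the example above can be tried without downloading anything. The following sketch, assuming arrow and dplyr are installed, writes a small toy parquet file and runs the same open_dataset/filter/collect sequence against it. The third row is a made-up placeholder, not a real gene record.

```r
library(arrow)
library(dplyr)

# Toy table; the last row is a hypothetical non-human placeholder.
toy <- data.frame(`#tax_id` = c(9606L, 9606L, 10090L),
                  GeneID = c(1L, 2L, 999999L),
                  Symbol = c("A1BG", "A2M", "toyGene"),
                  check.names = FALSE)
pq <- tempfile(fileext = ".parquet")
write_parquet(toy, pq)

# Filtering and column selection are evaluated lazily by arrow;
# data are only materialized in R by collect().
human <- open_dataset(pq) |>
  filter(`#tax_id` == 9606) |>
  select(GeneID, Symbol) |>
  collect()
human
```

Replacing `pq` with the path returned by geneFromCache should give the same workflow on the real data.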
Caching the resources involves transferring about 6GB of parquet; if that is too burdensome, remote queries can be executed for specific needs:
> x = remote_gene_query(gres="gene_info", qual='where "#tax_id" = 9606 limit 10')
> x
# A tibble: 10 × 16
`#tax_id` GeneID Symbol LocusTag Synonyms dbXrefs chromosome map_location
<dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 9606 1 A1BG - A1B|ABG|G… MIM:13… 19 19q13.43
2 9606 2 A2M - A2MD|CPAM… MIM:10… 12 12p13.31
3 9606 3 A2MP1 - A2MP HGNC:H… 12 12p13.31
4 9606 9 NAT1 - AAC1|MNAT… MIM:10… 8 8p22
5 9606 10 NAT2 - AAC2|NAT-… MIM:61… 8 8p22
6 9606 11 NATP - AACP|NATP1 HGNC:H… 8 8p22
7 9606 12 SERPINA3 - AACT|ACT|… MIM:10… 14 14q32.13
8 9606 13 AADAC - CES5A1|DAC MIM:60… 3 3q25.1
9 9606 14 AAMP - - MIM:60… 2 2q35
10 9606 15 AANAT - DSPS|SNAT MIM:60… 17 17q25.1
# ℹ 8 more variables: description <chr>, type_of_gene <chr>,
# Symbol_from_nomenclature_authority <chr>,
# Full_name_from_nomenclature_authority <chr>, Nomenclature_status <chr>,
# Other_designations <chr>, Modification_date <dbl>, Feature_type <chr>
Annotation in a tidyverse style
By retaining the “flat file” model of the original all-organism annotation content at NCBI, we get more direct access to annotation mappings in tidyverse-style programming. As an example, given TCGA expression data annotated with gene symbols, we can add the Entrez Gene IDs as follows.
> library(curatedTCGAData)
> suppressMessages({
+ gbt = curatedTCGAData(diseaseCode="GBM",
+ assays="RNASeq2GeneNorm", version="2.0.1", dry.run=FALSE,
+ verbose=FALSE)
+ })
> gbtr = experiments(gbt)[[1]]
> library(tidyomics)
> library(RNCBIGene)
> gbtr[1:10,]
# A SummarizedExperiment-tibble abstraction: 1,660 × 2
# Features=10 | Samples=166 | Assays=
.feature .sample
<chr> <chr>
1 A1BG TCGA-02-0047-01A-01R-1849-01
2 A1CF TCGA-02-0047-01A-01R-1849-01
3 A2BP1 TCGA-02-0047-01A-01R-1849-01
4 A2LD1 TCGA-02-0047-01A-01R-1849-01
5 A2ML1 TCGA-02-0047-01A-01R-1849-01
6 A2M TCGA-02-0047-01A-01R-1849-01
7 A4GALT TCGA-02-0047-01A-01R-1849-01
8 A4GNT TCGA-02-0047-01A-01R-1849-01
9 AAA1 TCGA-02-0047-01A-01R-1849-01
10 AAAS TCGA-02-0047-01A-01R-1849-01
# ℹ 190 more rows
# ℹ Use `print(n = ...)` to see more rows
> gbtr2 = gbtr[1:1000,] |>
+ mutate(entrez = mapIdsNG(keys=.feature)$GeneID)
> gbtr2[1:10,]
# A SummarizedExperiment-tibble abstraction: 1,660 × 3
# Features=10 | Samples=166 | Assays=
.feature .sample entrez
<chr> <chr> <dbl>
1 A1BG TCGA-02-0047-01A-01R-1849-01 1
2 A1CF TCGA-02-0047-01A-01R-1849-01 29974
3 A2BP1 TCGA-02-0047-01A-01R-1849-01 NA
4 A2LD1 TCGA-02-0047-01A-01R-1849-01 NA
5 A2ML1 TCGA-02-0047-01A-01R-1849-01 144568
6 A2M TCGA-02-0047-01A-01R-1849-01 2
7 A4GALT TCGA-02-0047-01A-01R-1849-01 53947
8 A4GNT TCGA-02-0047-01A-01R-1849-01 51146
9 AAA1 TCGA-02-0047-01A-01R-1849-01 100329167
10 AAAS TCGA-02-0047-01A-01R-1849-01 8086
# ℹ 190 more rows
# ℹ Use `print(n = ...)` to see more rows
The example above is restricted to the first 1000 features because an unrestricted attempt fails, for reasons that are not yet understood.
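The NA entries in the output above correspond to symbols that find no match in the annotation table. A minimal sketch of that behavior, using a plain dplyr join on toy tables, is given here. It is not a description of how mapIdsNG is implemented; `gene_info_toy`, `features` and `mapped` are illustrative names, and the symbol and GeneID values are taken from the outputs shown earlier.

```r
library(dplyr)

# Small symbol-to-GeneID lookup, values as displayed in the gene_info output.
gene_info_toy <- tibble(Symbol = c("A1BG", "A2M", "NAT1"),
                        GeneID = c(1L, 2L, 9L))

# Feature symbols as they might appear in an expression data set.
features <- tibble(.feature = c("A1BG", "A2BP1", "A2M"))

# A left join keeps every feature; symbols absent from the lookup get NA.
mapped <- left_join(features, gene_info_toy,
                    by = c(".feature" = "Symbol"))
mapped
```

Symbols that have been retired or renamed since the expression data were produced would need a separate lookup, for example against the Synonyms column.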
Available resources
Contents of https://mghp.osn.xsede.org/bir190004-bucket01/BiocParquetNCBI: NCBI files retrieved on 22 Feb 2025 and then transformed to parquet (see inst/scripts for demonstration code). Each line gives the file size in bytes, the modification time, and the file name:
3002338275 2025-05-17 09:28:29.618223689 gene2accession.parquet
680657744 2025-05-15 10:31:33.209868982 gene2go.parquet
89080561 2025-05-15 10:37:45.825464461 gene2pubmed.parquet
1467468877 2025-05-17 09:28:38.701230380 gene2refseq.parquet
965232147 2025-05-15 10:33:17.845775772 gene_info.parquet
43279811 2025-05-17 09:28:42.882232368 gene_orthologs.parquet
1019413239 2025-05-17 09:28:47.416233993 gene_refseq_uniprotkb_collab.parquet
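To judge how much would be transferred when caching everything, the sizes in the listing above can be summed. The sketch below simply copies the byte counts from the listing into a named vector (`sizes` is an illustrative name); the values are stored as doubles because some exceed R's integer range.

```r
# Byte counts copied from the listing above.
sizes <- c(gene2accession.parquet = 3002338275,
           gene2go.parquet = 680657744,
           gene2pubmed.parquet = 89080561,
           gene2refseq.parquet = 1467468877,
           gene_info.parquet = 965232147,
           gene_orthologs.parquet = 43279811,
           gene_refseq_uniprotkb_collab.parquet = 1019413239)

total <- sum(sizes)
total_gib <- total / 2^30   # size in GiB (binary gigabytes)
round(total_gib, 2)

# Largest files first, to decide which ones are worth caching.
sort(sizes, decreasing = TRUE)
```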