biocEDAM: ontology for a genomic data science ecosystem
Vincent J. Carey, stvjc at channing.harvard.edu
November 07, 2024
Source: vignettes/biocEDAM.Rmd
Introduction
biocViews
The biocViews package collects and organizes terms for tagging resources in the Bioconductor ecosystem for genomic data science. As of November 2023 there are 497 terms defining classes of resources in the project. Example terms are “Organism”, “BiologicalQuestion”, “Sequencing”, “MicroarrayData”. Contributors and core members assign tags from this vocabulary to software packages, data resources, and workflows that are managed and distributed by the project.
The BiocPkgTools package provides functions that interrogate aspects of the ecosystem. We use it to obtain a table of all software packages and examine the views assigned to each:
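library(BiocPkgTools)
# Metadata for all Bioconductor software packages
bl = biocPkgList(repo="BioCsoft")
library(dplyr)
s1 = bl |> select(Package, biocViews)
# biocViews is a list column; collapse each package's terms into one string
s1$tags = sapply(s1$biocViews, paste, collapse=":")
s1 = s1 |> select(Package, tags)
# Show a reproducible random sample of 10 packages
set.seed(1234)
s1[sample(seq_len(nrow(s1)), 10),]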
## # A tibble: 10 × 2
##    Package               tags
##    <chr>                 <chr>
##  1 INPower               Software:AssayDomain:SNP
##  2 enhancerHomologSearch Software:Technology:BiologicalQuestion:WorkflowStep:Se…
##  3 HiCool                Software:Technology:Sequencing:BiologicalQuestion:Infr…
##  4 combi                 Software:ResearchField:StatisticalMethod:Technology:Se…
##  5 VariantTools          Software:ResearchField:AssayDomain:Technology:Genetics…
##  6 RCy3                  Software:WorkflowStep:StatisticalMethod:Infrastructure…
##  7 MADSEQ                Software:BiologicalQuestion:StatisticalMethod:AssayDom…
##  8 CatsCradle            Software:AssayDomain:Technology:ResearchField:Biologic…
##  9 mCSEA                 Software:ResearchField:BiologicalQuestion:AssayDomain:…
## 10 tidyFlowCore          Software:Technology:SingleCell:FlowCytometry:Infrastru…
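Because each entry of `tags` is a colon-delimited string, tag frequencies can be tabulated with a few lines of base R. The following is a minimal sketch on a small stand-in data frame rather than the real `s1`: the INPower row is copied from the output above, while the packages `pkgA` and `pkgB` and their tags are hypothetical.

```r
# Stand-in for the s1 table built above; only the INPower row comes from
# the real output, pkgA and pkgB are hypothetical illustrations.
toy = data.frame(
  Package = c("INPower", "pkgA", "pkgB"),
  tags = c("Software:AssayDomain:SNP",
           "Software:Technology:Sequencing",
           "Software:AssayDomain:Sequencing"),
  stringsAsFactors = FALSE
)

# Split each colon-delimited string into its individual terms
terms_by_pkg = strsplit(toy$tags, ":", fixed = TRUE)

# Count how many packages carry each term, most frequent first
tag_counts = sort(table(unlist(terms_by_pkg)), decreasing = TRUE)
tag_counts
```

Substituting `s1$tags` for `toy$tags` should give the corresponding counts for the full software collection.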
EDAM
At edamontology.org, EDAM is described as “a comprehensive ontology of well-established, familiar concepts that are prevalent within bioscientific data analysis and data management (including computational biology, bioinformatics, and bioimage informatics). EDAM includes topics, operations, types of data and data identifiers, and data formats, relevant in data analysis and data management in life sciences.”
Using a development (devel) version of ontoProc, we ingest the EDAM ontology and sample from its term labels:
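library(ontoProc)
# Retrieve the EDAM 1.25 OWL file into a local cache
epath = owl2cache(url="https://edamontology.org/EDAM_1.25.owl")
# Parse the cached OWL file into an ontology object
edam = setup_entities(epath)
# Show a reproducible random sample of 15 term labels
set.seed(1234)
sample(labels(edam), 15)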
## topic_3370
## "Analytical chemistry"
## topic_2258
## "Cheminformatics"
## topic_0618
## "Scents"
## data_2190
## "Sequence checksum"
## data_1446
## "Comparison matrix (integers)"
## topic_3301
## "Microbiology"
## operation_1822
## "Protein residue surface calculation (vacuum molecular)"
## data_2028
## "Experimental data"
## data_1141
## "TIGRFam ID"
## operation_3457
## "Single particle analysis"
## format_3816
## "Mol2"
## topic_0624
## "Chromosomes"
## operation_2461
## "Protein residue surface calculation"
## format_2076
## "RNA secondary structure format"
## operation_0298
## "Profile-profile alignment"
The main organizing categories in EDAM are “data”, “format”, “operation” and “topic”.
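The category of a term can be read off the prefix of its identifier. As a small illustration, the sketch below tabulates the 15 identifiers from the sample printed above by category. The printed sample suggests that the names of `labels(edam)` carry these identifiers, so `names(labels(edam))` could presumably be substituted for `ids`; that is an assumption, not something verified here.

```r
# Identifiers copied from the sample printed above
ids = c("topic_3370", "topic_2258", "topic_0618", "data_2190", "data_1446",
        "topic_3301", "operation_1822", "data_2028", "data_1141",
        "operation_3457", "format_3816", "topic_0624", "operation_2461",
        "format_2076", "operation_0298")

# The category is whatever precedes the first underscore
category = sub("_.*$", "", ids)

# Tabulate the sample by category
category_counts = table(category)
category_counts
```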
A preliminary comparison of the vocabularies
The PyPI package text2term was used to measure similarity between terms available in EDAM and terms of biocViews. The biocEDAM package includes a table of the results, which we filter here for mapping scores exceeding 0.8.
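library(biocEDAM)
data(allmap)
# Keep only high-scoring matches and the columns of interest
ndf = allmap |> filter(`Mapping Score` > 0.8) |>
  select(`Source Term`, `Mapped Term Label`, `Mapping Score`) |>
  as.data.frame()
library(DT)
# Render the filtered table as an interactive widget
datatable(ndf)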
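To give a feel for what a similarity score between two term strings can look like, here is a minimal sketch based on Levenshtein edit distance. This is explicitly not the scoring method used by text2term, and its values should not be expected to agree with the `Mapping Score` column; `term_similarity` is a hypothetical helper defined only for this illustration.

```r
# Illustrative similarity in [0, 1] based on Levenshtein edit distance.
# NOTE: this is NOT the scoring method used by text2term; term_similarity
# is a hypothetical helper defined only for this example.
term_similarity = function(a, b) {
  a = tolower(a)
  b = tolower(b)
  d = utils::adist(a, b)[1, 1]        # number of single-character edits
  1 - d / max(nchar(a), nchar(b))     # 1 means identical strings
}

# Identical up to case
term_similarity("Sequencing", "sequencing")

# A biocViews term against an EDAM label
term_similarity("ExomeSeq", "Exome sequencing")
```

A purely character-based measure like this penalizes abbreviations heavily, which is one reason a dedicated mapping tool is preferable for comparing vocabularies.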
Similar programming can be used to examine biocViews terms with low maximum scores when matched against EDAM. These could indicate vocabulary gaps to be filled in EDAM, or could suggest alternative tagging methodology.
For example, biocViews includes “ExomeSeq”. This term achieved scores of 0.70, 0.50, and 0.39 against the EDAM terms “Exome sequencing”, “Exome assembly”, and “geneseq”, respectively. There are 9 software packages in Bioconductor 3.18 annotated to ExomeSeq. Dissecting their contents and their additional views terms will help clarify the process needed to bridge EDAM to Bioconductor for improved discoverability of packages and data.
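The search for biocViews terms with low maximum scores described above can be sketched as follows. The example runs on a toy stand-in for `allmap` that reuses its column names: the three ExomeSeq rows carry the scores quoted in the text, and the Sequencing row, including its score of 0.99, is a hypothetical high-scoring match added for contrast.

```r
# Toy stand-in for allmap. The ExomeSeq rows use the scores quoted in the
# text; the Sequencing row and its score are hypothetical.
toy_map = data.frame(
  `Source Term` = c("ExomeSeq", "ExomeSeq", "ExomeSeq", "Sequencing"),
  `Mapped Term Label` = c("Exome sequencing", "Exome assembly",
                          "geneseq", "Sequencing"),
  `Mapping Score` = c(0.70, 0.50, 0.39, 0.99),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Best score achieved by each biocViews term across its candidate matches
best = tapply(toy_map[["Mapping Score"]], toy_map[["Source Term"]], max)

# Terms whose best match does not exceed the 0.8 threshold used earlier
low_terms = names(best)[best <= 0.8]
low_terms
```

Running the same steps on `allmap` itself should list the biocViews terms that lack a close EDAM counterpart under this threshold.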