Skip to contents

Introduction

biocViews

The biocViews package collects and organizes terms for tagging resources in the Bioconductor ecosystem for genomic data science. As of November 2023 there are 497 terms defining classes of resources in the project. Example terms are “Organism”, “BiologicalQuestion”, “Sequencing”, “MicroarrayData”. Contributors and core members assign tags from this vocabulary to software packages, data resources, and workflows that are managed and distributed by the project.

BiocPkgTools is a package managing functions that interrogate aspects of the ecosystem. We obtain a table of all software packages and examine the views:

library(BiocPkgTools)
bl = biocPkgList(repo="BioCsoft")
library(dplyr)
s1 = bl |> select(Package, biocViews)
s1$tags = sapply(s1$biocViews, paste, collapse=":")
s1 = s1 |> select(Package, tags)
set.seed(1234)
s1[sample(seq_len(nrow(s1)), 10),]
## # A tibble: 10 × 2
##    Package               tags                                                   
##    <chr>                 <chr>                                                  
##  1 INPower               Software:AssayDomain:SNP                               
##  2 enhancerHomologSearch Software:Technology:BiologicalQuestion:WorkflowStep:Se…
##  3 HiCool                Software:Technology:Sequencing:BiologicalQuestion:Infr…
##  4 combi                 Software:ResearchField:StatisticalMethod:Technology:Se…
##  5 VariantTools          Software:ResearchField:AssayDomain:Technology:Genetics…
##  6 RCy3                  Software:WorkflowStep:StatisticalMethod:Infrastructure…
##  7 MADSEQ                Software:BiologicalQuestion:StatisticalMethod:AssayDom…
##  8 CatsCradle            Software:AssayDomain:Technology:ResearchField:Biologic…
##  9 mCSEA                 Software:ResearchField:BiologicalQuestion:AssayDomain:…
## 10 tidyFlowCore          Software:Technology:SingleCell:FlowCytometry:Infrastru…

EDAM

At edamontology.org, EDAM is described as “a comprehensive ontology of well-established, familiar concepts that are prevalent within bioscientific data analysis and data management (including computational biology, bioinformatics, and bioimage informatics). EDAM includes topics, operations, types of data and data identifiers, and data formats, relevant in data analysis and data management in life sciences.”

With a devel version of ontoProc, we ingest and sample from the EDAM ontology:

library(ontoProc)
epath = owl2cache(url="https://edamontology.org/EDAM_1.25.owl")
edam = setup_entities(epath)
set.seed(1234)
sample(labels(edam), 15)
##                                               topic_3370 
##                                   "Analytical chemistry" 
##                                               topic_2258 
##                                        "Cheminformatics" 
##                                               topic_0618 
##                                                 "Scents" 
##                                                data_2190 
##                                      "Sequence checksum" 
##                                                data_1446 
##                           "Comparison matrix (integers)" 
##                                               topic_3301 
##                                           "Microbiology" 
##                                           operation_1822 
## "Protein residue surface calculation (vacuum molecular)" 
##                                                data_2028 
##                                      "Experimental data" 
##                                                data_1141 
##                                             "TIGRFam ID" 
##                                           operation_3457 
##                               "Single particle analysis" 
##                                              format_3816 
##                                                   "Mol2" 
##                                               topic_0624 
##                                            "Chromosomes" 
##                                           operation_2461 
##                    "Protein residue surface calculation" 
##                                              format_2076 
##                         "RNA secondary structure format" 
##                                           operation_0298 
##                              "Profile-profile alignment"

The main organizing categories in EDAM are “data”, “format”, “operation” and “topic”.

A preliminary comparison of the vocabularies

The Pypi package text2term was used to measure similarity between terms available in EDAM and terms of biocViews. The biocEDAM package includes a table of results, that we filter here for scores exceeding 0.8.

library(biocEDAM)
data(allmap)
ndf = allmap |> filter(`Mapping Score`>.8) |> select(`Source Term`, 
              `Mapped Term Label`, `Mapping Score`) |> as.data.frame()
library(DT)
datatable(ndf)

Similar programming can be used to examine biocViews terms with low maximum scores when matched against EDAM. These could indicate vocabulary gaps to be filled in EDAM, or could suggest alternative tagging methodology.

For example, biocViews includes “ExomeSeq”. This achieved scores of .70, .50, .39 for EDAM terms Exome sequencing, Exome assembly, and “geneseq” respectively. There are 9 software packages in Bioconductor 3.18 annotated to ExomeSeq. Dissection of their contents and additional views terms will be helpful for understanding the process needed to bridge EDAM to Bioconductor for improved discoverability of packages and data.