Curation of Bioconductor package metadata, targeting EDAM ontology and ELIXIR bio.tools metadata schemas
Vincent J. Carey, stvjc at channing.harvard.edu
March 18, 2025
Source: vignettes/curate.Rmd
Introduction
This vignette is derived almost entirely from collaborative code supplied by Anh Nguyet Vu of Sage Bionetworks. The purpose is to illustrate the use of OpenAI API calls to systematically organize and tag content available for Bioconductor packages.
Code in this vignette requires that the environment variable OPENAI_API_KEY be defined.
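One way to confirm this requirement at the start of a session is sketched below, using base R only. The helper name `have_openai_key` is hypothetical and is not part of biocEDAM.

```r
# Hypothetical helper (not part of biocEDAM):
# TRUE when OPENAI_API_KEY is set to a non-empty value.
have_openai_key <- function() {
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

# The key is normally set outside of R, for example in ~/.Renviron:
#   OPENAI_API_KEY=<your key>
# It can also be set for the current session only:
#   Sys.setenv(OPENAI_API_KEY = "<your key>")

if (!have_openai_key()) {
  message("OPENAI_API_KEY is not set; the examples in this vignette will be skipped.")
}
```

The code chunks in this vignette perform the same check inline, so they do nothing when the key is absent.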
Example 1: tximeta
We start by transforming vignette content, which may be in HTML or PDF, following the structured data extraction examples given in a vignette for the ellmer package on CRAN. We prompt GPT-4o to produce a concise and objective summary of at most 450 words, which is returned in the focused component of the result (accessed in the code as content$focus, which works through R's partial matching of list names).
if (nchar(Sys.getenv("OPENAI_API_KEY")) > 0) {
  library(biocEDAM)
  # extract structured data from the tximeta vignette
  content = vig2data("https://bioconductor.org/packages/release/bioc/vignettes/tximeta/inst/doc/tximeta.html")
  str(content)
  # length of the summary in characters
  nchar(content$focus)
}
## List of 5
## $ author : chr [1:4] "Michael I. Love" "Charlotte Soneson" "Peter F. Hickey" "Rob Patro"
## $ topics : chr [1:5] "RNA-seq" "transcriptomics" "data import" "bioconductor" ...
## $ focused : chr "The tximeta package is designed to extend the functionality of the tximport package for importing transcript-le"| __truncated__
## $ coherence : int 97
## $ persuasion: num 0.95
## [1] 2457
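Because the model is asked for a summary of at most 450 words, it can be useful to check the length of the returned text in words rather than characters. The sketch below uses a stand-in list (`mock`, with made-up values) in place of real `vig2data()` output, so it runs without an API key; `count_words` is a hypothetical helper. It also illustrates why `content$focus` works even though `str()` reports the element name as `focused`: the `$` operator partially matches list names, whereas `[[` requires an exact match by default.

```r
# Stand-in for vig2data() output; the values are made up for illustration.
mock <- list(
  author  = c("A. Author", "B. Author"),
  focused = "The package imports transcript-level quantification data."
)

# Hypothetical helper: count whitespace-separated words in a single string.
count_words <- function(x) {
  lengths(strsplit(trimws(x), "[[:space:]]+"))
}

# `$` partially matches names, so `focus` resolves to `focused`.
summary_text <- mock$focus

# `[[` matches exactly by default, so the abbreviated name yields NULL.
exact_lookup <- mock[["focus"]]

# Check the word budget stated in the prompt.
n_words <- count_words(summary_text)
within_limit <- n_words <= 450
```

Using the full element name, or `[[` with the exact name, avoids surprises if a future version of the returned list gains another element whose name starts with the same prefix.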
We then use schema-driven inference to produce associated EDAM tags; see the code in inst/curbioc in the package source.
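if (nchar(Sys.getenv("OPENAI_API_KEY")) > 0) {
  # first 250 characters of the summary; wrap in print() to display it,
  # since only the last expression in the block is auto-printed
  substr(content$focus, 1, 250)
  # produce EDAM tags for the summary and show them as an interactive table
  ans = edamize(content$focus)
  DT::datatable(mkdf(ans))
}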
## Loading required namespace: reticulate
## Success after 0 attempts
Example 2: MSnbase
if (nchar(Sys.getenv("OPENAI_API_KEY")) > 0) {
  # summarize an MSnbase vignette, then derive EDAM tags from the summary
  mm = vig2data("https://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/v05-MSnbase-development.html")
  uu = edamize(mm$focus)
  # retry once if the first call returned NULL
  if (is.null(uu)) uu = edamize(mm$focus)
  DT::datatable(mkdf(uu))
}
## Using model = "gpt-4o".
## JSON not valid, trying QC/correction prompt, attempt 1
## Success after 1 attempts
## Success after 0 attempts
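The explicit second call to `edamize()` in this example guards against a `NULL` result. The same idea can be generalized to a bounded number of attempts. The following is a minimal sketch and not part of biocEDAM: `retry_non_null` is a hypothetical helper, and `flaky` is a stand-in for a call such as `edamize(mm$focus)` so that the example runs without an API key.

```r
# Hypothetical helper: call f() until it returns a non-NULL value,
# giving up (and returning NULL) after max_tries attempts.
retry_non_null <- function(f, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    result <- f()
    if (!is.null(result)) return(result)
  }
  NULL
}

# Stand-in for an unreliable call: NULL on the first call, a value afterwards.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 2) NULL else "ok"
}

res <- retry_non_null(flaky, max_tries = 3)
```

With a real call, this might be used as `retry_non_null(function() edamize(mm$focus))`. Bounding the number of attempts keeps API usage predictable when a request repeatedly fails.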