Curation of Bioconductor package metadata, targeting EDAM ontology and ELIXIR bio.tools metadata schemas

Introduction

This vignette is derived almost entirely from collaborative code supplied by Anh Nguyet Vu of Sage Bionetworks. Its purpose is to illustrate how OpenAI-based transformation of package content can provide systematic organization and tagging of the material available for Bioconductor packages.

Code in this vignette runs only if the environment variable OPENAI_API_KEY is defined.
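As a minimal sketch, the key can be checked up front before any model call is attempted. The helper name has_openai_key is hypothetical and not part of biocEDAM; it wraps the same kind of Sys.getenv() test that guards the examples in this vignette.

```r
# Hypothetical helper (not part of biocEDAM): TRUE when the named
# environment variable is set to a non-empty string.
has_openai_key <- function(var = "OPENAI_API_KEY") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (!has_openai_key()) {
  message("OPENAI_API_KEY is not set; the examples in this vignette will be skipped.")
}
```

The variable can be defined for the current session with Sys.setenv(), or persistently in a user-level .Renviron file.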

Example 1: tximeta

We start by transforming vignette content, which may be in HTML or PDF. The approach is based on the structured data extraction examples given in a vignette for the ellmer package on CRAN. We prompt GPT-4o to produce a concise and objective summary of at most 450 words, which is placed in the focused component of the returned data (the code below reaches it as content$focus, relying on R's partial matching of list element names).

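# Run only when an OpenAI API key is available
if (nchar(Sys.getenv("OPENAI_API_KEY")) > 0) {
  library(biocEDAM)
  # Extract structured data from the rendered tximeta vignette
  content = vig2data("https://bioconductor.org/packages/release/bioc/vignettes/tximeta/inst/doc/tximeta.html")
  str(content)
  # Length of the summary, in characters
  nchar(content$focus)
}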

## List of 5
##  $ author    : chr [1:4] "Michael I. Love" "Charlotte Soneson" "Peter F. Hickey" "Rob Patro"
##  $ topics    : chr [1:5] "RNA-seq" "transcriptomics" "data import" "bioconductor" ...
##  $ focused   : chr "The tximeta package is designed to extend the functionality of the tximport package for importing transcript-le"| __truncated__
##  $ coherence : int 97
##  $ persuasion: num 0.95

## [1] 2457
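The ellmer-style structured extraction that this step builds on can be sketched as follows. This is an illustration of the general pattern under stated assumptions, not the implementation of vig2data: the function name extract_summary and the field descriptions are hypothetical, the field names simply mirror the str() output above, and the extraction method is assumed to be chat_structured(), as in recent ellmer releases (earlier releases named it extract_data()).

```r
# Sketch only: assumes the ellmer package is installed and OPENAI_API_KEY is set.
# extract_summary() is a hypothetical name, not a biocEDAM function.
extract_summary <- function(text, model = "gpt-4o") {
  # Schema describing the fields the model is asked to return
  summary_type <- ellmer::type_object(
    .description = "Summary of a package vignette.",
    author = ellmer::type_array(
      description = "Names of the vignette authors",
      items = ellmer::type_string()
    ),
    topics = ellmer::type_array(
      description = "Specific topics covered by the vignette",
      items = ellmer::type_string()
    ),
    focused = ellmer::type_string(
      "Concise, objective summary of at most 450 words"
    ),
    coherence = ellmer::type_integer("Coherence of the key points, 0-100"),
    persuasion = ellmer::type_number("Persuasion score, 0.0-1.0")
  )
  chat <- ellmer::chat_openai(model = model)
  # Returns a named list whose elements follow summary_type
  chat$chat_structured(text, type = summary_type)
}
```

Given a character string holding the vignette text, extract_summary(text) would return a list with the same five elements shown above. Declaring the schema up front is what lets the result be handled as ordinary R data instead of free-form model text.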

We then use schema-driven inference to produce associated EDAM tags; see the code in inst/curbioc in the package source.

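if (nchar(Sys.getenv("OPENAI_API_KEY")) > 0) {
  # Show the first 250 characters of the summary; print() is needed
  # because values inside a braced block are not auto-printed
  print(substr(content$focus, 1, 250))
  # Map the summary to EDAM terms and display them as a table
  ans = edamize(content$focus)
  DT::datatable(mkdf(ans))
}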

## Loading required namespace: reticulate

## Success after 0 attempts

Example 2: MSnbase

if (nchar(Sys.getenv("OPENAI_API_KEY")) > 0) {
  # Summarize the MSnbase development vignette
  mm = vig2data("https://bioconductor.org/packages/release/bioc/vignettes/MSnbase/inst/doc/v05-MSnbase-development.html")
  # Map the summary to EDAM terms
  uu = edamize(mm$focus)
  if (is.null(uu)) uu = edamize(mm$focus)  # second try if the first returns NULL
  DT::datatable(mkdf(uu))
}

## Using model = "gpt-4o".

## JSON not valid, trying QC/correction prompt, attempt 1
## Success after 1 attempts
## Success after 0 attempts

Caveats

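Sometimes a call returns no result (for example, edamize may return NULL). This reflects nondeterminism in the GPT environment we are using. Often a second try will succeed, which is why Example 2 calls edamize again when the first result is NULL. If you have persistent trouble, please file an issue.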
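The "second try" idea generalizes to a small retry wrapper. The sketch below is written in base R under stated assumptions: retry_until_result and flaky are hypothetical names, not biocEDAM functions, and an error is treated the same as an empty result.

```r
# Hypothetical helper (not part of biocEDAM): call f() until it returns a
# non-NULL value or max_tries is reached; returns NULL if every try fails.
retry_until_result <- function(f, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    res <- tryCatch(f(), error = function(e) NULL)  # treat errors as "no result"
    if (!is.null(res)) return(res)
    message("No result on attempt ", i, " of ", max_tries)
  }
  NULL
}

# Stand-in for an unreliable call: returns NULL twice, then a value
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) NULL else "ok"
}
retry_until_result(flaky)
```

With biocEDAM loaded and a key defined, the same wrapper could be applied as uu = retry_until_result(function() edamize(mm$focus)). Capping the number of tries keeps a persistent failure from looping indefinitely and from consuming API calls without bound.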

Vincent J. Carey, stvjc at channing.harvard.edu

March 18, 2025
