Use the extract_data facility described in ellmer's documentation to obtain summary information about textual content. Originally tailored to Bioconductor vignettes, it is now generalized to handle any PDF, HTML, or plain-text document available at a URL.

Usage

vig2data(
  url = "https://bioconductor.org/packages/release/bioc/html/Voyager.html",
  maxnchar = 30000,
  n_pdf_pages = 10,
  model = "gpt-4o",
  ...
)

Arguments

url: character(1) URL of the document to summarize, for example an HTML Bioconductor vignette
maxnchar: numeric(1) maximum number of characters to use; longer text is truncated to a substring of this length
n_pdf_pages: numeric(1) maximum number of pages from which to extract text when the URL points to a PDF
model: character(1) name of the model to use with chat_openai; defaults to "gpt-4o"
...: additional arguments passed to chat_openai
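
The effect of maxnchar can be illustrated with a minimal base R sketch. The helper name truncate_text is hypothetical and is not part of this package; it only assumes that truncation keeps the leading maxnchar characters, as the argument description suggests.

```r
# Hypothetical helper (not part of the package): keep at most `maxnchar`
# leading characters of a single string.
truncate_text <- function(txt, maxnchar = 30000) {
  stopifnot(is.character(txt), length(txt) == 1L)
  # substr() returns the whole string when it is shorter than `maxnchar`
  substr(txt, 1L, maxnchar)
}

long_txt <- paste(rep("a", 50), collapse = "")
short_txt <- truncate_text(long_txt, maxnchar = 20)
nchar(short_txt)
```

A smaller maxnchar reduces the amount of text sent to the model, at the risk of dropping content near the end of the document.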

Value

a list with five components: author (character vector), topics (character vector), focused (character(1)), coherence (integer), and persuasion (numeric)
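
A quick way to confirm that a result has the shape described above is sketched below in base R. The function check_vig_result is hypothetical, and the mock list is hand-built from the first few values shown in the example output on this page rather than produced by a real call.

```r
# Hypothetical checker (not part of the package): verify component names
# and basic types of a vig2data()-style result.
check_vig_result <- function(x) {
  expected <- c("author", "topics", "focused", "coherence", "persuasion")
  stopifnot(is.list(x), all(expected %in% names(x)))
  stopifnot(
    is.character(x$author),
    is.character(x$topics),
    is.character(x$focused), length(x$focused) == 1L,
    is.numeric(x$coherence),   # integers are also numeric in R
    is.numeric(x$persuasion)
  )
  invisible(TRUE)
}

# Mock result for illustration only; values abbreviated from the example output.
mock <- list(
  author = c("Lambda Moses", "Alik Huseynov"),
  topics = c("S4 class", "spatial single-cell genomics"),
  focused = "Short mock summary.",
  coherence = 95L,
  persuasion = 0.85
)
ok <- check_vig_result(mock)
```

Because model output can vary between calls, a check of this kind may be useful before passing results to downstream code.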

Note

Based on code from https://cran.r-project.org/web/packages/ellmer/vignettes/structured-data.html as of March 15, 2025. Requires that OPENAI_API_KEY be set in the environment.
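
Whether the key is visible to the R session can be checked beforehand with base R. This is only a sketch: has_openai_key is a hypothetical helper, and whether vig2data performs a similar check internally is not stated here.

```r
# Hypothetical pre-flight check (not part of the package).
has_openai_key <- function() {
  # Sys.getenv() returns "" for an unset variable, so nzchar() is FALSE
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

if (!has_openai_key()) {
  message("OPENAI_API_KEY is not set; set it (for example in ~/.Renviron) ",
          "before calling vig2data().")
}
```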

Examples

if (interactive()) {
  # be sure OPENAI_API_KEY is available to Sys.getenv
  tst = vig2data()
  str(tst)
}
#> List of 5
#>  $ author    : chr [1:5] "Lambda Moses" "Alik Huseynov" "Kayla Jackson" "Laura Luebbert" ...
#>  $ topics    : chr [1:11] "S4 class" "spatial single-cell genomics" "exploratory spatial data analysis" "Moran's I" ...
#>  $ focused   : chr "Bioconductor 3.22 introduces the Voyager package, which employs advanced exploratory spatial data analysis (ESD"| __truncated__
#>  $ coherence : int 95
#>  $ persuasion: num 0.85