Use the extract_data facility defined in ellmer's doc to obtain summary information about textual content. Originally tailored to vignettes in bioconductor; it is newly generalized to handle any pdf, html or text in URL.
Source:R/vig2data.R
vig2data.RdUse the extract_data facility defined in ellmer's doc to obtain summary information about textual content. Originally tailored to vignettes in bioconductor; it is newly generalized to handle any pdf, html or text in URL.
Usage
vig2data(
url = "https://bioconductor.org/packages/release/bioc/html/Voyager.html",
maxnchar = 30000,
n_pdf_pages = 10,
model = "gpt-4o",
...
)Arguments
- url
character(1) URL for an html bioconductor vignettes
- maxnchar
numeric(1) text is truncated to a substring with this length
- n_pdf_pages
numeric(1) maximum number of pages to extract text from for pdf vignettes
- model
character(1) model for use with chat_openai, defaults to gpt-4o
- ...
passed to chat_openai
Note
Based on code from https://cran.r-project.org/web/packages/ellmer/vignettes/structured-data.html March 15 2025. Requires that OPENAI_API_KEY is available in environment.
Examples
if (interactive()) {
# be sure OPENAI_API_KEY is available to Sys.getenv
tst = vig2data()
str(tst)
}
#> List of 5
#> $ author : chr [1:5] "Lambda Moses" "Alik Huseynov" "Kayla Jackson" "Laura Luebbert" ...
#> $ topics : chr [1:11] "S4 class" "spatial single-cell genomics" "exploratory spatial data analysis" "Moran's I" ...
#> $ focused : chr "Bioconductor 3.22 introduces the Voyager package, which employs advanced exploratory spatial data analysis (ESD"| __truncated__
#> $ coherence : int 95
#> $ persuasion: num 0.85