
Assign EDAM ontology terms to text using a live SemanticSQL database and an LLM

Usage

edamize(
  content_for_edam,
  provider = "anthropic",
  model = "claude-sonnet-4-5",
  nterms = 20L,
  prescrub = TRUE,
  prompt = read_prompt("edamize.txt"),
  retrieve_k = NULL,
  sim_threshold = 0.3,
  embed_model = "text-embedding-3-small",
  ...
)

Arguments

content_for_edam

character(1) text describing a bioinformatics resource

provider

character(1) LLM provider; see llm_env_var. Defaults to "anthropic".

model

character(1) model identifier for the selected provider. Defaults to "claude-sonnet-4-5".

nterms

integer(1) approximate number of EDAM terms to select. Defaults to 20.

prescrub

logical(1) if TRUE, apply cleantxt to the text before processing. Defaults to TRUE.

prompt

character(1) instruction text sent to the LLM before the EDAM vocabulary tables and content. The string must contain one %d placeholder that will be replaced by nterms. Defaults to the contents of inst/prompts/edamize.txt; supply your own string to customise curation behaviour without editing package files.

retrieve_k

integer(1) or NULL. When not NULL, use embedding-based retrieval (via retrieve_edam_candidates) to pre-filter the EDAM vocabulary to the top retrieve_k candidates per type before LLM selection. Retrieval requires the API key for the embedding provider recorded in the artifact (see get_edam_embeddings and llm_env_var). When NULL, the full vocabulary is passed directly to the LLM. Defaults to NULL, as shown in the usage signature.

sim_threshold

numeric(1) minimum cosine similarity for a candidate term to be passed to the LLM. Terms below this threshold are dropped before the LLM selection step, reducing irrelevant tags. Only used when retrieve_k is not NULL. Defaults to 0.3.

embed_model

character(1) embedding model used for retrieval; must match the model used to build the artifact. Defaults to "text-embedding-3-small".

...

passed to llm_chat
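
To illustrate how retrieve_k and sim_threshold interact, here is a minimal base R sketch of a top-k plus cosine-similarity pre-filter. It is not the code behind retrieve_edam_candidates: the toy embeddings, the query vector, and the helper names cosine_sim and prefilter are all hypothetical, and the package may apply the two filters in a different order.

```r
# Hypothetical toy data: four candidate term embeddings (rows) and one query.
term_emb <- rbind(
  "Sequence alignment" = c(1, 0, 0),
  "RNA-Seq"            = c(0.8, 0.6, 0),
  "Phylogenetics"      = c(0, 1, 0),
  "Image analysis"     = c(0, 0, 1)
)
query <- c(1, 0.2, 0)

# Cosine similarity between the query vector and every row of the matrix.
cosine_sim <- function(mat, vec) {
  as.vector(mat %*% vec) / (sqrt(rowSums(mat^2)) * sqrt(sum(vec^2)))
}

# Keep the k most similar rows, then drop any that fall below the threshold.
# (Hypothetical helper, analogous in spirit to retrieve_k and sim_threshold.)
prefilter <- function(mat, vec, k = 2L, threshold = 0.3) {
  sims <- cosine_sim(mat, vec)
  names(sims) <- rownames(mat)
  top <- head(sort(sims, decreasing = TRUE), k)
  top[top >= threshold]
}

kept <- prefilter(term_emb, query, k = 3L, threshold = 0.3)
print(kept)
```

With these toy vectors, "Phylogenetics" makes the top three but is dropped by the threshold, so only two candidates would reach the LLM selection step.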

Value

a data.frame with columns uri (full EDAM URI) and tm (term label), restricted to confirmed vocabulary entries and deduplicated. Compatible with mkdf, toline, and edam_graph.

Note

This function replaces the former Python/curbioc.py implementation. It connects to the current EDAM release via ontoProc2::semsql_connect() and selects terms with chat_structured() from ellmer. Structured output removes the need for a JSON schema validation loop, and post-filtering against the actual vocabulary eliminates hallucinated term labels.
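
The post-filtering idea in this note can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: the vocabulary table uses placeholder URIs rather than real EDAM URIs, the returned labels are invented, and confirm_terms is a hypothetical helper.

```r
# Hypothetical vocabulary table; placeholder URIs stand in for real EDAM URIs.
vocab <- data.frame(
  uri = c("http://example.org/term_1",
          "http://example.org/term_2",
          "http://example.org/term_3"),
  tm  = c("Sequence alignment", "RNA-Seq", "Phylogenetics"),
  stringsAsFactors = FALSE
)

# Invented LLM output: one duplicate and one label absent from the vocabulary.
llm_labels <- c("RNA-Seq", "Sequence alignment", "RNA-Seq",
                "Quantum transcriptomics")

# Keep only labels found in the vocabulary, then deduplicate by URI.
confirm_terms <- function(labels, vocab) {
  hits <- vocab[match(labels, vocab$tm), c("uri", "tm")]
  hits <- hits[!is.na(hits$uri), , drop = FALSE]      # drop unmatched labels
  hits <- hits[!duplicated(hits$uri), , drop = FALSE] # drop repeated terms
  rownames(hits) <- NULL
  hits
}

confirmed <- confirm_terms(llm_labels, vocab)
print(confirmed)
```

The result has the same two columns, uri and tm, that the Value section describes, and the label that does not exist in the vocabulary never reaches the output.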

Examples

# Input validation fires without any API key
tryCatch(edamize(list(a=1)), error = function(e) conditionMessage(e))
#> [1] "content_for_edam must be a single character string; did you mean to pass e.g. tst$focused?"

if (interactive() && nchar(Sys.getenv("ANTHROPIC_API_KEY")) > 0) {
  content <- readRDS(system.file("rds/tximetaFocused.rds", package="biocEDAM"))
  lk <- edamize(content$focused, retrieve_k = NULL)
  print(lk)
}
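
A custom prompt must contain exactly one %d placeholder for nterms. The sketch below checks that before calling edamize and previews the substitution with sprintf. The prompt wording is hypothetical, and using sprintf for the preview is an assumption about how the placeholder is filled, not a statement about the package internals.

```r
# Hypothetical instruction text; the single %d is where nterms is substituted.
my_prompt <- "Select approximately %d EDAM terms that best describe the resource."

# Count the %d placeholders and stop early if there is not exactly one.
n_placeholders <- lengths(
  regmatches(my_prompt, gregexpr("%d", my_prompt, fixed = TRUE))
)
stopifnot(n_placeholders == 1L)

# Preview the instruction text with nterms filled in (assumes sprintf-style
# substitution).
filled <- sprintf(my_prompt, 20L)
print(filled)

# The live call needs an API key, so it is guarded like the example above.
if (interactive() && nchar(Sys.getenv("ANTHROPIC_API_KEY")) > 0) {
  content <- readRDS(system.file("rds/tximetaFocused.rds", package = "biocEDAM"))
  print(edamize(content$focused, prompt = my_prompt, nterms = 20L))
}
```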