Assign EDAM ontology terms to text using a live SemanticSQL database and an LLM
Source: R/edamize.R
Usage
edamize(
content_for_edam,
provider = "anthropic",
model = "claude-sonnet-4-5",
nterms = 20L,
prescrub = TRUE,
prompt = read_prompt("edamize.txt"),
retrieve_k = NULL,
sim_threshold = 0.3,
embed_model = "text-embedding-3-small",
...
)
Arguments
- content_for_edam
character(1) text describing a bioinformatics resource
- provider
character(1) LLM provider; see llm_env_var. Defaults to "anthropic".
- model
character(1) model identifier for the selected provider. Defaults to "claude-sonnet-4-5".
- nterms
integer(1) approximate number of EDAM terms to select. Defaults to 20.
- prescrub
logical(1) if TRUE, apply cleantxt before processing. Defaults to TRUE.
- prompt
character(1) instruction text sent to the LLM before the EDAM vocabulary tables and content. The string must contain one %d placeholder that will be replaced by nterms. Defaults to the contents of inst/prompts/edamize.txt; supply your own string to customise curation behaviour without editing package files.
- retrieve_k
integer(1) or NULL. When not NULL, use embedding-based retrieval (via retrieve_edam_candidates) to pre-filter the EDAM vocabulary to the top retrieve_k candidates per type before LLM selection. Requires the API key for the embedding provider recorded in the artifact (see get_edam_embeddings and llm_env_var). Set to NULL to pass the full vocabulary directly to the LLM. Defaults to NULL.
- sim_threshold
numeric(1) minimum cosine similarity for a candidate term to be passed to the LLM. Terms below this threshold are dropped before the LLM selection step, reducing irrelevant tags. Only used when retrieve_k is not NULL. Defaults to 0.3.
- embed_model
character(1) embedding model used for retrieval; must match the model used to build the artifact. Defaults to "text-embedding-3-small".
- ...
passed to llm_chat
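How retrieve_k and sim_threshold interact can be pictured with a small base-R sketch. This is not the implementation of retrieve_edam_candidates: the helper name filter_candidates, the toy vectors, and the order of operations (threshold first, then top k) are all assumptions made for illustration.

```r
# Hypothetical helper: rank candidate terms by cosine similarity to a query
# embedding, drop those below sim_threshold, and keep at most retrieve_k.
filter_candidates <- function(query, term_matrix,
                              retrieve_k = 2L, sim_threshold = 0.3) {
  # Cosine similarity between the query and each row of term_matrix
  sims <- as.vector(term_matrix %*% query) /
    (sqrt(rowSums(term_matrix^2)) * sqrt(sum(query^2)))
  names(sims) <- rownames(term_matrix)
  # Apply the similarity floor, then keep the best retrieve_k survivors
  kept <- sort(sims[sims >= sim_threshold], decreasing = TRUE)
  head(kept, retrieve_k)
}

# Toy 3-dimensional "embeddings" (made up; real embeddings are much longer)
terms <- rbind(
  term_a = c(1, 0, 0),
  term_b = c(1, 1, 0),
  term_c = c(0, 0, 1),
  term_d = c(1, 0, 1)
)
query <- c(1, 0.1, 0)

candidates <- filter_candidates(query, terms,
                                retrieve_k = 2L, sim_threshold = 0.3)
print(names(candidates))
```

In this toy case term_c falls below the threshold and term_d passes it but is cut by retrieve_k, so only the two closest terms would reach the LLM.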
Value
a data.frame with columns uri (full EDAM URI) and tm (term label),
restricted to confirmed vocabulary entries and deduplicated. Compatible with
mkdf, toline, and edam_graph.
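A minimal sketch of working with a result of this shape, grouping terms by EDAM branch. The data.frame below is built by hand to mimic the documented columns; its three rows are illustrative and should be checked against the current EDAM release. The branch and by_branch names are introduced here and are not part of the package.

```r
# Hand-built stand-in for an edamize() result (columns uri and tm)
lk <- data.frame(
  uri = c("http://edamontology.org/topic_3170",
          "http://edamontology.org/topic_0091",
          "http://edamontology.org/format_1930"),
  tm  = c("RNA-Seq", "Bioinformatics", "FASTQ"),
  stringsAsFactors = FALSE
)

# The EDAM branch (topic, operation, data, format) is the last path
# component of the URI, up to the underscore.
lk$branch <- sub("_.*$", "", basename(lk$uri))

# Collect term labels per branch
by_branch <- split(lk$tm, lk$branch)
print(by_branch)
```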
Note
This function replaces the former Python/curbioc.py implementation.
It connects to the current EDAM release via ontoProc2::semsql_connect() and
selects terms using chat_structured() via ellmer, so no JSON schema validation
loop is needed and hallucinated term labels are eliminated by post-filtering against
the actual vocabulary.
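The post-filtering idea can be sketched in a few lines of base R. The vocabulary table, its example.org URIs, and the proposed labels are placeholders invented for this sketch; the package's actual matching logic may differ (for example in how case or synonyms are handled).

```r
# Placeholder vocabulary: in the package this would come from the live
# EDAM release; the URIs and labels here are invented.
vocab <- data.frame(
  uri = c("http://example.org/term_1",
          "http://example.org/term_2",
          "http://example.org/term_3"),
  tm  = c("alpha", "beta", "gamma"),
  stringsAsFactors = FALSE
)

# Labels as an LLM might return them: one duplicate ("beta") and one
# label that does not exist in the vocabulary ("delta").
proposed <- c("beta", "delta", "alpha", "beta")

# Keep only vocabulary rows whose label was proposed, then deduplicate
confirmed <- vocab[vocab$tm %in% proposed, , drop = FALSE]
confirmed <- confirmed[!duplicated(confirmed$uri), , drop = FALSE]
print(confirmed)
```

Because the result is subset from the vocabulary table rather than built from the LLM output, a label such as "delta" cannot appear in it.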
Examples
# Input validation fires without any API key
tryCatch(edamize(list(a=1)), error = function(e) conditionMessage(e))
#> [1] "content_for_edam must be a single character string; did you mean to pass e.g. tst$focused?"
if (interactive() && nzchar(Sys.getenv("ANTHROPIC_API_KEY"))) {
  content <- readRDS(system.file("rds/tximetaFocused.rds", package = "biocEDAM"))
  lk <- edamize(content$focused, retrieve_k = NULL)
  print(lk)
}
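A custom prompt must carry exactly one %d placeholder for nterms. The sketch below checks that and previews the substitution with sprintf(); that edamize() fills the placeholder the same way is an assumption, and the prompt wording is invented for illustration.

```r
# Hypothetical custom instruction text with a single %d placeholder
custom_prompt <- paste(
  "You are curating a bioinformatics resource.",
  "Select approximately %d EDAM terms that best describe it."
)

# Verify there is exactly one %d before handing the prompt to edamize()
n_placeholders <- lengths(
  regmatches(custom_prompt, gregexpr("%d", custom_prompt, fixed = TRUE))
)
stopifnot(n_placeholders == 1L)

# Preview what the instruction would look like with nterms = 20L
filled <- sprintf(custom_prompt, 20L)
print(filled)

# With an API key configured, the prompt would be supplied like this:
# lk <- edamize(content$focused, prompt = custom_prompt)
```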