Retrieve the top-k semantically closest EDAM terms per type — retrieve_edam

Embeds content using the provider recorded in edam_emb, computes cosine similarity against a pre-computed EDAM embedding matrix, and returns up to retrieve_k candidates per EDAM type (topic, operation, data, format). Candidates whose similarity falls below sim_threshold are dropped.
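
The ranking step described above can be illustrated with a small base-R sketch. This is not the package's implementation: top_k_cosine and the toy matrix are hypothetical, and the order in which the threshold and the top-k cut are applied is an assumption.

```r
# Hypothetical helper, not part of the package: rank the rows of `emb_mat`
# by cosine similarity to `query`, drop rows below `threshold`, keep the top `k`.
top_k_cosine <- function(query, emb_mat, k = 2L, threshold = 0.3) {
    # Scale the query and every row of the matrix to unit length
    q <- query / sqrt(sum(query^2))
    m <- emb_mat / sqrt(rowSums(emb_mat^2))
    # With unit vectors, the dot product equals the cosine similarity
    sims <- drop(m %*% q)
    keep <- which(sims >= threshold)
    keep <- keep[order(sims[keep], decreasing = TRUE)]
    keep <- head(keep, k)
    data.frame(
        id = rownames(emb_mat)[keep],
        similarity = unname(sims[keep]),
        stringsAsFactors = FALSE
    )
}

# Toy 3-dimensional "embeddings" with made-up term names
toy <- rbind(
    term_a = c(1, 0, 0),
    term_b = c(0.8, 0.6, 0),
    term_c = c(0, 0, 1)
)
res <- top_k_cosine(c(1, 0.1, 0), toy, k = 2L)
res$id
```

Here term_c is orthogonal to the query, so it falls below the threshold and is excluded regardless of k.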

Usage

retrieve_edam_candidates(
  content,
  edam_emb,
  retrieve_k = 75L,
  sim_threshold = 0.3,
  embed_model = edam_emb$model
)

Arguments

content: character(1) text to use as the query (e.g. a package description).
edam_emb: list as returned by get_edam_embeddings or make_edam_embeddings.
retrieve_k: integer(1) maximum number of candidates to keep per type. Defaults to 75L.
sim_threshold: numeric(1) minimum cosine similarity; candidates below this value are dropped before the LLM selection step. Defaults to 0.3.
embed_model: character(1) embedding model; must match the model used to build edam_emb. Defaults to edam_emb$model.
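
The embed_model requirement suggests a guard that runs before any embedding request is made. A minimal sketch of such a check follows, assuming only that edam_emb carries a model element; check_embed_model and fake_emb are hypothetical names, and the package's actual check may differ.

```r
# Hypothetical guard, not the package's code: fail early when the requested
# embedding model differs from the one recorded in the artifact.
check_embed_model <- function(embed_model, edam_emb) {
    if (!identical(embed_model, edam_emb$model)) {
        stop(sprintf(
            "embed_model '%s' does not match the artifact model '%s'.",
            embed_model, edam_emb$model
        ), call. = FALSE)
    }
    invisible(TRUE)
}

# Stand-in for an embeddings artifact; only the `model` element is used here
fake_emb <- list(model = "text-embedding-3-small")

msg <- tryCatch(
    check_embed_model("text-embedding-3-large", fake_emb),
    error = function(e) conditionMessage(e)
)
ok <- check_embed_model("text-embedding-3-small", fake_emb)
```

Comparing with identical() avoids surprises from NULL or zero-length values, which would make a plain != comparison return a zero-length result inside if().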

Value

Named list of four data.frames (topic, operation, data, format), each with columns id and lbl and at most retrieve_k rows, ordered by descending cosine similarity to content.
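
Because the result is one data.frame per type, downstream code may prefer a single table. The sketch below flattens a mocked result of the documented shape; the ids and labels are placeholders rather than real EDAM terms, and the empty data.frames are an assumption about what a type with no surviving candidates would look like.

```r
# Mocked return value with the documented shape (placeholder ids and labels)
empty <- data.frame(id = character(0), lbl = character(0),
                    stringsAsFactors = FALSE)
candidates <- list(
    topic = data.frame(id = c("topic_x", "topic_y"),
                       lbl = c("Label X", "Label Y"),
                       stringsAsFactors = FALSE),
    operation = data.frame(id = "operation_z", lbl = "Label Z",
                           stringsAsFactors = FALSE),
    data = empty,
    format = empty
)

# Drop types with no candidates, then tag each remaining row with its type
non_empty <- Filter(function(df) nrow(df) > 0L, candidates)
parts <- Map(
    function(type, df) data.frame(type = type, df, stringsAsFactors = FALSE),
    names(non_empty), non_empty
)
combined <- do.call(rbind, unname(parts))
combined
```

Within each type the original row order is preserved, so the similarity ranking survives the flattening.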

Examples

# Model mismatch is caught before any API call
emb <- get_edam_embeddings()
#> Loading bundled EDAM embeddings from /private/var/folders/yw/gfhgh7k565v9w83x_k764wbc0000gp/T/RtmpbnJ5Gy/temp_libpathed7e7dd7ef17/biocEDAM/demo_embedding/edam_embeddings.rds
tryCatch(
    retrieve_edam_candidates("some text", emb,
                             embed_model = "text-embedding-3-large"),
    error = function(e) conditionMessage(e)
)
#> [1] "embed_model 'text-embedding-3-large' does not match the artifact model 'text-embedding-3-small'.\nRun make_edam_embeddings(model = 'text-embedding-3-large') to generate a matching artifact."

if (interactive()) {
    candidates <- retrieve_edam_candidates(
        "RNA-seq transcript quantification and metadata management",
        emb, retrieve_k = 5L)
    candidates$topic
}