Extract genomic features from a GTFParquet object

Methods to extract genomic features from a GTFParquet object as GRanges. Unlike TxDb methods, these preserve all GTF attributes as metadata columns.

Usage

# S4 method for class 'GTFParquet'
genes(x, columns = NULL, filter = NULL, use_versioned_ids = FALSE)

# S4 method for class 'GTFParquet'
transcripts(x, columns = NULL, filter = NULL, use_versioned_ids = FALSE)

# S4 method for class 'GTFParquet'
exons(x, columns = NULL, filter = NULL, use_versioned_ids = FALSE)

# S4 method for class 'GTFParquet'
cds(x, columns = NULL, filter = NULL)

Arguments

x: A GTFParquet object.
columns: Character vector of attribute columns to include in mcols. If NULL (default), all available attribute columns are included. Available columns for genes: gene_name, gene_type, source, level, tags, havana_gene. Available columns for transcripts: transcript_name, transcript_type, gene_id, gene_name, transcript_support_level, ccdsid, protein_id.
filter: Optional named list for filtering features. Names should be column names, values are vectors of acceptable values. Example: filter = list(gene_type = "protein_coding", chrom = "chr1")
use_versioned_ids: Logical. If TRUE, use full versioned IDs (e.g., ENSG00000141510.18). If FALSE (default), use stripped IDs (e.g., ENSG00000141510).
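
To illustrate the difference between versioned and stripped IDs, here is a minimal base R sketch. The helper strip_version() is hypothetical and is not part of the package; it assumes the version is a trailing ".<digits>" suffix, so IDs carrying extra suffixes (for example "_PAR_Y" in some GENCODE releases) would need separate handling. The method's actual implementation may differ.

```r
# Hypothetical helper (not part of the package): drop a trailing ".<digits>"
# version suffix from Ensembl-style identifiers.
strip_version <- function(ids) {
  sub("\\.[0-9]+$", "", ids)
}

versioned <- c("ENSG00000141510.18", "ENST00000269305.9")
stripped <- strip_version(versioned)
print(stripped)
```

Stripped IDs are convenient for joining against resources that omit versions, while versioned IDs pin a result to a specific annotation release.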

Value

A GRanges object with:

Feature IDs as names
Genomic coordinates (seqnames, ranges, strand)
Genome build in seqinfo (e.g., "GRCh38")
Rich metadata in mcols

Details

These methods return GRanges objects with feature IDs as names and rich metadata columns from the original GTF file.

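The filter argument is applied while the Parquet data is read, using Arrow/Parquet predicate pushdown, so only matching rows are loaded into memory. On large annotations this is typically much faster, and uses less memory, than loading all features and subsetting the resulting GRanges afterwards.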

Available filter columns include:

chrom: Chromosome name
gene_type: Gene biotype (e.g., "protein_coding", "lncRNA")
transcript_type: Transcript biotype
level: Annotation confidence (1=verified, 2=manual, 3=automatic)
source: Annotation source ("HAVANA", "ENSEMBL")
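
The semantics of the filter list can be sketched in plain base R on a toy table. This is an in-memory illustration only, not the Arrow pushdown the methods use. The helper apply_filter() and the toy data are hypothetical, and the sketch assumes that conditions on different columns are combined with AND, as the combined-filter example under Examples suggests.

```r
# Toy annotation table standing in for the Parquet data (illustrative values).
ann <- data.frame(
  gene_id   = c("g1", "g2", "g3", "g4"),
  chrom     = c("chr1", "chr1", "chr2", "chr1"),
  gene_type = c("protein_coding", "lncRNA", "protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose value in each named column is one of
# the acceptable values; conditions across columns are ANDed (assumption).
apply_filter <- function(df, filter) {
  keep <- rep(TRUE, nrow(df))
  for (col in names(filter)) {
    keep <- keep & df[[col]] %in% filter[[col]]
  }
  df[keep, , drop = FALSE]
}

hits <- apply_filter(ann, list(gene_type = "protein_coding", chrom = "chr1"))
print(hits$gene_id)
```

Because each list element is a vector of acceptable values, passing several values for one column (for example chrom = c("chr1", "chr2")) widens the match for that column.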

Examples

if (FALSE) { # \dontrun{
gtf <- GTFParquet(system.file("gc49", package="lkparq"))

# Extract all genes with full attributes
gr <- genes(gtf)
mcols(gr)  # gene_name, gene_type, level, tags, source, havana_gene

# Filter by gene type
pc <- genes(gtf, filter = list(gene_type = "protein_coding"))
lnc <- genes(gtf, filter = list(gene_type = "lncRNA"))

# Combine filters
pc_chr1 <- genes(gtf, filter = list(gene_type = "protein_coding", chrom = "chr1"))

# Select specific columns only
gr <- genes(gtf, columns = c("gene_name", "gene_type"))

# Use versioned IDs
gr <- genes(gtf, use_versioned_ids = TRUE)
names(gr)[1]  # a versioned ID, e.g. "ENSG00000141510.18"

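# Transcripts with support level
tx <- transcripts(gtf)
# %in% treats a missing support level as FALSE; `==` would yield NA,
# and GRanges subsetting does not accept NA subscripts
high_conf <- tx[mcols(tx)$transcript_support_level %in% "1"]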

# Exons
ex <- exons(gtf, filter = list(chrom = "chr1"))

# CDS with protein IDs
cds_gr <- cds(gtf)
mcols(cds_gr)$protein_id
} # }