Methods to extract genomic features from a GTFParquet object as GRanges. Unlike TxDb methods, these preserve all GTF attributes as metadata columns.
Usage
# S4 method for class 'GTFParquet'
genes(x, columns=NULL, filter=NULL, use_versioned_ids=FALSE)
# S4 method for class 'GTFParquet'
transcripts(x, columns=NULL, filter=NULL, use_versioned_ids=FALSE)
# S4 method for class 'GTFParquet'
exons(x, columns=NULL, filter=NULL, use_versioned_ids=FALSE)
# S4 method for class 'GTFParquet'
cds(x, columns=NULL, filter=NULL)
Arguments
- x
A GTFParquet object.
- columns
Character vector of columns to include in mcols. If NULL (default), includes all available attribute columns. For genes: gene_name, gene_type, source, level, tags, havana_gene. For transcripts: transcript_name, transcript_type, gene_id, gene_name, transcript_support_level, ccdsid, protein_id.
- filter
Optional named list for filtering features. Names should be column names; values are vectors of acceptable values. Example: filter = list(gene_type = "protein_coding", chrom = "chr1")
- use_versioned_ids
Logical. If TRUE, use full versioned IDs (e.g., ENSG00000141510.18). If FALSE (default), use stripped IDs (e.g., ENSG00000141510).
Value
A GRanges object with:
- Feature IDs as names
- Genomic coordinates (seqnames, ranges, strand)
- Genome build in seqinfo (e.g., "GRCh38")
- Rich metadata in mcols
Details
These methods return GRanges objects with feature IDs as names and rich metadata columns from the original GTF file.
The filter argument enables efficient server-side filtering through Arrow/Parquet predicate pushdown: rows are filtered as the data is read, so only matching features are loaded. This can dramatically improve performance compared to subsetting after loading.
Available filter columns include:
- chrom: Chromosome name
- gene_type: Gene biotype (e.g., "protein_coding", "lncRNA")
- transcript_type: Transcript biotype
- level: Annotation confidence (1=verified, 2=manual, 3=automatic)
- source: Annotation source ("HAVANA", "ENSEMBL")
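The semantics of a named-list filter can be sketched in base R. This is only an illustration on a toy data frame, not the package's implementation (which pushes the predicate down to Arrow/Parquet instead of subsetting in memory). It assumes that conditions on different columns are combined with AND and that a feature matches a column when its value is any one of the acceptable values; apply_filter and genes_tbl are hypothetical names.

```r
# Toy stand-in for an annotation table (hypothetical data, not from a real GTF).
genes_tbl <- data.frame(
  gene_id   = c("G1", "G2", "G3", "G4"),
  chrom     = c("chr1", "chr1", "chr2", "chr1"),
  gene_type = c("protein_coding", "lncRNA", "protein_coding", "protein_coding"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows whose value in every named column is one of
# the acceptable values (assumed AND across columns, OR within a column).
apply_filter <- function(tbl, filter) {
  keep <- rep(TRUE, nrow(tbl))
  for (col in names(filter)) {
    if (!col %in% names(tbl)) stop("Unknown filter column: ", col)
    keep <- keep & tbl[[col]] %in% filter[[col]]
  }
  tbl[keep, , drop = FALSE]
}

res <- apply_filter(
  genes_tbl,
  list(gene_type = "protein_coding", chrom = "chr1")
)
res$gene_id  # "G1" "G4"
```

Passing several acceptable values for one column, for example list(chrom = c("chr1", "chr2")), would under the same assumption keep features on either chromosome.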
See also
GTFParquet-class for the class definition
transcriptsBy,GTFParquet-method for grouped extraction
genes for the generic
Examples
if (FALSE) { # \dontrun{
gtf <- GTFParquet(system.file("gc49", package="lkparq"))
# Extract all genes with full attributes
gr <- genes(gtf)
mcols(gr) # gene_name, gene_type, level, tags, source, havana_gene
# Filter by gene type
pc <- genes(gtf, filter = list(gene_type = "protein_coding"))
lnc <- genes(gtf, filter = list(gene_type = "lncRNA"))
# Combine filters
pc_chr1 <- genes(gtf, filter = list(gene_type = "protein_coding", chrom = "chr1"))
# Select specific columns only
gr <- genes(gtf, columns = c("gene_name", "gene_type"))
# Use versioned IDs
gr <- genes(gtf, use_versioned_ids = TRUE)
names(gr)[1] # a versioned ID, e.g. "ENSG00000141510.18"
# Transcripts with support level
tx <- transcripts(gtf)
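# %in% treats missing (NA) support levels as non-matching; `==` would yield NA,
# which is not allowed in a logical subscript for GRanges
high_conf <- tx[mcols(tx)$transcript_support_level %in% "1"]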
# Exons
ex <- exons(gtf, filter = list(chrom = "chr1"))
# CDS with protein IDs
cds_gr <- cds(gtf)
mcols(cds_gr)$protein_id
} # }