R is a programming language focusing on activities important in statistical data analysis, but it is often thought of more as an interactive environment for working with data. This concept of “environment” will become clearer as we use R to perform progressively mode complex tasks.
(7+5)/(3*6)
#> [1] 0.6666667
s = "abc"
s
#> [1] "abc"
paste(s, s, sep="")
#> [1] "abcabc"
set.seed(1234) # setting a seed for reproducibility
x = rnorm(5)
x
#> [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
In the example s = "abc"
, we formed a string constant by enclosing abc
in quotes and assigned it to the variable s
. This operation could also be written s <- "abc"
or "abc" -> s
.
The most common syntax for working with R was illustrated in the string operation illustration: paste(s, s, sep="")
. This operation took the string variables s
and concatenated it with itself. The additional function argument sep=""
arranges that the concatenation is to occur without any intervening text.
The examples illustrate the use of two more functions: rnorm
, and plot
.
The family of functions and data available in an R session depends upon the collection of packages attached in the session. Packages are used to collect related functions together with their documentation. We can add to the collection of functions available for plotting, and use the ggplot approach to designing visualizations, as follows.
library(ggplot2)
simpdf = data.frame(x,y)
ggplot(simpdf, aes(x=x, y=y)) + geom_point()
These questions should be considered in preparing to work with R for bioinformatics analysis.
The Bioconductor software/analysis/documentation ecosystem can help with answers to these questions. There are many facets to the ecosystem. Bioconductor users form an active community and post questions and answers at support.bioconductor.org
. You can often get input on answers to some of the questions above by browsing and searching that site and posting your own questions.
The BSgenome
package defines infrastructure that supports compact representation of genomic sequences. We’ll have a look at a recent reference build for baker’s yeast. For this to work, we need to know the name of the package that provides the genomic sequence, install it, and use appropriate functions to work with the BSgenome representation.
# BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3") if necessary
library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeastg = BSgenome.Scerevisiae.UCSC.sacCer3
yeastg
#> Yeast genome:
#> # organism: Saccharomyces cerevisiae (Yeast)
#> # genome: sacCer3
#> # provider: UCSC
#> # release date: April 2011
#> # 17 sequences:
#> # chrI chrII chrIII chrIV chrV chrVI chrVII chrVIII chrIX
#> # chrX chrXI chrXII chrXIII chrXIV chrXV chrXVI chrM
#> # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
#> # to access a given sequence)
Evaluation of yeastg
in the R session produces a brief report on the BSgenome representation. We can look at chromosomal sequence with
yeastg$chrI
#> 230218-letter DNAString object
#> seq: CCACACCACACCCACACACCCACACACCACACCACA...GGTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG
What we call yeastg
is a Bioconductor representation of the genome of a model organism. Similar representations are available for more species:
length(available.genomes())
#> [1] 102
head(available.genomes())
#> [1] "BSgenome.Alyrata.JGI.v1"
#> [2] "BSgenome.Amellifera.BeeBase.assembly4"
#> [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"
#> [4] "BSgenome.Amellifera.UCSC.apiMel2"
#> [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
#> [6] "BSgenome.Aofficinalis.NCBI.V1"
Each of the genomes listed in the output of available_genomes()
can be obtained by installing the associated package, attaching the package and then operating with the eponymous object, as in yeastg = BSgenome.Scerevisiae.UCSC.sacCer3
. All of the objects representing the reference genomes of model organisms are instances of a ‘class’. Class definitions are used to group related data elements together.
class(yeastg)
#> [1] "BSgenome"
#> attr(,"package")
#> [1] "BSgenome"
One way of discovering the family of functions that have been defined to work with instances of a class is to use the methods
function:
methods(class="BSgenome")
#> [1] [[ $ as.list bsgenomeName
#> [5] coerce commonName countPWM export
#> [9] extractAt getSeq injectSNPs length
#> [13] masknames matchPWM metadata metadata<-
#> [17] mseqnames names organism provider
#> [21] providerVersion releaseDate releaseName seqinfo
#> [25] seqinfo<- seqnames seqnames<- show
#> [29] snpcount SNPlocs_pkgname snplocs sourceUrl
#> [33] vcountPattern vcountPDict Views vmatchPattern
#> [37] vmatchPDict
#> see '?methods' for accessing help and source code
Let’s try one:
atgm = vmatchPattern("ATG", yeastg)
atgm
#> GRanges object with 444764 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chrI 283-285 +
#> [2] chrI 335-337 +
#> [3] chrI 388-390 +
#> [4] chrI 436-438 +
#> [5] chrI 492-494 +
#> ... ... ... ...
#> [444760] chrM 85069-85071 -
#> [444761] chrM 85322-85324 -
#> [444762] chrM 85470-85472 -
#> [444763] chrM 85681-85683 -
#> [444764] chrM 85776-85778 -
#> -------
#> seqinfo: 17 sequences from an unspecified genome
This gives us the locations of all occurrences of ATG
in the yeast genome.
Model organism genomes and transcriptomes are catalogued at UCSC and Ensembl. The UCSC catalog is attached and used as follows:
library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
ytx = TxDb.Scerevisiae.UCSC.sacCer3.sgdGene
class(ytx)
#> [1] "TxDb"
#> attr(,"package")
#> [1] "GenomicFeatures"
ytx
#> TxDb object:
#> # Db type: TxDb
#> # Supporting package: GenomicFeatures
#> # Data source: UCSC
#> # Genome: sacCer3
#> # Organism: Saccharomyces cerevisiae
#> # Taxonomy ID: 4932
#> # UCSC Table: sgdGene
#> # Resource URL: http://genome.ucsc.edu/
#> # Type of Gene ID: Name of canonical transcript in cluster
#> # Full dataset: yes
#> # miRBase build ID: NA
#> # transcript_nrow: 6692
#> # exon_nrow: 7034
#> # cds_nrow: 7034
#> # Db created by: GenomicFeatures package from Bioconductor
#> # Creation time: 2015-10-07 18:20:42 +0000 (Wed, 07 Oct 2015)
#> # GenomicFeatures version at creation time: 1.21.30
#> # RSQLite version at creation time: 1.0.0
#> # DBSCHEMAVERSION: 1.1
As before, we can discover available functions for working with such catalogues:
methods(class="TxDb")
#> [1] $ $<- annotatedDataFrameFrom
#> [4] as.list asBED asGFF
#> [7] assayData assayData<- cds
#> [10] cdsBy cdsByOverlaps coerce
#> [13] columns combine contents
#> [16] dbconn dbfile dbInfo
#> [19] dbmeta dbschema disjointExons
#> [22] distance exons exonsBy
#> [25] exonsByOverlaps ExpressionSet extractUpstreamSeqs
#> [28] featureNames featureNames<- fiveUTRsByTranscript
#> [31] genes initialize intronsByTranscript
#> [34] isActiveSeq isActiveSeq<- isNA
#> [37] keys keytypes mapIds
#> [40] mapIdsToRanges mappedkeys mapRangesToIds
#> [43] mapToTranscripts metadata microRNAs
#> [46] nhit organism promoters
#> [49] revmap sample sampleNames
#> [52] sampleNames<- saveDb select
#> [55] seqinfo seqinfo<- seqlevels<-
#> [58] seqlevels0 show species
#> [61] storageMode storageMode<- taxonomyId
#> [64] threeUTRsByTranscript transcripts transcriptsBy
#> [67] transcriptsByOverlaps tRNAs updateObject
#> see '?methods' for accessing help and source code
To obtain the DNA sequences of all yeast genes, use getSeq
: First, we need the addresses of the genes. We use a special representation called GRanges
.
y_g = genes(ytx)
y_g
#> GRanges object with 6534 ranges and 1 metadata column:
#> seqnames ranges strand | gene_id
#> <Rle> <IRanges> <Rle> | <character>
#> Q0010 chrM 3952-4415 + | Q0010
#> Q0032 chrM 11667-11957 + | Q0032
#> Q0055 chrM 13818-26701 + | Q0055
#> Q0075 chrM 24156-25255 + | Q0075
#> Q0080 chrM 27666-27812 + | Q0080
#> ... ... ... ... . ...
#> YPR200C chrXVI 939279-939671 - | YPR200C
#> YPR201W chrXVI 939922-941136 + | YPR201W
#> YPR202W chrXVI 943032-944188 + | YPR202W
#> YPR204C-A chrXVI 946856-947338 - | YPR204C-A
#> YPR204W chrXVI 944603-947701 + | YPR204W
#> -------
#> seqinfo: 17 sequences (1 circular) from sacCer3 genome
Instances of the GRanges
class are used to collect and annotate regions defined in genomic coordinates. getSeq(x,y)
extracts nucleotide sequences from a genome x
, at the addresses given by y
:
getSeq(yeastg, y_g)
#> DNAStringSet object of length 6534:
#> width seq names
#> [1] 464 ATATATTATATTATATTTTTAT...AATATACTTTATATTTTATAA Q0010
#> [2] 291 ATATTAATAATATATATATTAT...AAGTTTTTATATTCATTATAA Q0032
#> [3] 12884 ATGGTACAAAGATGATTATATT...ACACCAGCTGTACAATCTTAA Q0055
#> [4] 1100 ATATTAATATTATTAATAATAA...CATTATAACAATATCGAATAA Q0075
#> [5] 147 ATGCCACAATTAGTTCCATTTT...TTATTTATTTCTAAATTATAA Q0080
#> ... ... ...
#> [6530] 393 ATGGTAAGTTTCATAACGTCTA...AAATTGATTGTTAGTGGTTGA YPR200C
#> [6531] 1215 ATGTCAGAAGATCAAAAAAGTG...ATATGGAACAATAGAAATTAA YPR201W
#> [6532] 1157 ATGGAAATTGAAAACGAACGTA...CATGTGTGCTGCCCAAGTTGA YPR202W
#> [6533] 483 ATGATGCCTGCTAAACTGCAGC...CAGCATTCTCTCCACAGCTAG YPR204C-A
#> [6534] 3099 ATGGCAGACACACCCTCTGTGG...AGCAGAGAAGTTGGAGAGTGA YPR204W
For human studies, the same concepts apply.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
transcripts(txdb)
#> GRanges object with 82960 ranges and 2 metadata columns:
#> seqnames ranges strand | tx_id tx_name
#> <Rle> <IRanges> <Rle> | <integer> <character>
#> [1] chr1 11874-14409 + | 1 uc001aaa.3
#> [2] chr1 11874-14409 + | 2 uc010nxq.1
#> [3] chr1 11874-14409 + | 3 uc010nxr.1
#> [4] chr1 69091-70008 + | 4 uc001aal.1
#> [5] chr1 321084-321115 + | 5 uc001aaq.2
#> ... ... ... ... . ... ...
#> [82956] chrUn_gl000237 1-2686 - | 82956 uc011mgu.1
#> [82957] chrUn_gl000241 20433-36875 - | 82957 uc011mgv.2
#> [82958] chrUn_gl000243 11501-11530 + | 82958 uc011mgw.1
#> [82959] chrUn_gl000243 13608-13637 + | 82959 uc022brq.1
#> [82960] chrUn_gl000247 5787-5816 - | 82960 uc022brr.1
#> -------
#> seqinfo: 93 sequences (1 circular) from hg19 genome
We’ve pre-packaged a slice of the human transcriptome (for chr14) in the workshop-based package biocwk312
. This slice was massaged for visualization by functions in the TnT
package.
library(biocwk312)
library(TnT)
data(tnt_txtr_14)
tnt_txtr_14
#> A TxTrack
#> | Label: TxDb.Hsapiens.UCSC.hg19.knownGene
#> | Background: missing, use 'white'
#> | Height: 300
#> | Data:
#> | seqnames start end width strand tx_id tx_name
#> | <factor> <integer> <integer> <integer> <factor> <integer> <character>
#> | 1 chr14 19377594 19378574 981 + 50694 uc010tkp.2
#> | 2 chr14 19553365 19584942 31578 + 50695 uc001vuz.1
#> | 3 chr14 19553365 19584942 31578 + 50696 uc001vva.1
#> | 4 chr14 19553365 19584942 31578 + 50697 uc010ahc.1
#> | 5 chr14 19595544 19595584 41 + 50698 uc021rmy.1
#> | ... ... ... ... ... ... ... ...
#> | 2854 chr14 106576814 106598011 21198 - 53547 uc001ysv.3
#> | 2855 chr14 106804495 106805235 741 - 53548 uc001ysw.1
#> | 2856 chr14 106829593 106833706 4114 - 53549 uc001ysx.1
#> | 2857 chr14 107034169 107035191 1023 - 53550 uc001ysz.3
#> | 2858 chr14 107258705 107259792 1088 - 53551 uc001yta.1
#> | exons
#> | <list>
#> | 1 19377594:19378574:0:...,19377594:19378574:0:...,...
#> | 2 19553365:19553937:0:...,19558717:19558831:5352:...,19558991:19559164:5626:...,...
#> | 3 19553365:19553937:0:...,19558717:19558831:5352:...,19558991:19559164:5626:...,...
#> | 4 19553365:19553937:0:...,19558717:19558831:5352:...,19558991:19559164:5626:...,...
#> | 5 19595544:19595584:0:...
#> | ... ...
#> | 2854 106576814:106577071:0:...,106584837:106585512:8023:...,106586019:106586438:9205:...,...
#> | 2855 106804495:106804605:0:...,106805193:106805235:698:...,...
#> | 2856 106829593:106829895:0:...,106830971:106831203:1378:...,106832176:106832269:2583:...,...
#> | 2857 107034169:107035033:0:...,107035117:107035191:948:...,107034693:107035033:524:...,...
#> | 2858 107258705:107259632:0:...,107259716:107259792:1011:...,107259316:107259632:611:...,...
#> | display_label key id tooltip.tx_id tooltip.tx_name
#> | <character> <integer> <character> <integer> <character>
#> | 1 440153 uc010tkp.2 > 50694 50694 50694 uc010tkp.2
#> | 2 404785 uc001vuz.1 > 50695 50695 50695 uc001vuz.1
#> | 3 404785 uc001vva.1 > 50696 50696 50696 uc001vva.1
#> | 4 404785 uc010ahc.1 > 50697 50697 50697 uc010ahc.1
#> | 5 uc021rmy.1 > 50698 50698 50698 uc021rmy.1
#> | ... ... ... ... ... ...
#> | 2854 < uc001ysv.3 53547 53547 53547 uc001ysv.3
#> | 2855 < uc001ysw.1 53548 53548 53548 uc001ysw.1
#> | 2856 < uc001ysx.1 53549 53549 53549 uc001ysx.1
#> | 2857 < uc001ysz.3 53550 53550 53550 uc001ysz.3
#> | 2858 < uc001yta.1 53551 53551 53551 uc001yta.1
#> | color
#> | <character>
#> | 1 red
#> | 2 red
#> | 3 red
#> | 4 red
#> | 5 red
#> | ... ...
#> | 2854 red
#> | 2855 red
#> | 2856 red
#> | 2857 red
#> | 2858 red
Bioconductor has a number of packages that supply aligned RNA-seq reads in BAM format. We’ll use one that investigated the effects of knocking down a gene called HNRNPC in HeLa cells.
A character vector provides the paths to the BAM files:
length(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
#> [1] 8
RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
#> ERR127306
#> "/usr/local/lib/R/site-library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127306_chr14.bam"
The 8 files are grouped: the first four are for wild-type cells, the next four are for the cells treated to knock down HNRNPC.
labels = c("wt1", "wt2", "wt3", "wt4", "kd1", "kd2", "kd3", "kd4")
The GenomicAlignments
package provides functions to parse and ingest information from BAM files.
library(GenomicAlignments)
readGAlignments(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])
#> GAlignments object with 800484 alignments and 0 metadata columns:
#> seqnames strand cigar qwidth start end width
#> <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
#> [1] chr14 + 72M 72 19069583 19069654 72
#> [2] chr14 + 72M 72 19363738 19363809 72
#> [3] chr14 - 72M 72 19363755 19363826 72
#> [4] chr14 + 72M 72 19369799 19369870 72
#> [5] chr14 - 72M 72 19369828 19369899 72
#> ... ... ... ... ... ... ... ...
#> [800480] chr14 - 72M 72 106989780 106989851 72
#> [800481] chr14 + 72M 72 106994763 106994834 72
#> [800482] chr14 - 72M 72 106994819 106994890 72
#> [800483] chr14 + 72M 72 107003080 107003151 72
#> [800484] chr14 - 72M 72 107003171 107003242 72
#> njunc
#> <integer>
#> [1] 0
#> [2] 0
#> [3] 0
#> [4] 0
#> [5] 0
#> ... ...
#> [800480] 0
#> [800481] 0
#> [800482] 0
#> [800483] 0
#> [800484] 0
#> -------
#> seqinfo: 93 sequences from an unspecified genome
We’ll now select four of the BAM files to display coverage in the vicinity of the knocked-down gene. The function linetrack_covg_by_symbol
was written specifically to combine information on reads with positions of genes and transcripts.
allfi = lapply(c(1,2,5,6), function(x) {
curf = RNAseqData.HNRNPC.bam.chr14_BAMFILES[x]
linetrack_covg_by_symbol("HNRNPC", curf,
color="gray", radius=3e5, label=labels[x])
})
mytr = c(allfi, tnt_gt_sym, tnt_txtr_14)
TnTGenome(mytr)
#> - Missing argument `view.range`:
#> automatically select 21373939..22105960 on seqlevel chr14...