Prerequisites and motivating examples

Pre-requisites

  • Basic knowledge of R syntax
  • Familiarity with Rstudio
  • Basic understanding of modern genomics. For example, the distinction between whole genome sequencing and RNA-seq should be clear.
  • Basic familiarity with the concepts of statistical analysis, such as the definition of the t-test for comparing sample means, the interpretation of histograms. An understanding of experimental design is helpful.
  • Optional readings:

Programming and package use with R 4.0.3

R is a programming language focusing on activities important in statistical data analysis, but it is often thought of more as an interactive environment for working with data. This concept of “environment” will become clearer as we use R to perform progressively mode complex tasks.

Simple illustrations

  • simple arithmetic
(7+5)/(3*6)
#> [1] 0.6666667
  • assignment; string definitions and operations
s = "abc"
s
#> [1] "abc"
paste(s, s, sep="")
#> [1] "abcabc"
  • random number generation
set.seed(1234) # setting a seed for reproducibility
x = rnorm(5)
x
#> [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247
  • data visualization
y = rnorm(5)
plot(x,y)

Making new variables

In the example s = "abc", we formed a string constant by enclosing abc in quotes and assigned it to the variable s. This operation could also be written s <- "abc" or "abc" -> s.

Calling functions

The most common syntax for working with R was illustrated in the string operation illustration: paste(s, s, sep=""). This operation took the string variables s and concatenated it with itself. The additional function argument sep="" arranges that the concatenation is to occur without any intervening text.

The examples illustrate the use of two more functions: rnorm, and plot.

Packages

The family of functions and data available in an R session depends upon the collection of packages attached in the session. Packages are used to collect related functions together with their documentation. We can add to the collection of functions available for plotting, and use the ggplot approach to designing visualizations, as follows.

library(ggplot2)
simpdf = data.frame(x,y)
ggplot(simpdf, aes(x=x, y=y)) + geom_point()

Summary: strategies for bioinformatics with R/Bioconductor

These questions should be considered in preparing to work with R for bioinformatics analysis.

  • What question do I want to answer with this analysis?
  • What data am I going to use? Have I clearly documented the production of the data, including its origin, quality assessment, data editing, sample annotations?
  • What is a clear verbal description of the analysis to be performed?
  • What R/Bioconductor packages can be used to conduct the analysis?
  • If the analysis consists of a sequence of steps, should I save interim results in case an error occurs somewhere downstream?
  • What steps will I take to ensure that the analysis is reproducible by independent parties?

The Bioconductor software/analysis/documentation ecosystem can help with answers to these questions. There are many facets to the ecosystem. Bioconductor users form an active community and post questions and answers at support.bioconductor.org. You can often get input on answers to some of the questions above by browsing and searching that site and posting your own questions.

Basics of genome annotation: reference genomes, transcriptomes, functional annotation

Reference genomic sequence

The BSgenome package defines infrastructure that supports compact representation of genomic sequences. We’ll have a look at a recent reference build for baker’s yeast. For this to work, we need to know the name of the package that provides the genomic sequence, install it, and use appropriate functions to work with the BSgenome representation.

# BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3") if necessary
library(BSgenome)
library(BSgenome.Scerevisiae.UCSC.sacCer3)
yeastg = BSgenome.Scerevisiae.UCSC.sacCer3
yeastg
#> Yeast genome:
#> # organism: Saccharomyces cerevisiae (Yeast)
#> # genome: sacCer3
#> # provider: UCSC
#> # release date: April 2011
#> # 17 sequences:
#> #   chrI    chrII   chrIII  chrIV   chrV    chrVI   chrVII  chrVIII chrIX  
#> #   chrX    chrXI   chrXII  chrXIII chrXIV  chrXV   chrXVI  chrM           
#> # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
#> # to access a given sequence)

Evaluation of yeastg in the R session produces a brief report on the BSgenome representation. We can look at chromosomal sequence with

yeastg$chrI
#> 230218-letter DNAString object
#> seq: CCACACCACACCCACACACCCACACACCACACCACA...GGTGTGTGGGTGTGGTGTGGGTGTGGTGTGTGTGGG

What we call yeastg is a Bioconductor representation of the genome of a model organism. Similar representations are available for more species:

length(available.genomes())
#> [1] 102
head(available.genomes())
#> [1] "BSgenome.Alyrata.JGI.v1"                
#> [2] "BSgenome.Amellifera.BeeBase.assembly4"  
#> [3] "BSgenome.Amellifera.NCBI.AmelHAv3.1"    
#> [4] "BSgenome.Amellifera.UCSC.apiMel2"       
#> [5] "BSgenome.Amellifera.UCSC.apiMel2.masked"
#> [6] "BSgenome.Aofficinalis.NCBI.V1"

Each of the genomes listed in the output of available_genomes() can be obtained by installing the associated package, attaching the package and then operating with the eponymous object, as in yeastg = BSgenome.Scerevisiae.UCSC.sacCer3. All of the objects representing the reference genomes of model organisms are instances of a ‘class’. Class definitions are used to group related data elements together.

class(yeastg)
#> [1] "BSgenome"
#> attr(,"package")
#> [1] "BSgenome"

One way of discovering the family of functions that have been defined to work with instances of a class is to use the methods function:

methods(class="BSgenome")
#>  [1] [[              $               as.list         bsgenomeName   
#>  [5] coerce          commonName      countPWM        export         
#>  [9] extractAt       getSeq          injectSNPs      length         
#> [13] masknames       matchPWM        metadata        metadata<-     
#> [17] mseqnames       names           organism        provider       
#> [21] providerVersion releaseDate     releaseName     seqinfo        
#> [25] seqinfo<-       seqnames        seqnames<-      show           
#> [29] snpcount        SNPlocs_pkgname snplocs         sourceUrl      
#> [33] vcountPattern   vcountPDict     Views           vmatchPattern  
#> [37] vmatchPDict    
#> see '?methods' for accessing help and source code

Let’s try one:

atgm = vmatchPattern("ATG", yeastg)
atgm
#> GRanges object with 444764 ranges and 0 metadata columns:
#>            seqnames      ranges strand
#>               <Rle>   <IRanges>  <Rle>
#>        [1]     chrI     283-285      +
#>        [2]     chrI     335-337      +
#>        [3]     chrI     388-390      +
#>        [4]     chrI     436-438      +
#>        [5]     chrI     492-494      +
#>        ...      ...         ...    ...
#>   [444760]     chrM 85069-85071      -
#>   [444761]     chrM 85322-85324      -
#>   [444762]     chrM 85470-85472      -
#>   [444763]     chrM 85681-85683      -
#>   [444764]     chrM 85776-85778      -
#>   -------
#>   seqinfo: 17 sequences from an unspecified genome

This gives us the locations of all occurrences of ATG in the yeast genome.

The TxDb family

Model organism genomes and transcriptomes are catalogued at UCSC and Ensembl. The UCSC catalog is attached and used as follows:

library(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene)
ytx = TxDb.Scerevisiae.UCSC.sacCer3.sgdGene
class(ytx)
#> [1] "TxDb"
#> attr(,"package")
#> [1] "GenomicFeatures"
ytx
#> TxDb object:
#> # Db type: TxDb
#> # Supporting package: GenomicFeatures
#> # Data source: UCSC
#> # Genome: sacCer3
#> # Organism: Saccharomyces cerevisiae
#> # Taxonomy ID: 4932
#> # UCSC Table: sgdGene
#> # Resource URL: http://genome.ucsc.edu/
#> # Type of Gene ID: Name of canonical transcript in cluster
#> # Full dataset: yes
#> # miRBase build ID: NA
#> # transcript_nrow: 6692
#> # exon_nrow: 7034
#> # cds_nrow: 7034
#> # Db created by: GenomicFeatures package from Bioconductor
#> # Creation time: 2015-10-07 18:20:42 +0000 (Wed, 07 Oct 2015)
#> # GenomicFeatures version at creation time: 1.21.30
#> # RSQLite version at creation time: 1.0.0
#> # DBSCHEMAVERSION: 1.1

As before, we can discover available functions for working with such catalogues:

methods(class="TxDb")
#>  [1] $                      $<-                    annotatedDataFrameFrom
#>  [4] as.list                asBED                  asGFF                 
#>  [7] assayData              assayData<-            cds                   
#> [10] cdsBy                  cdsByOverlaps          coerce                
#> [13] columns                combine                contents              
#> [16] dbconn                 dbfile                 dbInfo                
#> [19] dbmeta                 dbschema               disjointExons         
#> [22] distance               exons                  exonsBy               
#> [25] exonsByOverlaps        ExpressionSet          extractUpstreamSeqs   
#> [28] featureNames           featureNames<-         fiveUTRsByTranscript  
#> [31] genes                  initialize             intronsByTranscript   
#> [34] isActiveSeq            isActiveSeq<-          isNA                  
#> [37] keys                   keytypes               mapIds                
#> [40] mapIdsToRanges         mappedkeys             mapRangesToIds        
#> [43] mapToTranscripts       metadata               microRNAs             
#> [46] nhit                   organism               promoters             
#> [49] revmap                 sample                 sampleNames           
#> [52] sampleNames<-          saveDb                 select                
#> [55] seqinfo                seqinfo<-              seqlevels<-           
#> [58] seqlevels0             show                   species               
#> [61] storageMode            storageMode<-          taxonomyId            
#> [64] threeUTRsByTranscript  transcripts            transcriptsBy         
#> [67] transcriptsByOverlaps  tRNAs                  updateObject          
#> see '?methods' for accessing help and source code

To obtain the DNA sequences of all yeast genes, use getSeq: First, we need the addresses of the genes. We use a special representation called GRanges.

y_g = genes(ytx)
y_g
#> GRanges object with 6534 ranges and 1 metadata column:
#>             seqnames        ranges strand |     gene_id
#>                <Rle>     <IRanges>  <Rle> | <character>
#>       Q0010     chrM     3952-4415      + |       Q0010
#>       Q0032     chrM   11667-11957      + |       Q0032
#>       Q0055     chrM   13818-26701      + |       Q0055
#>       Q0075     chrM   24156-25255      + |       Q0075
#>       Q0080     chrM   27666-27812      + |       Q0080
#>         ...      ...           ...    ... .         ...
#>     YPR200C   chrXVI 939279-939671      - |     YPR200C
#>     YPR201W   chrXVI 939922-941136      + |     YPR201W
#>     YPR202W   chrXVI 943032-944188      + |     YPR202W
#>   YPR204C-A   chrXVI 946856-947338      - |   YPR204C-A
#>     YPR204W   chrXVI 944603-947701      + |     YPR204W
#>   -------
#>   seqinfo: 17 sequences (1 circular) from sacCer3 genome

Instances of the GRanges class are used to collect and annotate regions defined in genomic coordinates. getSeq(x,y) extracts nucleotide sequences from a genome x, at the addresses given by y:

getSeq(yeastg, y_g)
#> DNAStringSet object of length 6534:
#>        width seq                                            names               
#>    [1]   464 ATATATTATATTATATTTTTAT...AATATACTTTATATTTTATAA Q0010
#>    [2]   291 ATATTAATAATATATATATTAT...AAGTTTTTATATTCATTATAA Q0032
#>    [3] 12884 ATGGTACAAAGATGATTATATT...ACACCAGCTGTACAATCTTAA Q0055
#>    [4]  1100 ATATTAATATTATTAATAATAA...CATTATAACAATATCGAATAA Q0075
#>    [5]   147 ATGCCACAATTAGTTCCATTTT...TTATTTATTTCTAAATTATAA Q0080
#>    ...   ... ...
#> [6530]   393 ATGGTAAGTTTCATAACGTCTA...AAATTGATTGTTAGTGGTTGA YPR200C
#> [6531]  1215 ATGTCAGAAGATCAAAAAAGTG...ATATGGAACAATAGAAATTAA YPR201W
#> [6532]  1157 ATGGAAATTGAAAACGAACGTA...CATGTGTGCTGCCCAAGTTGA YPR202W
#> [6533]   483 ATGATGCCTGCTAAACTGCAGC...CAGCATTCTCTCCACAGCTAG YPR204C-A
#> [6534]  3099 ATGGCAGACACACCCTCTGTGG...AGCAGAGAAGTTGGAGAGTGA YPR204W

For human studies, the same concepts apply.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb = TxDb.Hsapiens.UCSC.hg19.knownGene
transcripts(txdb)
#> GRanges object with 82960 ranges and 2 metadata columns:
#>                 seqnames        ranges strand |     tx_id     tx_name
#>                    <Rle>     <IRanges>  <Rle> | <integer> <character>
#>       [1]           chr1   11874-14409      + |         1  uc001aaa.3
#>       [2]           chr1   11874-14409      + |         2  uc010nxq.1
#>       [3]           chr1   11874-14409      + |         3  uc010nxr.1
#>       [4]           chr1   69091-70008      + |         4  uc001aal.1
#>       [5]           chr1 321084-321115      + |         5  uc001aaq.2
#>       ...            ...           ...    ... .       ...         ...
#>   [82956] chrUn_gl000237        1-2686      - |     82956  uc011mgu.1
#>   [82957] chrUn_gl000241   20433-36875      - |     82957  uc011mgv.2
#>   [82958] chrUn_gl000243   11501-11530      + |     82958  uc011mgw.1
#>   [82959] chrUn_gl000243   13608-13637      + |     82959  uc022brq.1
#>   [82960] chrUn_gl000247     5787-5816      - |     82960  uc022brr.1
#>   -------
#>   seqinfo: 93 sequences (1 circular) from hg19 genome

Using a transcript catalog to visualize RNA-seq data

We’ve pre-packaged a slice of the human transcriptome (for chr14) in the workshop-based package biocwk312. This slice was massaged for visualization by functions in the TnT package.

library(biocwk312)
library(TnT)
data(tnt_txtr_14)
tnt_txtr_14
#> A TxTrack 
#> | Label: TxDb.Hsapiens.UCSC.hg19.knownGene
#> | Background:    missing, use 'white'
#> | Height:    300
#> | Data:  
#> |       seqnames     start       end     width   strand     tx_id     tx_name
#> |       <factor> <integer> <integer> <integer> <factor> <integer> <character>
#> |  1       chr14  19377594  19378574       981        +     50694  uc010tkp.2
#> |  2       chr14  19553365  19584942     31578        +     50695  uc001vuz.1
#> |  3       chr14  19553365  19584942     31578        +     50696  uc001vva.1
#> |  4       chr14  19553365  19584942     31578        +     50697  uc010ahc.1
#> |  5       chr14  19595544  19595584        41        +     50698  uc021rmy.1
#> |  ...       ...       ...       ...       ...      ...       ...         ...
#> |  2854    chr14 106576814 106598011     21198        -     53547  uc001ysv.3
#> |  2855    chr14 106804495 106805235       741        -     53548  uc001ysw.1
#> |  2856    chr14 106829593 106833706      4114        -     53549  uc001ysx.1
#> |  2857    chr14 107034169 107035191      1023        -     53550  uc001ysz.3
#> |  2858    chr14 107258705 107259792      1088        -     53551  uc001yta.1
#> |                                                                                         exons
#> |                                                                                        <list>
#> |  1                                        19377594:19378574:0:...,19377594:19378574:0:...,...
#> |  2          19553365:19553937:0:...,19558717:19558831:5352:...,19558991:19559164:5626:...,...
#> |  3          19553365:19553937:0:...,19558717:19558831:5352:...,19558991:19559164:5626:...,...
#> |  4          19553365:19553937:0:...,19558717:19558831:5352:...,19558991:19559164:5626:...,...
#> |  5                                                                    19595544:19595584:0:...
#> |  ...                                                                                      ...
#> |  2854 106576814:106577071:0:...,106584837:106585512:8023:...,106586019:106586438:9205:...,...
#> |  2855                               106804495:106804605:0:...,106805193:106805235:698:...,...
#> |  2856 106829593:106829895:0:...,106830971:106831203:1378:...,106832176:106832269:2583:...,...
#> |  2857   107034169:107035033:0:...,107035117:107035191:948:...,107034693:107035033:524:...,...
#> |  2858  107258705:107259632:0:...,107259716:107259792:1011:...,107259316:107259632:611:...,...
#> |             display_label       key          id tooltip.tx_id tooltip.tx_name
#> |               <character> <integer> <character>     <integer>     <character>
#> |  1    440153 uc010tkp.2 >     50694       50694         50694      uc010tkp.2
#> |  2    404785 uc001vuz.1 >     50695       50695         50695      uc001vuz.1
#> |  3    404785 uc001vva.1 >     50696       50696         50696      uc001vva.1
#> |  4    404785 uc010ahc.1 >     50697       50697         50697      uc010ahc.1
#> |  5           uc021rmy.1 >     50698       50698         50698      uc021rmy.1
#> |  ...                  ...       ...         ...           ...             ...
#> |  2854        < uc001ysv.3     53547       53547         53547      uc001ysv.3
#> |  2855        < uc001ysw.1     53548       53548         53548      uc001ysw.1
#> |  2856        < uc001ysx.1     53549       53549         53549      uc001ysx.1
#> |  2857        < uc001ysz.3     53550       53550         53550      uc001ysz.3
#> |  2858        < uc001yta.1     53551       53551         53551      uc001yta.1
#> |             color
#> |       <character>
#> |  1            red
#> |  2            red
#> |  3            red
#> |  4            red
#> |  5            red
#> |  ...          ...
#> |  2854         red
#> |  2855         red
#> |  2856         red
#> |  2857         red
#> |  2858         red

Bioconductor has a number of packages that supply aligned RNA-seq reads in BAM format. We’ll use one that investigated the effects of knocking down a gene called HNRNPC in HeLa cells.

A character vector provides the paths to the BAM files:

length(RNAseqData.HNRNPC.bam.chr14_BAMFILES)
#> [1] 8
RNAseqData.HNRNPC.bam.chr14_BAMFILES[1]
#>                                                                               ERR127306 
#> "/usr/local/lib/R/site-library/RNAseqData.HNRNPC.bam.chr14/extdata/ERR127306_chr14.bam"

The 8 files are grouped: the first four are for wild-type cells, the next four are for the cells treated to knock down HNRNPC.

labels = c("wt1", "wt2", "wt3", "wt4", "kd1", "kd2", "kd3", "kd4")

The GenomicAlignments package provides functions to parse and ingest information from BAM files.

library(GenomicAlignments)
readGAlignments(RNAseqData.HNRNPC.bam.chr14_BAMFILES[1])
#> GAlignments object with 800484 alignments and 0 metadata columns:
#>            seqnames strand       cigar    qwidth     start       end     width
#>               <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
#>        [1]    chr14      +         72M        72  19069583  19069654        72
#>        [2]    chr14      +         72M        72  19363738  19363809        72
#>        [3]    chr14      -         72M        72  19363755  19363826        72
#>        [4]    chr14      +         72M        72  19369799  19369870        72
#>        [5]    chr14      -         72M        72  19369828  19369899        72
#>        ...      ...    ...         ...       ...       ...       ...       ...
#>   [800480]    chr14      -         72M        72 106989780 106989851        72
#>   [800481]    chr14      +         72M        72 106994763 106994834        72
#>   [800482]    chr14      -         72M        72 106994819 106994890        72
#>   [800483]    chr14      +         72M        72 107003080 107003151        72
#>   [800484]    chr14      -         72M        72 107003171 107003242        72
#>                njunc
#>            <integer>
#>        [1]         0
#>        [2]         0
#>        [3]         0
#>        [4]         0
#>        [5]         0
#>        ...       ...
#>   [800480]         0
#>   [800481]         0
#>   [800482]         0
#>   [800483]         0
#>   [800484]         0
#>   -------
#>   seqinfo: 93 sequences from an unspecified genome

We’ll now select four of the BAM files to display coverage in the vicinity of the knocked-down gene. The function linetrack_covg_by_symbol was written specifically to combine information on reads with positions of genes and transcripts.

allfi = lapply(c(1,2,5,6), function(x) {
     curf = RNAseqData.HNRNPC.bam.chr14_BAMFILES[x]
     linetrack_covg_by_symbol("HNRNPC", curf, 
             color="gray", radius=3e5, label=labels[x])
     })
mytr = c(allfi, tnt_gt_sym, tnt_txtr_14)
TnTGenome(mytr)
#> - Missing argument `view.range`:
#>   automatically select 21373939..22105960 on seqlevel chr14...

AnnotationHub example: Epigenetically defined functional annotation

Finally, the AnnotationHub package defines a database with many types of biological annotation. For our last example in this section, we consider how to obtain the ChromHmm cell type annotation for diverse cell types. The AnnotationHub can be updated in real time. We obtain an image of the hub, and then “query”.

library(AnnotationHub)
ah = AnnotationHub()
#> using temporary cache /tmp/RtmpU3su3v/BiocFileCache
#> snapshotDate(): 2020-10-27
chq = query(ah, "ChromHmm")
chq
#> AnnotationHub with 128 records
#> # snapshotDate(): 2020-10-27
#> # $dataprovider: BroadInstitute, UCSC
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH5202"]]' 
#> 
#>             title                             
#>   AH5202  | Broad ChromHMM                    
#>   AH46856 | E001_15_coreMarks_mnemonics.bed.gz
#>   AH46857 | E002_15_coreMarks_mnemonics.bed.gz
#>   AH46858 | E003_15_coreMarks_mnemonics.bed.gz
#>   AH46859 | E004_15_coreMarks_mnemonics.bed.gz
#>   ...       ...                               
#>   AH46978 | E125_15_coreMarks_mnemonics.bed.gz
#>   AH46979 | E126_15_coreMarks_mnemonics.bed.gz
#>   AH46980 | E127_15_coreMarks_mnemonics.bed.gz
#>   AH46981 | E128_15_coreMarks_mnemonics.bed.gz
#>   AH46982 | E129_15_coreMarks_mnemonics.bed.gz

The result here can be very large and so an abbreviated view is dumped. To learn about the cell types studied, we use a technical call that takes advantage of knowledge that mcols on a query result gives a table of features for each resource.

mcols(chq)["AH46861",]$tags
#> $AH46861
#> [1] "EpigenomeRoadMap"                  "chromhmmSegmentations"            
#> [3] "ChmmModels"                        "coreMarks"                        
#> [5] "E006"                              "ES-deriv"                         
#> [7] "ESDR.H1.MSC"                       "H1 Derived Mesenchymal Stem Cells"
head(table(vapply(mcols(chq)[-1,]$tags, 
   function(x)x[8],  character(1))))
#> 
#>           A549 EtOH 0.02pct Lung Carcinoma Cell Line 
#>                                                    1 
#> Adipose Derived Mesenchymal Stem Cell Cultured Cells 
#>                                                    1 
#>                                       Adipose Nuclei 
#>                                                    1 
#>                                                Aorta 
#>                                                    1 
#>  Bone Marrow Derived Cultured Mesenchymal Stem Cells 
#>                                                    1 
#>                                  Brain Angular Gyrus 
#>                                                    1

Now we can acquire the ChromHmm labeling of genomic regions for the H1-derived mesenchymal stem cells:

h1msc = ah[["AH46861"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
h1msc
#> GRanges object with 538000 ranges and 4 metadata columns:
#>            seqnames            ranges strand |        abbr                name
#>               <Rle>         <IRanges>  <Rle> | <character>         <character>
#>        [1]    chr10           1-93200      * |    15_Quies       Quiescent/Low
#>        [2]    chr10       93201-95200      * |      5_TxWk  Weak transcription
#>        [3]    chr10       95201-95400      * |  8_ZNF/Rpts ZNF genes & repeats
#>        [4]    chr10      95401-119400      * |    15_Quies       Quiescent/Low
#>        [5]    chr10     119401-119600      * |       7_Enh           Enhancers
#>        ...      ...               ...    ... .         ...                 ...
#>   [537996]     chrY 59027601-59027800      * |  8_ZNF/Rpts ZNF genes & repeats
#>   [537997]     chrY 59027801-59029000      * |       9_Het     Heterochromatin
#>   [537998]     chrY 59029001-59029200      * |  8_ZNF/Rpts ZNF genes & repeats
#>   [537999]     chrY 59029201-59033400      * |       9_Het     Heterochromatin
#>   [538000]     chrY 59033401-59373400      * |    15_Quies       Quiescent/Low
#>                   color_name  color_code
#>                  <character> <character>
#>        [1]             White     #FFFFFF
#>        [2]         DarkGreen     #006400
#>        [3] Medium Aquamarine     #66CDAA
#>        [4]             White     #FFFFFF
#>        [5]            Yellow     #FFFF00
#>        ...               ...         ...
#>   [537996] Medium Aquamarine     #66CDAA
#>   [537997]     PaleTurquoise     #8A91D0
#>   [537998] Medium Aquamarine     #66CDAA
#>   [537999]     PaleTurquoise     #8A91D0
#>   [538000]             White     #FFFFFF
#>   -------
#>   seqinfo: 298 sequences (2 circular) from hg19 genome
table(h1msc$name)
#> 
#>                 Active TSS          Bivalent Enhancer 
#>                      16526                      14167 
#>        Bivalent/Poised TSS                  Enhancers 
#>                       1905                      98782 
#>        Flanking Active TSS  Flanking Bivalent TSS/Enh 
#>                      18643                       4990 
#>            Genic enhancers            Heterochromatin 
#>                      29466                      45747 
#>              Quiescent/Low         Repressed PolyComb 
#>                      75063                      24125 
#>       Strong transcription Transcr. at gene 5' and 3' 
#>                      39066                       2292 
#>    Weak Repressed PolyComb         Weak transcription 
#>                      40779                     115876 
#>        ZNF genes & repeats 
#>                      10573

The Cancer Genome Atlas, a paradigm of federated multiomics

The curatedTCGAData package provides convenient access to all open data in TCGA. Here we’ll request available data on tumor mutation profiles and RNA-seq studies available for breast cancer (BRCA).

library(curatedTCGAData)
brca_demo = curatedTCGAData("BRCA", c("Mutation", "RNASeq2GeneNorm"),
   dry.run=FALSE)
#> snapshotDate(): 2020-10-02
#> Working on: BRCA_Mutation-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> Working on: BRCA_RNASeq2GeneNorm-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> Loading required package: RaggedExperiment
#> Working on: BRCA_colData-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> Working on: BRCA_metadata-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> Working on: BRCA_sampleMap-20160128
#> see ?curatedTCGAData and browseVignettes('curatedTCGAData') for documentation
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
#> harmonizing input:
#>   removing 12168 sampleMap rows not in names(experiments)
#>   removing 2 colData rownames not in sampleMap 'primary'
brca_demo
#> A MultiAssayExperiment object of 2 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 2:
#>  [1] BRCA_Mutation-20160128: RaggedExperiment with 90490 rows and 993 columns
#>  [2] BRCA_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 1212 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save all data to files

The MultiAssayExperiment structure is described in a JCO CCI paper.
Timestamped tags are provided for each assay type.

names(experiments(brca_demo))
#> [1] "BRCA_Mutation-20160128"        "BRCA_RNASeq2GeneNorm-20160128"

Mutations in a GRangesList

We can get immediate access to genomic coordinates and contents of tumor mutations by coercing the mutations component to GRangesList. There is one list element per participant.

grm = as(experiments(brca_demo)[["BRCA_Mutation-20160128"]], "GRangesList")
names(grm)[1:3]
#> [1] "TCGA-A1-A0SB-01A-11D-A142-09" "TCGA-A1-A0SD-01A-11D-A10Y-09"
#> [3] "TCGA-A1-A0SE-01A-11D-A099-09"
grm[[1]][1:4,1:8]
#> GRanges object with 4 ranges and 8 metadata columns:
#>       seqnames    ranges strand | Hugo_Symbol Entrez_Gene_Id           Center
#>          <Rle> <IRanges>  <Rle> | <character>    <character>      <character>
#>   [1]       10 116247760      + |      ABLIM1              0 genome.wustl.edu
#>   [2]       12  43944926      + |    ADAMTS20              0 genome.wustl.edu
#>   [3]        3  85932472      + |       CADM2              0 genome.wustl.edu
#>   [4]        2  25678299      + |        DTNB              0 genome.wustl.edu
#>        NCBI_Build Variant_Classification Variant_Type Reference_Allele
#>       <character>            <character>  <character>      <character>
#>   [1]          37      Missense_Mutation          SNP                T
#>   [2]          37      Missense_Mutation          SNP                T
#>   [3]          37                 Silent          SNP                C
#>   [4]          37      Missense_Mutation          SNP                C
#>       Tumor_Seq_Allele1
#>             <character>
#>   [1]                 T
#>   [2]                 T
#>   [3]                 C
#>   [4]                 C
#>   -------
#>   seqinfo: 26 sequences from 37 genome; no seqlengths

A SummarizedExperiment for RNA-seq

rnaseq = experiments(brca_demo)[["BRCA_RNASeq2GeneNorm-20160128"]]
rnaseq
#> class: SummarizedExperiment 
#> dim: 20501 1212 
#> metadata(0):
#> assays(1): ''
#> rownames(20501): A1BG A1CF ... psiTPTE22 tAKR
#> rowData names(0):
#> colnames(1212): TCGA-3C-AAAU-01A-11R-A41B-07
#>   TCGA-3C-AALI-01A-11R-A41B-07 ... TCGA-Z7-A8R5-01A-42R-A41B-07
#>   TCGA-Z7-A8R6-01A-11R-A41B-07
#> colData names(0):
head(names(colData(rnaseq)))
#> character(0)

The OSCA book, a paradigm of computable scientific literature

  • OSCA book
  • probably need 1b on scRNA, one part on just the basics, on on how to work with the clusters you find