Introduction

The Cancer Genome Atlas (TCGA) collects information on thousands of cases of tumors from 68 anatomical sites. Several approaches to working with TCGA are available in Bioconductor. We’ll focus on the package developed by the Waldron Lab at City University of New York.

library(curatedTCGAData)

The following excerpt from the manual page for the curatedTCGAData function defines the scope of genomic assays available for most tumor types.

Available Assays:

     Below is a list of partial ExperimentList assay names and their
     respective description. These assays can be entered as part of the
     'assays' argument in the main function. Partial glob matches are
     allowed such as: ''CN*'' for "CNASeq", "CNASNP", "CNVSNP" assays.
     Credit: Ludwig G.

     ExperimentList data types   Description
     ----------------------------------------------------------------------------
     SummarizedExperiment*
       RNASeqGene                Gene expression values
       RNASeq2Gene               RSEM TPM gene expression values
       RNASeq2GeneNorm           Upper quartile normalized RSEM TPM gene
                                 expression values
       miRNAArray                Probe-level  miRNA expression values
       miRNASeqGene              Gene-level log2 RPM miRNA expression values
       mRNAArray                 Unified gene-level mRNA expression values
       mRNAArray_huex            Gene-level mRNA expression values from Affymetrix
                                 Human Exon Array
       mRNAArray_TX_g4502a       Gene-level mRNA expression values from Agilent
                                 244K Array
       mRNAArray_TX_ht_hg_u133a  Gene-level mRNA expression values from Affymetrix
                                 Human Genome U133 Array
       GISTIC_AllByGene          Gene-level GISTIC2 copy number values
       GISTIC_ThresholdedByGene  Gene-level GISTIC2 thresholded discrete copy
                                 number values
       RPPAArray                 Reverse Phase Protein Array normalized protein
                                 expression values
     RangedSummarizedExperiment
       GISTIC_Peaks              GISTIC2 thresholded discrete copy number values
                                 in recurrent peak regions
     SummarizedExperiment with HDF5Array DelayedMatrix
       Methylation_methyl27      Probe-level methylation beta values from Illumina
                                 HumanMethylation 27K BeadChip
       Methylation_methyl450     Probe-level methylation beta values from Infinium
                                 HumanMethylation 450K BeadChip
     RaggedExperiment
       CNASNP                    Segmented somatic Copy Number Alteration calls
                                 from SNP array
       CNVSNP                    Segmented germline Copy Number Variant calls from
                                 SNP Array
       CNASeq                    Segmented somatic Copy Number Alteration calls
                                 from low pass DNA Sequencing
       Mutation*                 Somatic mutations calls
       CNACGH_CGH_hg_244a        Segmented somatic Copy Number Alteration calls
                                 from CGH Agilent Microarray 244A
       CNACGH_CGH_hg_415k_g4124a Segmented somatic Copy Number Alteration calls
                                 from CGH Agilent Microarray 415K
     * All can be converted to RangedSummarizedExperiment (except RPPAArray) with
     TCGAutils
     
version:

     The new version 2.0.1 includes various improvements including an
     additional assay that provides 'RNASeq2Gene' data as RSEM TPM gene
     expression values (issue #38). Additional changes include genomic

Mutations in the BRCA collection

library(curatedTCGAData)
library(TCGAutils)
br = curatedTCGAData("BRCA", "Mutation", version="2.0.1", dry.run=FALSE)
brp = TCGAprimaryTumors(br)  # remove some normal tissue samples
mut = experiments(brp)[[1]]
mut
## class: RaggedExperiment 
## dim: 90490 988 
## assays(62): Hugo_Symbol Entrez_Gene_Id ... EVS_AA EVS_All
## rownames: NULL
## colnames(988): TCGA-A1-A0SB-01A-11D-A142-09
##   TCGA-A1-A0SD-01A-11D-A10Y-09 ... TCGA-PE-A5DD-01A-12D-A27P-09
##   TCGA-PE-A5DE-01A-11D-A27P-09
## colData names(0):

mut is an instance of the RaggedExperiment class. This class was devised to accommodate data in which there is wide variation in the number of features recorded on each patient. RaggedExperiment is also unusual in that there are many assay components.

## [1] 62
head(assayNames(mut),10)
##  [1] "Hugo_Symbol"            "Entrez_Gene_Id"         "Center"                
##  [4] "NCBI_Build"             "Variant_Classification" "Variant_Type"          
##  [7] "Reference_Allele"       "Tumor_Seq_Allele1"      "Tumor_Seq_Allele2"     
## [10] "dbSNP_RS"

Distributions of mutation counts

This computation is unwieldy. We just record the gene names and number of times some mutation is reported for some sample.

tail(sort(table(as.character(assay(mut)))),20)

   SPEN   FCGBP    RYR3     DMD CROCCP2   HMCN1     DST     FLG   SYNE1    RYR2 
     61      63      63      64      66      67      69      69      74      76 
   MLL3    MUC4   MUC12   GATA3  MAP3K1    CDH1   MUC16    TP53     TTN  PIK3CA 
     92      92     101     106     106     123     152     332     360     384 

The distribution of counts of mutations per individual:

nmut_per_sample = apply(assay(mut),2,function(x)sum(!is.na(x)))
summary(nmut_per_sample)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   28.00   43.50   91.27   79.00 5707.00

Identifying hypermutated samples

A hypermutator:

which.max(nmut_per_sample)
## TCGA-AN-A046-01A-21W-A050-09 
##                          285
table(table(assay(mut[,285])))
## 
##    1    2    3    4    5    6    7    8    9   10   15   36 
## 3252  709  188   47   20   13    3    2    1    1    1    1
which(table(assay(mut[,285]))==36)
##  TTN 
## 3795

Several:

hyp = tail(sort(nmut_per_sample))
himut = mut[,names(hyp)]
himut
## class: RaggedExperiment 
## dim: 90490 6 
## assays(62): Hugo_Symbol Entrez_Gene_Id ... EVS_AA EVS_All
## rownames: NULL
## colnames(6): TCGA-A8-A09Z-01A-11W-A019-09 TCGA-BH-A18G-01A-11D-A12B-09
##   ... TCGA-AC-A23H-01A-11D-A159-09 TCGA-AN-A046-01A-21W-A050-09
## colData names(0):

This individual presents single mutations in many genes.

table(table(assay(himut[,1])))
## 
##    1    2    3 
## 1258   87    2

This one has 4 genes with more than 10 mutations in each.

table(table(assay(himut[,4])))
## 
##    1    2    3    4    5    6    7    8   12   18   19   20 
## 1984  340   79   27    5    6    3    2    1    1    1    1

Mutation types: Insertion, deletion, …

We include some helper code here to produce comparisons of distributions of mutation types in genes mutated in breast cancer.

library(MultiAssayExperiment)
cropSE = function (se, sym = "AHNAK", 
  base = ensembldb::genes(EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75), gr=NULL) {
    rr = rowRanges(se)
    if (is.null(gr)) cr = base[which(base$symbol == sym)]
    else cr = gr
    fi = findOverlaps(rr, cr, ignore.strand = TRUE)
    se[queryHits(fi), ]
}

.muttab = function(se, sym="AHNAK", 
   base = ensembldb::genes(EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75),
   assayind=5) {
 cr = cropSE(se, sym, base)
 tab = table(assay(cr,assayind))
 data.frame(gene=sym, muttype=names(tab), freq=as.numeric(tab))
}

muttab = function(se, sym=c("MAP3K1", "CDH1", "MUC16", "TP53", "TTN", "PIK3CA"), 
   base = ensembldb::genes(EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75),
   assayind=5) {
  ans = lapply(sym, function(x) .muttab(se, x, base, assayind))
  do.call(rbind, ans)
}

Here we produce an interactive plot. You can hover over the bars to see text on type and count for each mutation type for each gene.

dd = muttab(mut)
library(ggplot2)
p = ggplot(dd, aes(fill=muttype, y=freq, x=gene)) + 
   geom_bar(stat="identity", position="dodge")
library(plotly)
ggplotly(p)

Identifying a mutated exon

We limit attention to the mutations that have been annotated to PIK3CA.

pi = cropSE(mut, "PIK3CA")
pi
## class: RaggedExperiment 
## dim: 387 988 
## assays(62): Hugo_Symbol Entrez_Gene_Id ... EVS_AA EVS_All
## rownames: NULL
## colnames(988): TCGA-A1-A0SB-01A-11D-A142-09
##   TCGA-A1-A0SD-01A-11D-A10Y-09 ... TCGA-PE-A5DD-01A-12D-A27P-09
##   TCGA-PE-A5DE-01A-11D-A27P-09
## colData names(0):

The catalog of all exons:

ex = exons(EnsDb.Hsapiens.v75)
ex
## GRanges object with 745593 ranges and 1 metadata column:
##                   seqnames            ranges strand |         exon_id
##                      <Rle>         <IRanges>  <Rle> |     <character>
##   ENSE00002234944        1       11869-12227      + | ENSE00002234944
##   ENSE00002234632        1       11872-12227      + | ENSE00002234632
##   ENSE00002269724        1       11874-12227      + | ENSE00002269724
##   ENSE00001948541        1       12010-12057      + | ENSE00001948541
##   ENSE00001671638        1       12179-12227      + | ENSE00001671638
##               ...      ...               ...    ... .             ...
##   ENSE00001741452        Y 28774418-28774584      - | ENSE00001741452
##   ENSE00001681574        Y 28776794-28776896      - | ENSE00001681574
##   ENSE00001638296        Y 28779492-28779578      - | ENSE00001638296
##   ENSE00001797328        Y 28780670-28780799      - | ENSE00001797328
##   ENSE00001794473        Y 59001391-59001635      + | ENSE00001794473
##   -------
##   seqinfo: 273 sequences (1 circular) from GRCh37 genome

The findOverlaps methods work with a “query” and a “subject”, and produce a “hits” object, which has methods “queryHits” and “subjectHits”.

fo = findOverlaps(ex, rowRanges(pi)) 
table(queryHits(fo))
## 
## 483978 483979 483980 483982 483983 483984 483985 483986 483987 483989 483990 
##     13     13      5     23      5      1     14      1    132      4      4 
## 483993 483994 483995 483996 483997 483998 484002 484003 484004 
##     10     10      2      2      1      1      2      3    171

Here’s an exon within PIK3CA within which many mutations were observed:

ex[483987]
## GRanges object with 1 range and 1 metadata column:
##                   seqnames              ranges strand |         exon_id
##                      <Rle>           <IRanges>  <Rle> |     <character>
##   ENSE00001077674        3 178935998-178936122      + | ENSE00001077674
##   -------
##   seqinfo: 273 sequences (1 circular) from GRCh37 genome
lit = cropSE(mut, gr=ex[483987])
table(start(rowRanges(lit)))
## 
## 178936074 178936082 178936083 178936091 178936092 178936094 178936095 
##         2        42         2        70         5         4         7

Several positions are found recurrently mutated.

Exercise

What is the most commonly mutated position in the breast cancer genomes in TCGA?

Mutation in relation to survival times

Exercise

Use example(build_surv_for_mut) to compare survival profiles for individuals with and without TTN mutation.

Produce a similar display comparing individuals with and without PIK3CA mutation.