vignettes/B1_TCGA.Rmd
B1_TCGA.Rmd
The Cancer Genome Atlas (TCGA) collects information on thousands of cases of tumors from 68 anatomical sites. Several approaches to working with TCGA are available in Bioconductor. We’ll focus on the package developed by the Waldron Lab at City University of New York.
library(curatedTCGAData)
The following excerpt from the manual page for the curatedTCGAData
function defines the scope of genomic assays available for most tumor types.
Available Assays:
Below is a list of partial ExperimentList assay names and their
respective description. These assays can be entered as part of the
'assays' argument in the main function. Partial glob matches are
allowed such as: ''CN*'' for "CNASeq", "CNASNP", "CNVSNP" assays.
Credit: Ludwig G.
ExperimentList data types Description
----------------------------------------------------------------------------
SummarizedExperiment*
RNASeqGene Gene expression values
RNASeq2Gene RSEM TPM gene expression values
RNASeq2GeneNorm Upper quartile normalized RSEM TPM gene
expression values
miRNAArray Probe-level miRNA expression values
miRNASeqGene Gene-level log2 RPM miRNA expression values
mRNAArray Unified gene-level mRNA expression values
mRNAArray_huex Gene-level mRNA expression values from Affymetrix
Human Exon Array
mRNAArray_TX_g4502a Gene-level mRNA expression values from Agilent
244K Array
mRNAArray_TX_ht_hg_u133a Gene-level mRNA expression values from Affymetrix
Human Genome U133 Array
GISTIC_AllByGene Gene-level GISTIC2 copy number values
GISTIC_ThresholdedByGene Gene-level GISTIC2 thresholded discrete copy
number values
RPPAArray Reverse Phase Protein Array normalized protein
expression values
RangedSummarizedExperiment
GISTIC_Peaks GISTIC2 thresholded discrete copy number values
in recurrent peak regions
SummarizedExperiment with HDF5Array DelayedMatrix
Methylation_methyl27 Probe-level methylation beta values from Illumina
HumanMethylation 27K BeadChip
Methylation_methyl450 Probe-level methylation beta values from Infinium
HumanMethylation 450K BeadChip
RaggedExperiment
CNASNP Segmented somatic Copy Number Alteration calls
from SNP array
CNVSNP Segmented germline Copy Number Variant calls from
SNP Array
CNASeq Segmented somatic Copy Number Alteration calls
from low pass DNA Sequencing
Mutation* Somatic mutations calls
CNACGH_CGH_hg_244a Segmented somatic Copy Number Alteration calls
from CGH Agilent Microarray 244A
CNACGH_CGH_hg_415k_g4124a Segmented somatic Copy Number Alteration calls
from CGH Agilent Microarray 415K
* All can be converted to RangedSummarizedExperiment (except RPPAArray) with
TCGAutils
version:
The new version 2.0.1 includes various improvements including an
additional assay that provides 'RNASeq2Gene' data as RSEM TPM gene
expression values (issue #38). Additional changes include genomic
library(curatedTCGAData)
library(TCGAutils)
br = curatedTCGAData("BRCA", "Mutation", version="2.0.1", dry.run=FALSE)
brp = TCGAprimaryTumors(br) # remove some normal tissue samples
mut = experiments(brp)[[1]]
mut
## class: RaggedExperiment
## dim: 90490 988
## assays(62): Hugo_Symbol Entrez_Gene_Id ... EVS_AA EVS_All
## rownames: NULL
## colnames(988): TCGA-A1-A0SB-01A-11D-A142-09
## TCGA-A1-A0SD-01A-11D-A10Y-09 ... TCGA-PE-A5DD-01A-12D-A27P-09
## TCGA-PE-A5DE-01A-11D-A27P-09
## colData names(0):
mut
is an instance of the RaggedExperiment class. This class was devised to accommodate data in which there is wide variation in the number of features recorded on each patient. RaggedExperiment is also unusual in that there are many assay components.
length(assayNames(mut))
## [1] 62
head(assayNames(mut),10)
## [1] "Hugo_Symbol" "Entrez_Gene_Id" "Center"
## [4] "NCBI_Build" "Variant_Classification" "Variant_Type"
## [7] "Reference_Allele" "Tumor_Seq_Allele1" "Tumor_Seq_Allele2"
## [10] "dbSNP_RS"
This computation is unwieldy. We just record the gene names and number of times some mutation is reported for some sample.
tail(sort(table(as.character(assay(mut)))),20)
SPEN FCGBP RYR3 DMD CROCCP2 HMCN1 DST FLG SYNE1 RYR2
61 63 63 64 66 67 69 69 74 76
MLL3 MUC4 MUC12 GATA3 MAP3K1 CDH1 MUC16 TP53 TTN PIK3CA
92 92 101 106 106 123 152 332 360 384
The distribution of counts of mutations per individual:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 28.00 43.50 91.27 79.00 5707.00
A hypermutator:
which.max(nmut_per_sample)
## TCGA-AN-A046-01A-21W-A050-09
## 285
##
## 1 2 3 4 5 6 7 8 9 10 15 36
## 3252 709 188 47 20 13 3 2 1 1 1 1
## TTN
## 3795
Several:
## class: RaggedExperiment
## dim: 90490 6
## assays(62): Hugo_Symbol Entrez_Gene_Id ... EVS_AA EVS_All
## rownames: NULL
## colnames(6): TCGA-A8-A09Z-01A-11W-A019-09 TCGA-BH-A18G-01A-11D-A12B-09
## ... TCGA-AC-A23H-01A-11D-A159-09 TCGA-AN-A046-01A-21W-A050-09
## colData names(0):
This individual presents single mutations in many genes.
##
## 1 2 3
## 1258 87 2
This one has 4 genes with more than 10 mutations in each.
##
## 1 2 3 4 5 6 7 8 12 18 19 20
## 1984 340 79 27 5 6 3 2 1 1 1 1
We include some helper code here to produce comparisons of distributions of mutation types in genes mutated in breast cancer.
library(MultiAssayExperiment)
cropSE = function (se, sym = "AHNAK",
base = ensembldb::genes(EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75), gr=NULL) {
rr = rowRanges(se)
if (is.null(gr)) cr = base[which(base$symbol == sym)]
else cr = gr
fi = findOverlaps(rr, cr, ignore.strand = TRUE)
se[queryHits(fi), ]
}
.muttab = function(se, sym="AHNAK",
base = ensembldb::genes(EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75),
assayind=5) {
cr = cropSE(se, sym, base)
tab = table(assay(cr,assayind))
data.frame(gene=sym, muttype=names(tab), freq=as.numeric(tab))
}
muttab = function(se, sym=c("MAP3K1", "CDH1", "MUC16", "TP53", "TTN", "PIK3CA"),
base = ensembldb::genes(EnsDb.Hsapiens.v75::EnsDb.Hsapiens.v75),
assayind=5) {
ans = lapply(sym, function(x) .muttab(se, x, base, assayind))
do.call(rbind, ans)
}
Here we produce an interactive plot. You can hover over the bars to see text on type and count for each mutation type for each gene.
We limit attention to the mutations that have been annotated to PIK3CA.
pi = cropSE(mut, "PIK3CA")
pi
## class: RaggedExperiment
## dim: 387 988
## assays(62): Hugo_Symbol Entrez_Gene_Id ... EVS_AA EVS_All
## rownames: NULL
## colnames(988): TCGA-A1-A0SB-01A-11D-A142-09
## TCGA-A1-A0SD-01A-11D-A10Y-09 ... TCGA-PE-A5DD-01A-12D-A27P-09
## TCGA-PE-A5DE-01A-11D-A27P-09
## colData names(0):
The catalog of all exons:
ex = exons(EnsDb.Hsapiens.v75)
ex
## GRanges object with 745593 ranges and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <character>
## ENSE00002234944 1 11869-12227 + | ENSE00002234944
## ENSE00002234632 1 11872-12227 + | ENSE00002234632
## ENSE00002269724 1 11874-12227 + | ENSE00002269724
## ENSE00001948541 1 12010-12057 + | ENSE00001948541
## ENSE00001671638 1 12179-12227 + | ENSE00001671638
## ... ... ... ... . ...
## ENSE00001741452 Y 28774418-28774584 - | ENSE00001741452
## ENSE00001681574 Y 28776794-28776896 - | ENSE00001681574
## ENSE00001638296 Y 28779492-28779578 - | ENSE00001638296
## ENSE00001797328 Y 28780670-28780799 - | ENSE00001797328
## ENSE00001794473 Y 59001391-59001635 + | ENSE00001794473
## -------
## seqinfo: 273 sequences (1 circular) from GRCh37 genome
The findOverlaps methods work with a “query” and a “subject”, and produce a “hits” object, which has methods “queryHits” and “subjectHits”.
fo = findOverlaps(ex, rowRanges(pi))
table(queryHits(fo))
##
## 483978 483979 483980 483982 483983 483984 483985 483986 483987 483989 483990
## 13 13 5 23 5 1 14 1 132 4 4
## 483993 483994 483995 483996 483997 483998 484002 484003 484004
## 10 10 2 2 1 1 2 3 171
Here’s an exon within PIK3CA within which many mutations were observed:
ex[483987]
## GRanges object with 1 range and 1 metadata column:
## seqnames ranges strand | exon_id
## <Rle> <IRanges> <Rle> | <character>
## ENSE00001077674 3 178935998-178936122 + | ENSE00001077674
## -------
## seqinfo: 273 sequences (1 circular) from GRCh37 genome
##
## 178936074 178936082 178936083 178936091 178936092 178936094 178936095
## 2 42 2 70 5 4 7
Several positions are found recurrently mutated.