vignettes/BiocAlphaMis.Rmd
BiocAlphaMis.Rmd
AlphaMissense (Science 2023) is a machine learning procedure for inferring pathogenicity of single nucleotide substitutions in the human proteome. Text files are distributed by the authors, and these have been tabix-indexed for use in this package. Note that the pathogenicity data are produced under a license that forbids commercial use.
The package described here can currently be installed using
BiocManager::install("vjcitn/BiocAlphaMis")
The following code draws the first 10 records and places them in a data.frame. Note that 600 MB of bgzipped text and a tabix index will be retrieved and cached on first attempt.
library(BiocAlphaMis)
library(Rsamtools)
amis_txf = get_alphamis_txf(build="hg38")
amis_txf
## class: TabixFile
## path: /home/vincent/.cache/R/BiocFileC.../a86b4ba7d3c4_AlphaMissense_hg38.tsv.gz
## index: /home/vincent/.cache/R/Bioc.../a86b74da37cb_AlphaMissense_hg38.tsv.gz.tbi
## isOpen: FALSE
## yieldSize: NA
yieldSize(amis_txf) = 10L
df10 = read.delim(text=scanTabix(amis_txf)[[1]], h=FALSE)
data(amnames)
names(df10) = amnames
df10
## CHROM POS REF ALT genome uniprot_id transcript_id protein_variant
## 1 chr1 69094 G T hg38 Q8NH21 ENST00000335137.4 V2L
## 2 chr1 69094 G C hg38 Q8NH21 ENST00000335137.4 V2L
## 3 chr1 69094 G A hg38 Q8NH21 ENST00000335137.4 V2M
## 4 chr1 69095 T C hg38 Q8NH21 ENST00000335137.4 V2A
## 5 chr1 69095 T A hg38 Q8NH21 ENST00000335137.4 V2E
## 6 chr1 69095 T G hg38 Q8NH21 ENST00000335137.4 V2G
## 7 chr1 69097 A G hg38 Q8NH21 ENST00000335137.4 T3A
## 8 chr1 69097 A C hg38 Q8NH21 ENST00000335137.4 T3P
## 9 chr1 69097 A T hg38 Q8NH21 ENST00000335137.4 T3S
## 10 chr1 69098 C A hg38 Q8NH21 ENST00000335137.4 T3N
## am_pathogenicity am_class
## 1 0.2937 likely_benign
## 2 0.2937 likely_benign
## 3 0.3296 likely_benign
## 4 0.2609 likely_benign
## 5 0.2922 likely_benign
## 6 0.2030 likely_benign
## 7 0.0929 likely_benign
## 8 0.1264 likely_benign
## 9 0.0979 likely_benign
## 10 0.1121 likely_benign
We used the EBI GWAS catalog and searched for “sthma” in the MAPPED_TRAIT
field. The resulting records are in a GRanges instance, accessible via data(amentgr)
.
## [1] 3488
amentgr[1:3,c("STRONGEST.SNP.RISK.ALLELE", "PUBMEDID")]
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | STRONGEST.SNP.RISK.ALLELE PUBMEDID
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] 17 39909987 * | rs2290400-C 27439200
## [2] 2 15780199 * | rs2544523-T 24993907
## [3] 5 116261351 * | rs10043228-T 24993907
## -------
## seqinfo: 23 sequences from GRCh38 genome; no seqlengths
The coincidence of these GWAS hits with AlphaMissense results (amis_txf
created above) can be computed quickly,.
library(Rsamtools)
library(GenomeInfoDb)
seqlevels(amentgr) = paste0("chr", seqlevels(amentgr)) # if off line
yieldSize(amis_txf) = NA_integer_
lk = scanTabix(amis_txf, param=amentgr)
Some of the positions with GWAS hits don’t correspond to results from AlphaMissense. These positions are empty components of the list returned by scanTabx. They are removed and the scanTabix results are converted to a data.frame.
ok = which(vapply(lk, function(x)length(x)>0, logical(1)))
cc = c("character", "character", "character", "character", "character",
"character", "character", "character", "numeric", "character")
intsect = lapply(lk[ok],
function(x) read.delim(text=x, h=FALSE, colClasses=cc))
intsect_df = do.call(rbind, intsect)
data(amnames)
names(intsect_df) = amnames
head(intsect_df,3)
## CHROM POS REF ALT genome uniprot_id
## chr4:38798089-38798089.1 chr4 38798089 T A hg38 Q15399
## chr4:38798089-38798089.2 chr4 38798089 T C hg38 Q15399
## chr4:38798089-38798089.3 chr4 38798089 T G hg38 Q15399
## transcript_id protein_variant am_pathogenicity
## chr4:38798089-38798089.1 ENST00000502213.6 N248I 0.1326
## chr4:38798089-38798089.2 ENST00000502213.6 N248S 0.0737
## chr4:38798089-38798089.3 ENST00000502213.6 N248T 0.1079
## am_class
## chr4:38798089-38798089.1 likely_benign
## chr4:38798089-38798089.2 likely_benign
## chr4:38798089-38798089.3 likely_benign
At this point, intsect_df
is the collection of all substitution scores at asthma GWAS hits. We focus on the substitutions reported as risk alleles.
amentgr$RISK_ALLELE = gsub("(.*-)", "", amentgr$STRONGEST.SNP.RISK.ALLELE)
To join the missense classes with the coincident GWAS hits we build a “key” for each table and then join.
intsect_df$key = with(intsect_df, paste(CHROM, POS, ALT, sep=":"))
ament_df = as.data.frame(amentgr)
ament_df$key = with(ament_df, paste(seqnames, CHR_POS, RISK_ALLELE, sep=":"))
ia = dplyr::inner_join(intsect_df, ament_df, by="key", relationship = "many-to-many")
iau = ia[!duplicated(ia$key),]
table(iau$am_class)
##
## ambiguous likely_benign likely_pathogenic
## 1 15 3
Some of the information about the likely pathogenic substitutions that have been identified as hits for asthma is collected here:
iau |>
dplyr::filter(am_class == "likely_pathogenic") |>
dplyr::select(CHR_ID, CHR_POS, STRONGEST.SNP.RISK.ALLELE,
OR.or.BETA, MAPPED_TRAIT, MAPPED_GENE, am_class)
## CHR_ID CHR_POS STRONGEST.SNP.RISK.ALLELE OR.or.BETA MAPPED_TRAIT
## 1 5 14610200 rs16903574-G 1.0862118 asthma
## 2 1 12115601 rs2230624-A 0.1618169 asthma
## 3 9 128721272 rs11539209-A NA asthma
## MAPPED_GENE am_class
## 1 OTULINL likely_pathogenic
## 2 TNFRSF8 likely_pathogenic
## 3 ZDHHC12 likely_pathogenic