Aim

Our aim here is to improve curation of Open GWAS records by linking them to the EBI GWAS catalog.

Here are previews of gwinf (Open GWAS) and ebi (EBI catalog) resources.

library(ieugwasr) # from github mrcieu
library(gwascat)
library(dplyr) # provides mutate() and inner_join(), used in the merge step
gwinf = gwasinfo()
gwinf
## # A tibble: 42,334 × 22
##    id         trait note  group…¹    mr  year author sex      pmid popul…² unit 
##    <chr>      <chr> <chr> <chr>   <int> <int> <chr>  <chr>   <int> <chr>   <chr>
##  1 ieu-b-5075 Syst… NA    public      1  2021 Sakau… Male…  3.46e7 East A… mmHg 
##  2 ieu-b-5064 Seps… HES … public     NA  2021 Hamil… Male… NA      Europe… logOR
##  3 eqtl-a-EN… ENSG… NA    public      1  2018 Vosa U Male… NA      Europe… NA   
##  4 ukb-b-1489 Chee… 1408… public      1  2018 Ben E… Male… NA      Europe… SD   
##  5 ukb-b-8727 Age … 2764… public      1  2018 Ben E… Male… NA      Europe… SD   
##  6 ukb-a-583  Diag… NA    public      1  2017 Neale  Male… NA      Europe… SD   
##  7 eqtl-a-EN… ENSG… NA    public      1  2018 Vosa U Male… NA      Europe… NA   
##  8 ukb-b-124… Trea… 2000… public      1  2018 Ben E… Male… NA      Europe… SD   
##  9 ukb-e-767… Leng… NA    public      1  2020 Pan-U… Male… NA      East A… NA   
## 10 eqtl-a-EN… ENSG… NA    public      1  2018 Vosa U Male… NA      Europe… NA   
## # … with 42,324 more rows, 11 more variables: sample_size <int>, nsnp <int>,
## #   build <chr>, category <chr>, subcategory <chr>, ontology <chr>,
## #   ncase <int>, consortium <chr>, ncontrol <int>, priority <int>, sd <dbl>,
## #   and abbreviated variable names ¹​group_name, ²​population
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
ebi = get_cached_gwascat()
ebi
## # A tibble: 402,121 × 38
##    DATE ADDED T…¹ PUBME…² FIRST…³ DATE       JOURNAL LINK  STUDY DISEA…⁴ INITI…⁵
##    <date>           <dbl> <chr>   <date>     <chr>   <chr> <chr> <chr>   <chr>  
##  1 2009-12-14      1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
##  2 2009-12-14      1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
##  3 2009-12-14      1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
##  4 2009-12-14      1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
##  5 2009-12-14      1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
##  6 2009-12-14      1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
##  7 2009-12-14      1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
##  8 2010-09-21      2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
##  9 2010-09-21      2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## 10 2010-09-21      2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## # … with 402,111 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
## #   REGION <chr>, CHR_ID <chr>, CHR_POS <chr>, `REPORTED GENE(S)` <chr>,
## #   MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
## #   SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
## #   DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
## #   SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
## #   INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Data filtering and merging

Confine gwinf to records with accession numbers

We are likely to gain information only for records whose id contains a GCST (study accession) tag, since the accession is the key shared with the EBI catalog.

gwinf = gwinf[grep("GCST", gwinf$id),]
dim(gwinf)
## [1] 2585   22
gwinf$acc = sub("^ebi-.-", "", gwinf$id) # drop the leading "ebi-a-" style prefix, keeping the accession
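
As a quick check of the id handling, here is a minimal sketch on a few ids. The `ids` vector is illustrative rather than copied from Open GWAS output, and the `ebi-a-GCST...` shape of accession-bearing ids is an assumption.

```r
# Toy ids; the GCST-bearing ones are assumed to look like "ebi-a-GCSTnnnnnn"
ids <- c("ebi-a-GCST004364", "ieu-b-5075", "ukb-b-1489", "ebi-a-GCST90000001")

# Keep only ids that carry a GCST study accession
has_acc <- grepl("GCST", ids, fixed = TRUE)
kept <- ids[has_acc]

# Strip the leading "ebi-a-" style prefix to leave the bare accession
acc <- sub("^ebi-.-", "", kept)
acc
```

Anchoring the pattern with `^` keeps the substitution from touching anything other than the prefix.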

Filter the EBI catalog to studies

The EBI catalog holds one record per reported locus, so a study appears many times; we are interested in studies, and keep the first record for each study accession.

ebi = ebi[!duplicated(ebi$`STUDY ACCESSION`),] # one row per study; safe even if nothing is duplicated
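
The deduplication step can be sketched on a toy table. The data frame `toy` and its values are made up for illustration; only the column names `STUDY ACCESSION` and `SNPS` come from the catalog. The sketch also shows why a logical mask with `!duplicated()` is a safer idiom than negative indexing with `which()`: when nothing is duplicated, `-which(...)` yields an empty index and drops every row.

```r
toy <- data.frame(
  `STUDY ACCESSION` = c("GCST000001", "GCST000001", "GCST000002"),
  SNPS = c("rs1", "rs2", "rs3"),
  check.names = FALSE  # keep the space in the column name
)

# One row per study: keep the first association record for each accession
studies <- toy[!duplicated(toy$`STUDY ACCESSION`), ]
nrow(studies)

# Edge case: a table that is already one row per study
uniq <- studies
by_mask  <- uniq[!duplicated(uniq$`STUDY ACCESSION`), ]        # keeps both rows
by_which <- uniq[-which(duplicated(uniq$`STUDY ACCESSION`)), ] # keeps no rows
c(mask = nrow(by_mask), which = nrow(by_which))
```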

Merge and inspect

nn = inner_join(mutate(ebi, acc=`STUDY ACCESSION`), gwinf, by="acc")
dim(nn)
## [1] 1812   61
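
Comparing the dimensions before and after the join (2585 GCST-tagged records, 1812 merged rows) suggests that some accessions found no partner in the catalog snapshot. A minimal sketch of how one might list them follows. It uses toy stand-ins for `gwinf` and `ebi` (the names `gw_toy` and `ebi_toy` and all values are made up) and base R `merge()` in place of `inner_join()`.

```r
# Toy stand-ins; values are illustrative only
gw_toy <- data.frame(
  id  = c("ebi-a-GCST000001", "ebi-a-GCST000002", "ebi-a-GCST000003"),
  acc = c("GCST000001", "GCST000002", "GCST000003")
)
ebi_toy <- data.frame(
  `STUDY ACCESSION` = c("GCST000001", "GCST000003", "GCST000009"),
  STUDY = c("study A", "study C", "study Z"),
  check.names = FALSE  # keep the space in the column name
)

# Inner join on the accession (base R counterpart of inner_join)
merged <- merge(gw_toy, ebi_toy, by.x = "acc", by.y = "STUDY ACCESSION")

# Open GWAS accessions with no match in the catalog table
unmatched <- setdiff(gw_toy$acc, ebi_toy$`STUDY ACCESSION`)
nrow(merged)
unmatched
```

Applied to the real tables, the same `setdiff()` call on `gwinf$acc` and the catalog's `STUDY ACCESSION` column would give the accessions worth checking by hand.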