vignettes/ieu_cat.Rmd
Our aim here is to improve the curation of Open GWAS records by linking them to the EBI GWAS catalog.
Here are previews of the gwinf (Open GWAS) and ebi (EBI catalog) resources.
## # A tibble: 42,334 × 22
## id trait note group…¹ mr year author sex pmid popul…² unit
## <chr> <chr> <chr> <chr> <int> <int> <chr> <chr> <int> <chr> <chr>
## 1 ieu-b-5075 Syst… NA public 1 2021 Sakau… Male… 3.46e7 East A… mmHg
## 2 ieu-b-5064 Seps… HES … public NA 2021 Hamil… Male… NA Europe… logOR
## 3 eqtl-a-EN… ENSG… NA public 1 2018 Vosa U Male… NA Europe… NA
## 4 ukb-b-1489 Chee… 1408… public 1 2018 Ben E… Male… NA Europe… SD
## 5 ukb-b-8727 Age … 2764… public 1 2018 Ben E… Male… NA Europe… SD
## 6 ukb-a-583 Diag… NA public 1 2017 Neale Male… NA Europe… SD
## 7 eqtl-a-EN… ENSG… NA public 1 2018 Vosa U Male… NA Europe… NA
## 8 ukb-b-124… Trea… 2000… public 1 2018 Ben E… Male… NA Europe… SD
## 9 ukb-e-767… Leng… NA public 1 2020 Pan-U… Male… NA East A… NA
## 10 eqtl-a-EN… ENSG… NA public 1 2018 Vosa U Male… NA Europe… NA
## # … with 42,324 more rows, 11 more variables: sample_size <int>, nsnp <int>,
## # build <chr>, category <chr>, subcategory <chr>, ontology <chr>,
## # ncase <int>, consortium <chr>, ncontrol <int>, priority <int>, sd <dbl>,
## # and abbreviated variable names ¹group_name, ²population
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
ebi = get_cached_gwascat()
ebi
## # A tibble: 402,121 × 38
## DATE ADDED T…¹ PUBME…² FIRST…³ DATE JOURNAL LINK STUDY DISEA…⁴ INITI…⁵
## <date> <dbl> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 2 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 3 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 4 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 5 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 6 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 7 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 8 2010-09-21 2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## 9 2010-09-21 2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## 10 2010-09-21 2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## # … with 402,111 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
## # REGION <chr>, CHR_ID <chr>, CHR_POS <chr>, `REPORTED GENE(S)` <chr>,
## # MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
## # SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
## # DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
## # SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
## # INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Restrict gwinf to records with accession numbers
We’ll likely only gain information for records with GCST (study accession) tags.
## [1] 2585 22
gwinf$acc = gsub("ebi-..", "", gwinf$id)
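The filtering step that produced the dimensions above is not shown in the text. A minimal sketch of how such a filter and the accession extraction might look, using a small made-up data frame (`toy` and its ids are hypothetical, and `grepl("GCST", ...)` is an assumption about how the records were selected):

```r
# Toy stand-in for gwinf; these ids and traits are invented for illustration.
toy <- data.frame(
  id = c("ebi-a-GCST000001", "ukb-b-0001", "ebi-a-GCST000002", "ieu-a-2"),
  trait = c("trait A", "trait B", "trait C", "trait D"),
  stringsAsFactors = FALSE
)

# Assumed filter: keep only records whose id contains a GCST study accession.
toy_gcst <- toy[grepl("GCST", toy$id), ]

# Same pattern as in the text: drop "ebi-" plus the two characters that follow,
# leaving the bare accession.
toy_gcst$acc <- gsub("ebi-..", "", toy_gcst$id)

dim(toy_gcst)
toy_gcst$acc
```

The pattern relies on every retained id having the form `ebi-<one character>-GCST…`; an id with a different prefix would pass through `gsub` unchanged and then fail to match any catalog accession.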
The EBI catalog has locus-specific records, so a study appears once per reported locus; we are interested in studies, and keep one row per study accession.
ebi = ebi[!duplicated(ebi$`STUDY ACCESSION`),]
nn = inner_join(mutate(ebi, acc=`STUDY ACCESSION`), gwinf, by="acc")
dim(nn)
## [1] 1812 61
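The deduplicate-then-join step can be tried on toy data. The sketch below is illustrative only: `toy_ebi`, `toy_gwinf` and their values are made up, and the `anti_join` check at the end is an addition not present in the original workflow. It uses `!duplicated(...)` for the deduplication, which also behaves sensibly when there are no duplicates to remove.

```r
library(dplyr)

# Toy stand-in for the locus-level EBI table: two loci for one study, one for another.
toy_ebi <- tibble(
  `STUDY ACCESSION` = c("GCST000001", "GCST000001", "GCST000003"),
  SNPS = c("rs1", "rs2", "rs3")
)

# Toy stand-in for gwinf after the accession has been extracted into `acc`.
toy_gwinf <- tibble(
  id = c("ebi-a-GCST000001", "ebi-a-GCST000002"),
  acc = c("GCST000001", "GCST000002")
)

# One row per study: keep the first row seen for each accession.
toy_studies <- toy_ebi[!duplicated(toy_ebi$`STUDY ACCESSION`), ]

# Join the two resources on the shared accession key.
toy_nn <- inner_join(mutate(toy_studies, acc = `STUDY ACCESSION`), toy_gwinf, by = "acc")
dim(toy_nn)

# Open GWAS records that carry an accession but found no catalog match.
toy_unmatched <- anti_join(toy_gwinf, toy_studies, by = c("acc" = "STUDY ACCESSION"))
toy_unmatched
```

In the real data, the drop from 2585 filtered records to 1812 joined rows suggests that several hundred GCST-tagged Open GWAS records have no match in the cached catalog; a check like the `anti_join` above would be one way to list them.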
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html