vignettes/ieu_cat.Rmd
Our aim here is to improve the curation of Open GWAS records by coupling them to the EBI GWAS catalog.
Here are previews of gwinf (Open GWAS) and ebi (EBI catalog) resources.
## # A tibble: 42,334 × 22
## id trait note group…¹ mr year author sex pmid popul…² unit
## <chr> <chr> <chr> <chr> <int> <int> <chr> <chr> <int> <chr> <chr>
## 1 ieu-b-5075 Syst… NA public 1 2021 Sakau… Male… 3.46e7 East A… mmHg
## 2 ieu-b-5064 Seps… HES … public NA 2021 Hamil… Male… NA Europe… logOR
## 3 eqtl-a-EN… ENSG… NA public 1 2018 Vosa U Male… NA Europe… NA
## 4 ukb-b-1489 Chee… 1408… public 1 2018 Ben E… Male… NA Europe… SD
## 5 ukb-b-8727 Age … 2764… public 1 2018 Ben E… Male… NA Europe… SD
## 6 ukb-a-583 Diag… NA public 1 2017 Neale Male… NA Europe… SD
## 7 eqtl-a-EN… ENSG… NA public 1 2018 Vosa U Male… NA Europe… NA
## 8 ukb-b-124… Trea… 2000… public 1 2018 Ben E… Male… NA Europe… SD
## 9 ukb-e-767… Leng… NA public 1 2020 Pan-U… Male… NA East A… NA
## 10 eqtl-a-EN… ENSG… NA public 1 2018 Vosa U Male… NA Europe… NA
## # … with 42,324 more rows, 11 more variables: sample_size <int>, nsnp <int>,
## # build <chr>, category <chr>, subcategory <chr>, ontology <chr>,
## # ncase <int>, consortium <chr>, ncontrol <int>, priority <int>, sd <dbl>,
## # and abbreviated variable names ¹group_name, ²population
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
ebi = get_cached_gwascat()
ebi
## # A tibble: 402,121 × 38
## DATE ADDED T…¹ PUBME…² FIRST…³ DATE JOURNAL LINK STUDY DISEA…⁴ INITI…⁵
## <date> <dbl> <chr> <date> <chr> <chr> <chr> <chr> <chr>
## 1 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 2 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 3 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 4 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 5 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 6 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 7 2009-12-14 1.99e7 Asano K 2009-11-15 Nat Ge… www.… A ge… Ulcera… 376 Ja…
## 8 2010-09-21 2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## 9 2010-09-21 2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## 10 2010-09-21 2.07e7 Ferrei… 2010-08-08 Nat Ge… www.… Asso… Immuno… 430 Eu…
## # … with 402,111 more rows, 29 more variables: `REPLICATION SAMPLE SIZE` <chr>,
## # REGION <chr>, CHR_ID <chr>, CHR_POS <chr>, `REPORTED GENE(S)` <chr>,
## # MAPPED_GENE <chr>, UPSTREAM_GENE_ID <chr>, DOWNSTREAM_GENE_ID <chr>,
## # SNP_GENE_IDS <chr>, UPSTREAM_GENE_DISTANCE <dbl>,
## # DOWNSTREAM_GENE_DISTANCE <dbl>, `STRONGEST SNP-RISK ALLELE` <chr>,
## # SNPS <chr>, MERGED <dbl>, SNP_ID_CURRENT <dbl>, CONTEXT <chr>,
## # INTERGENIC <dbl>, `RISK ALLELE FREQUENCY` <chr>, `P-VALUE` <dbl>, …
## # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
Restricting gwinf to records with accession numbers
We’ll likely only gain information for records with GCST (study accession) tags.
## [1] 2585 22
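The filtering step that produced the dimensions above is not shown in this rendering. A minimal sketch of one way to do it, assuming that the Open GWAS ids of EBI-derived records embed the accession (for example `ebi-a-GCST...`). The toy table and the `toy_*` names are invented for illustration and are not part of the package:

```r
# Toy stand-in for gwinf; ids and traits are invented for illustration.
toy_gwinf = data.frame(
  id = c("ebi-a-GCST000001", "ukb-b-0001", "ebi-a-GCST000002", "ieu-b-0001"),
  trait = c("trait A", "trait B", "trait C", "trait D"),
  stringsAsFactors = FALSE
)

# Keep only records whose id carries a GCST study accession.
toy_gcst = toy_gwinf[grepl("GCST", toy_gwinf$id), ]

# Rows and columns that survive the filter.
dim(toy_gcst)
```

Logical indexing with `grepl()` keeps the result well-defined even when no id matches, in which case the filtered table simply has zero rows.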
gwinf$acc = gsub("ebi-..", "", gwinf$id)
The EBI catalog has locus-specific records, but we are interested in studies, so we keep one row per study accession.
# Logical indexing is safe even when there are no duplicates; -which() would then drop every row.
ebi = ebi[!duplicated(ebi$`STUDY ACCESSION`),]
nn = inner_join(mutate(ebi, acc=`STUDY ACCESSION`), gwinf, by="acc")
dim(nn)
## [1] 1812 61
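The join returns 1812 rows against the 2585 GCST-tagged Open GWAS records, which suggests that some accessions have no counterpart in the cached catalog. Below is a sketch, on toy data, of the deduplicate-and-join step together with an `anti_join()` that lists the unmatched records. The toy tables and the `toy_*` names are illustrative assumptions, not objects from the package:

```r
library(dplyr)

# Toy stand-ins; all values are invented for illustration.
toy_ebi = tibble(
  `STUDY ACCESSION` = c("GCST000001", "GCST000001", "GCST000003"),
  SNPS = c("rs1", "rs2", "rs3")
)
toy_gwinf = tibble(
  id = c("ebi-a-GCST000001", "ebi-a-GCST000002"),
  acc = c("GCST000001", "GCST000002")
)

# Collapse locus-level rows to one row per study and expose the join key.
toy_studies = toy_ebi[!duplicated(toy_ebi$`STUDY ACCESSION`), ]
toy_studies = mutate(toy_studies, acc = `STUDY ACCESSION`)

# Studies present in both resources.
toy_nn = inner_join(toy_studies, toy_gwinf, by = "acc")

# Open GWAS records whose accession has no match in the catalog table.
toy_unmatched = anti_join(toy_gwinf, toy_studies, by = "acc")

dim(toy_nn)
toy_unmatched$acc
```

Inspecting the unmatched accessions is one way to tell whether the gap comes from an outdated catalog snapshot or from ids that do not follow the assumed `ebi-a-GCST...` pattern.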