Skip to contents

Introduction

Diverse strategies have been proposed for managing archives of DNA variants in humans.

We’ll focus on VCF and Plink in this document.

VCF access and manipulations

We will work with bgzipped, tabix-indexed resources.

A demonstration VCF archive is available with biocXqtl. The associated vcf.gz file was formed by filtering publicly available variants from the 1000 genomes project. See the Appendix for information on how the demonstration VCF was produced. A key filtering step involved limiting inclusion to those variants for which at least 10 individuals were present in each of the classes: homozygous rare, heterozygous, homozygous common.

Setup:

## Warning: package 'GenomicRanges' was built under R version 4.5.2
## Warning: replacing previous import 'BiocGenerics::type' by 'arrow::type' when
## loading 'biocXqtl'
library(VariantAnnotation)
vpath = demo_vcf()

First we examine the header.

h = scanVcfHeader(vpath)
h
## class: VCFHeader 
## samples(731): HG00096 HG00100 ... NA21129 NA21130
## meta(5): fileDate fileformat reference source contig
## fixed(2): FILTER ALT
## info(27): CIEND CIPOS ... EX_TARGET MULTI_ALLELIC
## geno(1): GT

The header report indicates that genotypes are available for 731 individuals.

Direct ingestion of VCF can consume considerable RAM. We can select regions of interest and scan with inline filtering. See the documentation on ScanVcfParam in VariantAnnotation.

library(GenomicRanges)
p = GRanges("19:1-3000000")
ss = ScanVcfParam(which=p)
sc1 = scanVcf(vpath, param=ss)

Each range in the ScanVcfParam will produce a list element in the result of scanVcf.

names(sc1[[1]])
## [1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"     
## [7] "GENO"
dim(sc1[[1]]$GENO[[1]])
## [1] 9594  731
sc1[[1]]$rowRanges
## GRanges object with 9594 ranges and 0 metadata columns:
##               seqnames          ranges strand
##                  <Rle>       <IRanges>  <Rle>
##   rs567986644       19          160565      *
##   rs548347106       19   183097-183102      *
##   rs527871033       19   212785-212786      *
##   rs533243783       19          219620      *
##    rs62102979       19          226776      *
##           ...      ...             ...    ...
##    rs12982139       19         2999553      *
##   rs199521660       19         2999724      *
##   rs200315688       19 2999734-2999738      *
##    rs13345197       19         2999759      *
##    rs62125443       19         2999804      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

For purposes of statistical modeling, we have a helper function that retrieves minor allele counts from a tabix-indexed VCF. This returns a GRanges with counts in the mcols. The addresses are bound with the counts.

cnts = minorAlleleCounts(vpath, region=GRanges("19:1-3000000"))
cnts[,1:2]
## GRanges object with 9594 ranges and 2 metadata columns:
##               seqnames          ranges strand |   HG00096   HG00100
##                  <Rle>       <IRanges>  <Rle> | <numeric> <numeric>
##   rs567986644       19          160565      * |         0         1
##   rs548347106       19   183097-183102      * |         1         1
##   rs527871033       19   212785-212786      * |         2         1
##   rs533243783       19          219620      * |         1         1
##    rs62102979       19          226776      * |         1         1
##           ...      ...             ...    ... .       ...       ...
##    rs12982139       19         2999553      * |         1         0
##   rs199521660       19         2999724      * |         0         1
##   rs200315688       19 2999734-2999738      * |         1         0
##    rs13345197       19         2999759      * |         1         0
##    rs62125443       19         2999804      * |         0         2
##   -------
##   seqinfo: 1 sequence from GRCh38 genome; no seqlengths

Example data

The data were extracted from the GWAS tutorial.

zfi = system.file("plink", "chr19_1kglim.zip", package="biocXqtl")
tdir = tempdir()
unzip(zfi, exdir=tdir)
dir(tdir)
##  [1] "BiocStyle"                      "chr19_limsamples_maf10.bed"    
##  [3] "chr19_limsamples_maf10.bim"     "chr19_limsamples_maf10.fam"    
##  [5] "chr19_limsamples_maf10.log"     "file91142cab4512"              
##  [7] "file9114377a83a4"               "file91143dce457d"              
##  [9] "file9114462d36e0"               "file91144e8d4038"              
## [11] "make_bed.sh"                    "rmarkdown-str91142db56957.html"

Using genio

This package seems to be capable only of ‘full reads’.

library(genio)
fi = grep("bed$", dir(tdir, full=TRUE), value=TRUE)
dem = read_plink(fi)
dim(dem$X)
## [1] 163640    104

Using python bed-reader

Here we can make targeted reads from the resource.

## Warning: package 'reticulate' was built under R version 4.5.2
py_require("bed-reader")
br = import("bed_reader")
res = br$open_bed(fi)
toget = tuple(0:4, 0:5)
limread = res$read(toget)
dim(limread)
## [1] 5 6
head(res$bp_position)
## [1] 80840 81039 81806 90854 93234 95981
head(res$chromosome)
## [1] "19" "19" "19" "19" "19" "19"
head(res$sid)
## [1] "rs201816663" "rs879919317" "rs2432259"   "rs200254023" "rs151275984"
## [6] "rs8110113"
head(res$iid)
## [1] "HG00096" "HG00097" "HG00099" "HG00100" "HG00101" "HG00171"

Appendix

VCF production

Code to produce the demonstration VCF is

library(biocXqtl)
library(VariantAnnotation)
data(mageSE_19) # produced from MAGE "DESeq2" quantifications
cc = colnames(mageSE_19)
data(tiles38)
library(GenomeInfoDb)
seqlevelsStyle(tiles38) = "Ensembl"
t19 = tiles38[which(seqnames(tiles38)=="19")]
bigvcf = filterVcfByGenotypeCounts("~/MAGE/ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz", samples=cc, min_per_genotype=10, regions=range(t19))

The vcf.gz file was obtained following instructions at https://registry.opendata.aws/1000-genomes/.

The code given above runs on a 32 GB macbook in a few minutes, with the following report:

Total samples in VCF: 2504
Analyzing 731 samples
Reading VCF...
[W::hts_idx_load3] The index file is older than the data file: 
 .../MAGE/ALL.chr19.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.tbi
Read 1816698 variants
Counting genotypes...
Variants passing filter: 156943 / 1816698
  Mean counts - HomRef: 690.8, Het: 25.4, HomAlt: 14.4

Folder GWAS/project1/data/raw in the tutorial download has the relevant contents.

An extract from chr19 is produced using

plink2 \
  --pfile all_hg38 vzs \
  --chr 19 \
  --maf 0.1 \
  --keep kp.txt \
  --make-pgen vzs \
  --allow-extra-chr \
  --out chr19_limsamples_pp1

kp.txt is in inst/plink and is a selection of samples with diverse ancestries.

PLINK bed-format files are produced using

plink2 \
  --pfile chr19_limsamples_pp1 vzs \
  --make-bed \
  --out chr19_limsamples_maf10