BiocHPRClk.Rmd
State-of-the-art work on a human pangenome reference (as of 2023) is described in Liao et al. The first draft “contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals”. The purpose of this R package is to demonstrate how Bioconductor data structures can be used to explore resources from that first draft.
From the Liao et al. paper:
Minigraph builds a pangenome by starting from a reference assembly, here GRCh38, and iteratively and progressively adds in additional assemblies, recording only SVs ≥ 50 bases. It admits complex variants, including duplications and inversions.
library(BiocHPRClk)
library(S4Vectors)
data(minigr_GRCh38)
minigr_GRCh38
## GRanges object with 77071 ranges and 11 metadata columns:
## seqnames ranges strand | V4 V5 V6
## <Rle> <IRanges> <Rle> | <integer> <integer> <integer>
## [1] chr1 10621-10936 * | 15 11 0
## [2] chr1 20818-20863 * | 4 2 0
## [3] chr1 30892-30949 * | 4 2 0
## [4] chr1 59598 * | 3 2 0
## [5] chr1 65523-66620 * | 6 3 0
## ... ... ... ... . ... ... ...
## [77067] chrUn_GL000216v2 69298-69348 * | 4 2 0
## [77068] chrUn_GL000216v2 70413-70468 * | 3 2 0
## [77069] chrUn_GL000216v2 73381 * | 3 2 0
## [77070] chrUn_GL000216v2 73640-73744 * | 4 2 0
## [77071] chrUn_GL000216v2 74544 * | 3 2 0
## V7 V8 V9 V10 V11
## <integer> <integer> <integer> <integer> <integer>
## [1] 315 551 -1 -1 -1
## [2] 45 2070 -1 -1 -1
## [3] 25 57 -1 -1 -1
## [4] 0 315 -1 -1 -1
## [5] 40 1097 -1 -1 -1
## ... ... ... ... ... ...
## [77067] 10 50 -1 -1 -1
## [77068] 0 55 -1 -1 -1
## [77069] 0 50 -1 -1 -1
## [77070] 49 104 -1 -1 -1
## [77071] 0 780 -1 -1 -1
## V12 shortest_path
## <character> <DNAStringSet>
## [1] s1,s409786,s408358,s.. TTGCAAAGGC...ACCGCGCCGG
## [2] s6,s253481,s7,s8 GTGCATCCAG...GAAAACAGAG
## [3] s8,s366808,s9,s10 TCTATCTCTA...TTTCTCTCTC
## [4] s10,s253480,s11 N
## [5] s11,s403012,s12,s13,.. ATGCTACTAT...TTTTCACAGA
## ... ... ...
## [77067] s185489,s368091,s185.. CCCATTCAGG
## [77068] s185491,s185492,s185.. N
## [77069] s185493,s376236,s185.. N
## [77070] s185494,s376237,s185.. TCATTCCATT...TCCATTCGAG
## [77071] s185496,s344851,s185.. N
## longest_path
## <DNAStringSet>
## [1] TTGCAAAGGC...GCCGTGCTGC
## [2] CCCTGGACTC...CAGCCTGGGA
## [3] ATTTCTCTCT...TCTCTCTCTT
## [4] AGGATTCTTT...GCGCCCGGCC
## [5] TATGCCTCAT...AAATATATAT
## ... ...
## [77067] TCCTTTAGAG...TTCATTCCAT
## [77068] TCCATTCTGT...TCCATTCCAT
## [77069] CATTCGAATT...CATTCCATTC
## [77070] CCATTCCATT...TCCATTCCAT
## [77071] ACCATTCCAT...TTGAGTCCAT
## -------
## seqinfo: 54 sequences from GRCh38 genome; no seqlengths
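The metadata columns are described below, numbered by their position in the BED-like source file: column 4 corresponds to V4, and so on through V12, while columns 13 and 14 correspond to shortest_path and longest_path. Note that in the GRanges object printed above, a zero-length shortest path (V7 equal to 0) appears as N rather than the * mentioned under (13). The description reads: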
## The output is a BED-like file. The first three columns give the position of a bubble/variation and the rest of columns are:
##
## (4) # GFA segments in the bubble including the source and the sink of the bubble
## (5) # all possible paths through the bubble (not all paths present in input samples)
## (6) 1 if the bubble involves an inversion; 0 otherwise
## (7) length of the shortest path (i.e. allele) through the bubble
## (8) length of the longest path/allele through the bubble
## (9-11) please ignore
## (12) list of segments in the bubble; first for the source and last for the sink
## (13) sequence of the shortest path (* if zero length)
## (14) sequence of the longest path (NB: it may not be present in the input samples)
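The column descriptions above can be turned into a small worked example. The following is a minimal sketch in base R, not part of the package: it uses a toy data frame whose values are copied from rows [2] to [4] of the printed object (the rows whose V12 entry is not truncated in the display), and the descriptive column names (`n_segments`, `n_paths`, and so on) are hypothetical labels chosen here for illustration.

```r
# Toy stand-in for part of the metadata of minigr_GRCh38.
# Values come from rows [2]-[4] of the printed object above.
bubbles <- data.frame(
  V4  = c(4L, 4L, 3L),
  V5  = c(2L, 2L, 2L),
  V6  = c(0L, 0L, 0L),
  V7  = c(45L, 25L, 0L),
  V8  = c(2070L, 57L, 315L),
  V12 = c("s6,s253481,s7,s8", "s8,s366808,s9,s10", "s10,s253480,s11"),
  stringsAsFactors = FALSE
)

# Hypothetical descriptive names, following the column description above.
descriptive <- c(V4 = "n_segments", V5 = "n_paths", V6 = "inversion",
                 V7 = "shortest_len", V8 = "longest_len", V12 = "segments")
names(bubbles) <- unname(descriptive[names(bubbles)])

# Column (12): split the comma-separated segment list; the first element is
# the source of the bubble and the last element is the sink.
seg_list <- strsplit(bubbles$segments, ",", fixed = TRUE)
bubbles$source   <- vapply(seg_list, function(s) s[1], character(1))
bubbles$sink     <- vapply(seg_list, function(s) s[length(s)], character(1))
bubbles$n_listed <- lengths(seg_list)

# Column (6): recode the 0/1 inversion flag as a logical.
bubbles$inversion <- bubbles$inversion == 1L

# Columns (7) and (8): spread between the longest and shortest allele lengths.
bubbles$len_range <- bubbles$longest_len - bubbles$shortest_len

bubbles[, c("source", "sink", "n_segments", "n_listed", "len_range")]
```

For these three rows the number of listed segments matches column (4), and the sink of each bubble is the source of the next one. On the real object, the analogous steps would presumably start from `mcols(minigr_GRCh38)$V12`, and `minigr_GRCh38[minigr_GRCh38$V6 == 1]` should select the bubbles flagged as involving an inversion.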