Introduction

State-of-the-art work on a human pangenome reference (as of 2023) is described in Liao et al. The first draft “contains 47 phased, diploid assemblies from a cohort of genetically diverse individuals”. This R package demonstrates how Bioconductor data structures can be used to explore resources from that first draft.

Minigraphs presented as GRanges

From the Liao et al. paper:

“Minigraph builds a pangenome by starting from a reference assembly, here GRCh38, and iteratively and progressively adds in additional assemblies, recording only SVs ≥ 50 bases. It admits complex variants, including duplications and inversions.”

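library(BiocHPRClk)   # supplies the minigr_GRCh38 dataset
library(S4Vectors)    # supplies metadata(), used further down
data(minigr_GRCh38)   # load the GRanges of minigraph bubbles
minigr_GRCh38         # print a summary of the object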
## GRanges object with 77071 ranges and 11 metadata columns:
##                   seqnames      ranges strand |        V4        V5        V6
##                      <Rle>   <IRanges>  <Rle> | <integer> <integer> <integer>
##       [1]             chr1 10621-10936      * |        15        11         0
##       [2]             chr1 20818-20863      * |         4         2         0
##       [3]             chr1 30892-30949      * |         4         2         0
##       [4]             chr1       59598      * |         3         2         0
##       [5]             chr1 65523-66620      * |         6         3         0
##       ...              ...         ...    ... .       ...       ...       ...
##   [77067] chrUn_GL000216v2 69298-69348      * |         4         2         0
##   [77068] chrUn_GL000216v2 70413-70468      * |         3         2         0
##   [77069] chrUn_GL000216v2       73381      * |         3         2         0
##   [77070] chrUn_GL000216v2 73640-73744      * |         4         2         0
##   [77071] chrUn_GL000216v2       74544      * |         3         2         0
##                  V7        V8        V9       V10       V11
##           <integer> <integer> <integer> <integer> <integer>
##       [1]       315       551        -1        -1        -1
##       [2]        45      2070        -1        -1        -1
##       [3]        25        57        -1        -1        -1
##       [4]         0       315        -1        -1        -1
##       [5]        40      1097        -1        -1        -1
##       ...       ...       ...       ...       ...       ...
##   [77067]        10        50        -1        -1        -1
##   [77068]         0        55        -1        -1        -1
##   [77069]         0        50        -1        -1        -1
##   [77070]        49       104        -1        -1        -1
##   [77071]         0       780        -1        -1        -1
##                              V12           shortest_path
##                      <character>          <DNAStringSet>
##       [1] s1,s409786,s408358,s.. TTGCAAAGGC...ACCGCGCCGG
##       [2]       s6,s253481,s7,s8 GTGCATCCAG...GAAAACAGAG
##       [3]      s8,s366808,s9,s10 TCTATCTCTA...TTTCTCTCTC
##       [4]        s10,s253480,s11                       N
##       [5] s11,s403012,s12,s13,.. ATGCTACTAT...TTTTCACAGA
##       ...                    ...                     ...
##   [77067] s185489,s368091,s185..              CCCATTCAGG
##   [77068] s185491,s185492,s185..                       N
##   [77069] s185493,s376236,s185..                       N
##   [77070] s185494,s376237,s185.. TCATTCCATT...TCCATTCGAG
##   [77071] s185496,s344851,s185..                       N
##                      longest_path
##                    <DNAStringSet>
##       [1] TTGCAAAGGC...GCCGTGCTGC
##       [2] CCCTGGACTC...CAGCCTGGGA
##       [3] ATTTCTCTCT...TCTCTCTCTT
##       [4] AGGATTCTTT...GCGCCCGGCC
##       [5] TATGCCTCAT...AAATATATAT
##       ...                     ...
##   [77067] TCCTTTAGAG...TTCATTCCAT
##   [77068] TCCATTCTGT...TCCATTCCAT
##   [77069] CATTCGAATT...CATTCCATTC
##   [77070] CCATTCCATT...TCCATTCCAT
##   [77071] ACCATTCCAT...TTGAGTCCAT
##   -------
##   seqinfo: 54 sequences from GRCh38 genome; no seqlengths
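
Because the bubbles are stored as a GRanges, the usual Bioconductor range operations apply. The following is a minimal sketch, not part of the package: it rebuilds a toy GRanges from the first five printed rows (only a few of the metadata columns) so that it runs without the full dataset. The object name `toy` and the region of interest `roi` are hypothetical, chosen for illustration.

```r
library(GenomicRanges)

# Toy GRanges copied from the first five printed rows above.
# Only columns V4, V6, V7 and V8 are reproduced here.
toy <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(
    start = c(10621, 20818, 30892, 59598, 65523),
    end   = c(10936, 20863, 30949, 59598, 66620)
  ),
  V4 = c(15L, 4L, 4L, 3L, 6L),
  V6 = c(0L, 0L, 0L, 0L, 0L),
  V7 = c(315L, 45L, 25L, 0L, 40L),
  V8 = c(551L, 2070L, 57L, 315L, 1097L)
)

# Width of each bubble on the reference, taken from the ranges.
ref_width <- width(toy)

# Keep bubbles whose V8 value (length of the longest path, per the
# object's metadata) is at least 1000 bases.
long_bubbles <- toy[toy$V8 >= 1000]

# Keep bubbles overlapping a region of interest (hypothetical coordinates).
roi <- GRanges("chr1", IRanges(20000, 60000))
in_roi <- subsetByOverlaps(toy, roi)
```

The same calls should work unchanged on `minigr_GRCh38` itself, assuming the package data are loaded as shown above.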

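The object's metadata describes these columns. V4 through V12 correspond to fields (4) through (12) of the description, and shortest_path and longest_path to fields (13) and (14):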

cat(metadata(minigr_GRCh38)[[1]])
## The output is a BED-like file. The first three columns give the position of a bubble/variation and the rest of columns are:
## 
## (4) # GFA segments in the bubble including the source and the sink of the bubble
## (5) # all possible paths through the bubble (not all paths present in input samples)
## (6) 1 if the bubble involves an inversion; 0 otherwise
## (7) length of the shortest path (i.e. allele) through the bubble
## (8) length of the longest path/allele through the bubble
## (9-11) please ignore
## (12) list of segments in the bubble; first for the source and last for the sink
## (13) sequence of the shortest path (* if zero length)
## (14) sequence of the longest path (NB: it may not be present in the input samples)
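
The field descriptions above suggest a few simple derived quantities. The sketch below uses base R on a small table copied from the first five printed rows; the descriptive column names (`n_segments`, `n_paths`, and so on) are hypothetical labels introduced here for readability, not names used by the package. The segment strings are taken only from rows 2 to 4, because the printed V12 values for the other rows are truncated.

```r
# Columns V4 to V8 for the first five printed rows.
bubbles <- data.frame(
  V4 = c(15L, 4L, 4L, 3L, 6L),
  V5 = c(11L, 2L, 2L, 2L, 3L),
  V6 = c(0L, 0L, 0L, 0L, 0L),
  V7 = c(315L, 45L, 25L, 0L, 40L),
  V8 = c(551L, 2070L, 57L, 315L, 1097L)
)

# Hypothetical descriptive names, following fields (4) to (8).
names(bubbles) <- c("n_segments", "n_paths", "is_inversion",
                    "shortest_len", "longest_len")

# Length difference between the longest and shortest path per bubble.
bubbles$len_diff <- bubbles$longest_len - bubbles$shortest_len

# Number of bubbles flagged as involving an inversion.
n_inversions <- sum(bubbles$is_inversion == 1L)

# Rows whose shortest path has zero length.
zero_len_rows <- which(bubbles$shortest_len == 0L)

# Field (12): split the comma-separated segment list (rows 2 to 4 only).
seg <- c("s6,s253481,s7,s8", "s8,s366808,s9,s10", "s10,s253480,s11")
seg_list <- strsplit(seg, ",", fixed = TRUE)
n_seg <- lengths(seg_list)  # compare with V4 for the same rows

# First element is the source of the bubble, last element is the sink.
src_seg  <- vapply(seg_list, function(s) s[1], character(1))
sink_seg <- vapply(seg_list, function(s) s[length(s)], character(1))
```

In the printed rows above, bubbles with V7 equal to 0 show N in shortest_path, and for rows 2 to 4 the number of listed segments agrees with V4.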