Question

GRCh38 reference paths in human pangenome

0

Entering edit mode

7 weeks ago

Rugare • 0

Hello, I am following the below tutorial from nvidia for aligning short reads to a pangenome before variant calling with DeepVariant:

# Extract the list of paths corresponding to GRCh38
docker run --rm \
    -v $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -L > hprc-v1.0-mc-grch38-minaf.0.1.paths

# Filter paths list
grep -v _decoy hprc-v1.0-mc-grch38-minaf.0.1.paths \
    | grep -v _random \
    | grep -v chrUn_ \
    | grep -v chrEBV \
    | grep -v chrM \
    | grep -v chain_ > hprc-v1.0-mc-grch38-minaf.0.1.paths.sub

# Extract the corresponding sequences to a FASTA file
docker run --rm \
    -v $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -p hprc-v1.0-mc-grch38-minaf.0.1.paths.sub -F > hprc-v1.0-mc-grch38-minaf.0.1.fa

# Index the fasta file
samtools faidx hprc-v1.0-mc-grch38-minaf.0.1.fa

I notice all the GRCh38 paths are haplotype_id #0#. I would have expected some paths to be #1# or #2#

My question is when I "Extract the corresponding sequences to a FASTA file" in the tutorial, is the output FASTA file meaningfully different to a linear copy of GRCh38 I could get for elsewhere given all haplotype_ids are #0#? If so, how?

For context, I aim to compare variants called after pangenome alignment vs linear reference alignment.

Thanks in advance! R

HPRC GRCh38 • 322 views

ADD COMMENT • link updated 6 weeks ago by cmdcolin ★ 4.2k • written 7 weeks ago by Rugare • 0

1

Entering edit mode

The GRCh38 reference genome is "haploid" (or put differently: it is not a "phased assembly") so there is only 1 copy of all the chromosomes.

The PanSN spec (https://github.com/pangenome/PanSN-spec, which inspires the # namings) haplotype numbering system isn't meticulously specified so they maybe arbitrarily numbered the haplotype with 0.

My question is when I "Extract the corresponding sequences to a FASTA file" in the tutorial, is the output FASTA file meaningfully different to a linear copy of GRCh38 I could get for elsewhere given all haplotype_ids are #0#? If so, how?

I would guess that it is not meaningfully different, if different at all, to the GRCh38 linear assembly that you can get anywhere

ADD REPLY • link 6 weeks ago by cmdcolin ★ 4.2k