GRCh38 reference paths in human pangenome
0
0
Entering edit mode
7 weeks ago
Rugare • 0

Hello, I am following the below tutorial from nvidia for aligning short reads to a pangenome before variant calling with DeepVariant:

# Extract the list of paths corresponding to GRCh38
docker run --rm \
    -v $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -L > hprc-v1.0-mc-grch38-minaf.0.1.paths

# Filter paths list
grep -v _decoy hprc-v1.0-mc-grch38-minaf.0.1.paths \
    | grep -v _random \
    | grep -v chrUn_ \
    | grep -v chrEBV \
    | grep -v chrM \
    | grep -v chain_ > hprc-v1.0-mc-grch38-minaf.0.1.paths.sub

# Extract the corresponding sequences to a FASTA file
docker run --rm \
    -v $(pwd):/workdir \
    --workdir /workdir \
    quay.io/vgteam/vg:v1.59.0 \
    vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -p hprc-v1.0-mc-grch38-minaf.0.1.paths.sub -F > hprc-v1.0-mc-grch38-minaf.0.1.fa

# Index the fasta file
samtools faidx hprc-v1.0-mc-grch38-minaf.0.1.fa

I notice all the GRCh38 paths are haplotype_id #0#. I would have expected some paths to be #1# or #2#

My question is when I "Extract the corresponding sequences to a FASTA file" in the tutorial, is the output FASTA file meaningfully different to a linear copy of GRCh38 I could get for elsewhere given all haplotype_ids are #0#? If so, how?

For context, I aim to compare variants called after pangenome alignment vs linear reference alignment.

Thanks in advance! R

HPRC GRCh38 • 322 views
ADD COMMENT
1
Entering edit mode

The GRCh38 reference genome is "haploid" (or put differently: it is not a "phased assembly") so there is only 1 copy of all the chromosomes.

The PanSN spec (https://github.com/pangenome/PanSN-spec, which inspires the # namings) haplotype numbering system isn't meticulously specified so they maybe arbitrarily numbered the haplotype with 0.

My question is when I "Extract the corresponding sequences to a FASTA file" in the tutorial, is the output FASTA file meaningfully different to a linear copy of GRCh38 I could get for elsewhere given all haplotype_ids are #0#? If so, how?

I would guess that it is not meaningfully different, if different at all, to the GRCh38 linear assembly that you can get anywhere

ADD REPLY

Login before adding your answer.

Traffic: 2024 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6