Entering edit mode
7 weeks ago
Rugare
•
0
Hello, I am following the below tutorial from nvidia for aligning short reads to a pangenome before variant calling with DeepVariant:
# Extract the list of paths corresponding to GRCh38
docker run --rm \
-v $(pwd):/workdir \
--workdir /workdir \
quay.io/vgteam/vg:v1.59.0 \
vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -L > hprc-v1.0-mc-grch38-minaf.0.1.paths
# Filter paths list
grep -v _decoy hprc-v1.0-mc-grch38-minaf.0.1.paths \
| grep -v _random \
| grep -v chrUn_ \
| grep -v chrEBV \
| grep -v chrM \
| grep -v chain_ > hprc-v1.0-mc-grch38-minaf.0.1.paths.sub
# Extract the corresponding sequences to a FASTA file
docker run --rm \
-v $(pwd):/workdir \
--workdir /workdir \
quay.io/vgteam/vg:v1.59.0 \
vg paths -x hprc-v1.0-mc-grch38-minaf.0.1.xg -p hprc-v1.0-mc-grch38-minaf.0.1.paths.sub -F > hprc-v1.0-mc-grch38-minaf.0.1.fa
# Index the fasta file
samtools faidx hprc-v1.0-mc-grch38-minaf.0.1.fa
I notice all the GRCh38 paths are haplotype_id #0#. I would have expected some paths to be #1# or #2#
My question is when I "Extract the corresponding sequences to a FASTA file" in the tutorial, is the output FASTA file meaningfully different to a linear copy of GRCh38 I could get for elsewhere given all haplotype_ids are #0#? If so, how?
For context, I aim to compare variants called after pangenome alignment vs linear reference alignment.
Thanks in advance! R
The GRCh38 reference genome is "haploid" (or put differently: it is not a "phased assembly") so there is only 1 copy of all the chromosomes.
The PanSN spec (https://github.com/pangenome/PanSN-spec, which inspires the # namings) haplotype numbering system isn't meticulously specified so they maybe arbitrarily numbered the haplotype with 0.
I would guess that it is not meaningfully different, if different at all, to the GRCh38 linear assembly that you can get anywhere