I am creating a pangenome and then trying to call variants (SVs) using short read data. I proceeded with the vg tool using vg autoindex
workflow as follows (with explanation of each step):
myfa="teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa"
myvcf="samples_delly.vcf.gz"
myprefix="pangenome_graph_ai"
echo -e "ANP1\nANP4\nPB27\nTNT17" > sample.lst
# Step 1: Autoindex the graph for alignment
vg autoindex --workflow giraffe --prefix ${myprefix} --ref-fasta ${myfa} --vcf ${myvcf} --threads 60 --tmp-dir ../tmp
# Step 2: Giraffe mapping of short reads to the graph
mkdir -p mygam
cat sample.lst | parallel -j 5 -k "vg giraffe -Z ${myprefix}.giraffe.gbz -m ${myprefix}.min -d ${myprefix}.dist -f {}_R1_paired_trimmed.fq.gz -f {}_R2_paired_trimmed.fq.gz -t 8 -N {} > mygam/{}.gam"
# Step 3: Count read support
mkdir -p mypack
cat sample.lst | parallel -j 5 -k "vg pack -x ${myprefix}.giraffe.gbz -t 8 -g mygam/{}.gam -o mypack/{}.pack -Q 5 -s 5"
# Step 4: Calculate snarls from the graph
vg snarls ${myprefix}.giraffe.gbz > ${myprefix}.snarls
# Step 5: Genotype the variants and create VCF files
mkdir -p myvcf
cat sample.lst | parallel -j 10 -k "vg call ${myprefix}.giraffe.gbz -z -a -k mypack/{}.pack -s {} -t 8 -r ${myprefix}.snarls > myvcf/{}_variants.vcf"
My FASTA file snippet is as follows:
>Pseudomolecule_01
gacaaaggggaagtttgagatattttgattacacaagaacaatgaaacattaactgcatgatattagctacggcaataagaaataataaaggactgactg
ttgaagaacttcctggtcaacttgaggtattttatatgttaaacaaagtcaactttttctgttaactcagattgaacaaacacacacatgaaccatccaa
acgttggcttattaatctttggattattgcgtttacaattcttgcacttcttttcttttcttcaaaaactcattcaaaactgacatagcttcagtgaact
gagcagaatcccaagatgcctgcttattgcgctcatagtgattttcaccatcagagaggactgcagaatgtgatgtatactgactccctcgactatcctc
tggatcaattatatcagttggtaaagcccactccatctttttcttatctactaacttctgtcgcaaattcagctttttagcctcaacaatatcaccttct
tttatcagttccaattgagaaacacaattctccacctaaatcaaacacatgataTtttagctccaataacacactaccagtatgatgagtagaagTccaa
aatgagcacaataagcttatacaaactcacctctgccctgcttgctcgaaactgaaaacaatAaaaacacgttttgttcagaaggctattcagtgtatta
acaagcaatggattgtatgctggCgaaacaaggtcaatgtgaccacaatgacctgtgcaatgataAgcacgttggccacatgattgacatctgcatataa
caagatAagcatgaatacacattaatacactgcatgagatttttatccagattatatatacgtacttggATgcatcatcaaggggtcccatagccgaatc
ataaagtccgccagggattggcttgtccagtatgtcaagcaaattcggattcgtgatcttcacaacgctctggcgcctcacttctacatccgtcatgaac
ccaaaatgcacggcctccacaacctcagtcgcctaaataaaaagtgccaaacatcagaGaatagctagcaaaaaaatttattaccgcatgaaaagaatac
attaatagctaaatttgtttccaattcataatttagggatttaattccaaaaactcattcaagacaataacaaaataaacaatatcatgatgttttaaag
caacaacagcaacaacaactaaagggaaaaATACAATGAACATATTGTTTTACCATGGTTCACCAGTTGGCCCAAGGTCGGAATGGCCTAAGTATCAATA
ACTTCGAATTGCTTAGAAATAAATGACAAAAATTCAACGAGAAAAATATCAATTTTCTGAAAACTGAAGGCCAATTCGGTGcttccagCttcttgaagag
aagcttggttcccagctggtgcggtttgtaaactatacaaactggggTtttggaaaaccctatcagacaAaggtatatttGagaCagagagggggGGagg
cgcagatgggtcgggcccaccataaattagaacagtcattacctgaaaaaatcagcattataataacaaAtattctctttaaactttcctccaaattaaa
caaaaaaaaaaaaaaaatctttaaactttgcttcattttttattcttgaattacctcaaaatacacaatatatattaaataattttacaaATatatatat
atgacatatattaaaaaatattaaaataaaagatgaaatattatttgtaatagatagatggatccactgtttttcaattgtttcttgtttatttaattta
I used a VCF input from delly tool since the original VCF was not having any SVs. Its snippet is given below:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SO_11111_ANP1_SM SO_11111_ANP4_SM SO_11111_PB27_SM SO_11111_TNT17_2_SM
Pseudomolecule_01 16422 DEL00000000 A <DEL> 120 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=16490;PE=0;MAPQ=0;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=22;INSLEN=0;HOMLEN=10;SR=6;SRQ=0.972414;CON>
Pseudomolecule_01 16422 DEL00000001 A <DEL> 291 LowQual PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=1015460;PE=5;MAPQ=14;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=15;INSLEN=0;HOMLEN=10;SR=12;SRQ=0.958824>
Pseudomolecule_01 21134 DEL00000002 T <DEL> 300 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=21165;PE=0;MAPQ=0;CT=3to5;CIPOS=-22,22;CIEND=-22,22;SRMAPQ=60;INSLEN=0;HOMLEN=22;SR=5;SRQ=1;CONSENSUS=>
Pseudomolecule_01 24944 DEL00000003 A <DEL> 46 LowQual IMPRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=1022920;PE=2;MAPQ=23;CT=3to5;CIPOS=-50,50;CIEND=-50,50 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 49789 DEL00000004 G <DEL> 175 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=49824;PE=0;MAPQ=0;CT=3to5;CIPOS=-27,27;CIEND=-27,27;SRMAPQ=39;INSLEN=0;HOMLEN=33;SR=4;SRQ=1;CONSENSUS=>
Pseudomolecule_01 56157 DEL00000005 T <DEL> 40 LowQual IMPRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=8426671;PE=3;MAPQ=8;CT=3to5;CIPOS=-152,152;CIEND=-152,152 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 62342 DEL00000006 T <DEL> 179 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=63235;PE=0;MAPQ=0;CT=3to5;CIPOS=-8,8;CIEND=-8,8;SRMAPQ=30;INSLEN=0;HOMLEN=7;SR=6;SRQ=0.966667;CONSENSU>
Pseudomolecule_01 68825 DEL00000007 T <DEL> 225 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=908487;PE=0;MAPQ=0;CT=3to5;CIPOS=-20,20;CIEND=-20,20;SRMAPQ=60;INSLEN=0;HOMLEN=19;SR=4;SRQ=0.964912;CO>
Pseudomolecule_01 86975 INV00000008 T <INV> 15 LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1288211;PE=2;MAPQ=12;CT=5to5;CIPOS=-130,130;CIEND=-130,130 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 91276 DEL00000009 C <DEL> 279 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=91338;PE=0;MAPQ=0;CT=3to5;CIPOS=-3,3;CIEND=-3,3;SRMAPQ=23;INSLEN=0;HOMLEN=3;SR=10;SRQ=0.988827;CONSENS>
Pseudomolecule_01 95433 INV00000010 A <INV> 30 LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1279605;PE=2;MAPQ=15;CT=3to3;CIPOS=-154,154;CIEND=-154,154 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 102165 INV00000011 A <INV> 79 LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1272672;PE=2;MAPQ=40;CT=5to5;CIPOS=-131,131;CIEND=-131,131 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 105411 INV00000012 T <INV> 54 LowQual PRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=18509135;PE=0;MAPQ=0;CT=3to3;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=14;INSLEN=0;HOMLEN=1;SR=4;SRQ=0.963504;CONSE>
Pseudomolecule_01 114586 DEL00000013 A <DEL> 50 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=114683;PE=0;MAPQ=0;CT=3to5;CIPOS=-3,3;CIEND=-3,3;SRMAPQ=21;INSLEN=0;HOMLEN=2;SR=3;SRQ=0.963964;CONSENS>
Pseudomolecule_01 116815 INV00000014 G <INV> 77 LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1251982;PE=2;MAPQ=56;CT=5to5;CIPOS=-173,173;CIEND=-173,173 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 132604 DEL00000015 A <DEL> 180 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=132636;PE=0;MAPQ=0;CT=3to5;CIPOS=-5,5;CIEND=-5,5;SRMAPQ=60;INSLEN=0;HOMLEN=4;SR=3;SRQ=0.973684;CONSENS>
Pseudomolecule_01 144246 DUP00000016 A <DUP> 142 LowQual PRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=144483;PE=0;MAPQ=0;CT=5to3;CIPOS=-6,6;CIEND=-6,6;SRMAPQ=11;INSLEN=0;HOMLEN=5;SR=5;SRQ=1;CONSENSUS=CCCC>
Pseudomolecule_01 176103 INV00000017 A <INV> 69 LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1190779;PE=3;MAPQ=18;CT=3to3;CIPOS=-96,96;CIEND=-96,96 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 195356 DUP00000018 C <DUP> 519 LowQual IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=195675;PE=45;MAPQ=10;CT=5to3;CIPOS=-50,50;CIEND=-50,50 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 197169 DUP00000019 A <DUP> 86 PASS IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=197459;PE=5;MAPQ=20;CT=5to3;CIPOS=-50,50;CIEND=-50,50 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 197677 DUP00000020 A <DUP> 50 LowQual IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=197992;PE=3;MAPQ=11;CT=5to3;CIPOS=-50,50;CIEND=-50,50 GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01 199865 DEL00000021 T <DEL> 360 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=201147;PE=0;MAPQ=0;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=60;INSLEN=0;HOMLEN=9;SR=6;SRQ=0.99375;CONS>
Pseudomolecule_01 215608 DEL00000022 A <DEL> 420 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=215643;PE=0;MAPQ=0;CT=3to5;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=60;INSLEN=0;HOMLEN=2;SR=7;SRQ=1;CONSENSUS=CAAT>
Pseudomolecule_01 221392 DEL00000023 A <DEL> 540 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=221427;PE=0;MAPQ=0;CT=3to5;CIPOS=-6,6;CIEND=-6,6;SRMAPQ=60;INSLEN=0;HOMLEN=7;SR=9;SRQ=0.994505;CONSENS>
Pseudomolecule_01 256580 INV00000024 T <INV> 368 PASS PRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=7025740;PE=2;MAPQ=29;CT=5to5;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=32;INSLEN=0;HOMLEN=1;SR=9;SRQ=0.982143;CONSE>
Pseudomolecule_01 266831 DEL00000025 A <DEL> 1320 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=276416;PE=2;MAPQ=60;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=60;INSLEN=0;HOMLEN=9;SR=20;SRQ=0.990244;C>
Pseudomolecule_01 282608 DEL00000026 T <DEL> 1260 PASS PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=296795;PE=2;MAPQ=60;CT=3to5;CIPOS=-7,7;CIEND=-7,7;SRMAPQ=60;INSLEN=0;HOMLEN=6;SR=19;SRQ=1;CONSENSUS=TG>
But the final step of calling variants using vg call
is giving vcf files where the variations are same throughout difference samples (ANP1, ANP4 etc). The variant VCFs are having different file size but same variation data? Why is this so?
Edit:
So, I ran vg call again
changing a parameter (removing "-a, --genotype-snarls Genotype every snarl, including reference calls (use to compare multiple samples)") and got slightly different results. What do I infer from this? Which should I proceed with? Asking because I'm new to handling and interpreting VCF files.
Note: the 1st and 3rd images are from older command, 2nd and 4th are modified command. They are color coded according to their region (same region visualized from both files)