Query regarding interpretation of VCF from variant calling (SVs) using vg giraffe workflow
0
0
Entering edit mode
7 weeks ago
s_135 ▴ 10

I am creating a pangenome and then trying to call variants (SVs) using short read data. I proceeded with the vg tool using vg autoindex workflow as follows (with explanation of each step):

myfa="teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa"  
myvcf="samples_delly.vcf.gz"                            
myprefix="pangenome_graph_ai"                         
echo -e "ANP1\nANP4\nPB27\nTNT17" > sample.lst

# Step 1: Autoindex the graph for alignment
vg autoindex --workflow giraffe --prefix ${myprefix} --ref-fasta ${myfa} --vcf ${myvcf} --threads 60 --tmp-dir ../tmp

# Step 2: Giraffe mapping of short reads to the graph
mkdir -p mygam 
cat sample.lst | parallel -j 5 -k "vg giraffe -Z ${myprefix}.giraffe.gbz -m ${myprefix}.min -d ${myprefix}.dist -f {}_R1_paired_trimmed.fq.gz -f {}_R2_paired_trimmed.fq.gz -t 8 -N {} > mygam/{}.gam"

# Step 3: Count read support
mkdir -p mypack 
cat sample.lst | parallel -j 5 -k "vg pack -x ${myprefix}.giraffe.gbz -t 8 -g mygam/{}.gam -o mypack/{}.pack -Q 5 -s 5"

# Step 4: Calculate snarls from the graph
vg snarls ${myprefix}.giraffe.gbz > ${myprefix}.snarls

# Step 5: Genotype the variants and create VCF files
mkdir -p myvcf  
cat sample.lst | parallel -j 10 -k "vg call ${myprefix}.giraffe.gbz -z -a -k mypack/{}.pack -s {} -t 8 -r ${myprefix}.snarls > myvcf/{}_variants.vcf"

My FASTA file snippet is as follows:

>Pseudomolecule_01
gacaaaggggaagtttgagatattttgattacacaagaacaatgaaacattaactgcatgatattagctacggcaataagaaataataaaggactgactg
ttgaagaacttcctggtcaacttgaggtattttatatgttaaacaaagtcaactttttctgttaactcagattgaacaaacacacacatgaaccatccaa
acgttggcttattaatctttggattattgcgtttacaattcttgcacttcttttcttttcttcaaaaactcattcaaaactgacatagcttcagtgaact
gagcagaatcccaagatgcctgcttattgcgctcatagtgattttcaccatcagagaggactgcagaatgtgatgtatactgactccctcgactatcctc
tggatcaattatatcagttggtaaagcccactccatctttttcttatctactaacttctgtcgcaaattcagctttttagcctcaacaatatcaccttct
tttatcagttccaattgagaaacacaattctccacctaaatcaaacacatgataTtttagctccaataacacactaccagtatgatgagtagaagTccaa
aatgagcacaataagcttatacaaactcacctctgccctgcttgctcgaaactgaaaacaatAaaaacacgttttgttcagaaggctattcagtgtatta
acaagcaatggattgtatgctggCgaaacaaggtcaatgtgaccacaatgacctgtgcaatgataAgcacgttggccacatgattgacatctgcatataa
caagatAagcatgaatacacattaatacactgcatgagatttttatccagattatatatacgtacttggATgcatcatcaaggggtcccatagccgaatc
ataaagtccgccagggattggcttgtccagtatgtcaagcaaattcggattcgtgatcttcacaacgctctggcgcctcacttctacatccgtcatgaac
ccaaaatgcacggcctccacaacctcagtcgcctaaataaaaagtgccaaacatcagaGaatagctagcaaaaaaatttattaccgcatgaaaagaatac
attaatagctaaatttgtttccaattcataatttagggatttaattccaaaaactcattcaagacaataacaaaataaacaatatcatgatgttttaaag
caacaacagcaacaacaactaaagggaaaaATACAATGAACATATTGTTTTACCATGGTTCACCAGTTGGCCCAAGGTCGGAATGGCCTAAGTATCAATA
ACTTCGAATTGCTTAGAAATAAATGACAAAAATTCAACGAGAAAAATATCAATTTTCTGAAAACTGAAGGCCAATTCGGTGcttccagCttcttgaagag
aagcttggttcccagctggtgcggtttgtaaactatacaaactggggTtttggaaaaccctatcagacaAaggtatatttGagaCagagagggggGGagg
cgcagatgggtcgggcccaccataaattagaacagtcattacctgaaaaaatcagcattataataacaaAtattctctttaaactttcctccaaattaaa
caaaaaaaaaaaaaaaatctttaaactttgcttcattttttattcttgaattacctcaaaatacacaatatatattaaataattttacaaATatatatat
atgacatatattaaaaaatattaaaataaaagatgaaatattatttgtaatagatagatggatccactgtttttcaattgtttcttgtttatttaattta

I used a VCF input from delly tool since the original VCF was not having any SVs. Its snippet is given below:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SO_11111_ANP1_SM        SO_11111_ANP4_SM        SO_11111_PB27_SM        SO_11111_TNT17_2_SM
Pseudomolecule_01       16422   DEL00000000     A       <DEL>   120     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=16490;PE=0;MAPQ=0;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=22;INSLEN=0;HOMLEN=10;SR=6;SRQ=0.972414;CON>
Pseudomolecule_01       16422   DEL00000001     A       <DEL>   291     LowQual PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=1015460;PE=5;MAPQ=14;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=15;INSLEN=0;HOMLEN=10;SR=12;SRQ=0.958824>
Pseudomolecule_01       21134   DEL00000002     T       <DEL>   300     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=21165;PE=0;MAPQ=0;CT=3to5;CIPOS=-22,22;CIEND=-22,22;SRMAPQ=60;INSLEN=0;HOMLEN=22;SR=5;SRQ=1;CONSENSUS=>
Pseudomolecule_01       24944   DEL00000003     A       <DEL>   46      LowQual IMPRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=1022920;PE=2;MAPQ=23;CT=3to5;CIPOS=-50,50;CIEND=-50,50       GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       49789   DEL00000004     G       <DEL>   175     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=49824;PE=0;MAPQ=0;CT=3to5;CIPOS=-27,27;CIEND=-27,27;SRMAPQ=39;INSLEN=0;HOMLEN=33;SR=4;SRQ=1;CONSENSUS=>
Pseudomolecule_01       56157   DEL00000005     T       <DEL>   40      LowQual IMPRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=8426671;PE=3;MAPQ=8;CT=3to5;CIPOS=-152,152;CIEND=-152,152    GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       62342   DEL00000006     T       <DEL>   179     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=63235;PE=0;MAPQ=0;CT=3to5;CIPOS=-8,8;CIEND=-8,8;SRMAPQ=30;INSLEN=0;HOMLEN=7;SR=6;SRQ=0.966667;CONSENSU>
Pseudomolecule_01       68825   DEL00000007     T       <DEL>   225     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=908487;PE=0;MAPQ=0;CT=3to5;CIPOS=-20,20;CIEND=-20,20;SRMAPQ=60;INSLEN=0;HOMLEN=19;SR=4;SRQ=0.964912;CO>
Pseudomolecule_01       86975   INV00000008     T       <INV>   15      LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1288211;PE=2;MAPQ=12;CT=5to5;CIPOS=-130,130;CIEND=-130,130   GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       91276   DEL00000009     C       <DEL>   279     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=91338;PE=0;MAPQ=0;CT=3to5;CIPOS=-3,3;CIEND=-3,3;SRMAPQ=23;INSLEN=0;HOMLEN=3;SR=10;SRQ=0.988827;CONSENS>
Pseudomolecule_01       95433   INV00000010     A       <INV>   30      LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1279605;PE=2;MAPQ=15;CT=3to3;CIPOS=-154,154;CIEND=-154,154   GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       102165  INV00000011     A       <INV>   79      LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1272672;PE=2;MAPQ=40;CT=5to5;CIPOS=-131,131;CIEND=-131,131   GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       105411  INV00000012     T       <INV>   54      LowQual PRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=18509135;PE=0;MAPQ=0;CT=3to3;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=14;INSLEN=0;HOMLEN=1;SR=4;SRQ=0.963504;CONSE>
Pseudomolecule_01       114586  DEL00000013     A       <DEL>   50      PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=114683;PE=0;MAPQ=0;CT=3to5;CIPOS=-3,3;CIEND=-3,3;SRMAPQ=21;INSLEN=0;HOMLEN=2;SR=3;SRQ=0.963964;CONSENS>
Pseudomolecule_01       116815  INV00000014     G       <INV>   77      LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1251982;PE=2;MAPQ=56;CT=5to5;CIPOS=-173,173;CIEND=-173,173   GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       132604  DEL00000015     A       <DEL>   180     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=132636;PE=0;MAPQ=0;CT=3to5;CIPOS=-5,5;CIEND=-5,5;SRMAPQ=60;INSLEN=0;HOMLEN=4;SR=3;SRQ=0.973684;CONSENS>
Pseudomolecule_01       144246  DUP00000016     A       <DUP>   142     LowQual PRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=144483;PE=0;MAPQ=0;CT=5to3;CIPOS=-6,6;CIEND=-6,6;SRMAPQ=11;INSLEN=0;HOMLEN=5;SR=5;SRQ=1;CONSENSUS=CCCC>
Pseudomolecule_01       176103  INV00000017     A       <INV>   69      LowQual IMPRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=1190779;PE=3;MAPQ=18;CT=3to3;CIPOS=-96,96;CIEND=-96,96       GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       195356  DUP00000018     C       <DUP>   519     LowQual IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=195675;PE=45;MAPQ=10;CT=5to3;CIPOS=-50,50;CIEND=-50,50       GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       197169  DUP00000019     A       <DUP>   86      PASS    IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=197459;PE=5;MAPQ=20;CT=5to3;CIPOS=-50,50;CIEND=-50,50        GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       197677  DUP00000020     A       <DUP>   50      LowQual IMPRECISE;SVTYPE=DUP;SVMETHOD=EMBL.DELLYv0.9.1;END=197992;PE=3;MAPQ=11;CT=5to3;CIPOS=-50,50;CIEND=-50,50        GT:GL:GQ:FT:RCL:RC:RCR:RDCN:DR:DV:RR:RV>
Pseudomolecule_01       199865  DEL00000021     T       <DEL>   360     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=201147;PE=0;MAPQ=0;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=60;INSLEN=0;HOMLEN=9;SR=6;SRQ=0.99375;CONS>
Pseudomolecule_01       215608  DEL00000022     A       <DEL>   420     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=215643;PE=0;MAPQ=0;CT=3to5;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=60;INSLEN=0;HOMLEN=2;SR=7;SRQ=1;CONSENSUS=CAAT>
Pseudomolecule_01       221392  DEL00000023     A       <DEL>   540     PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=221427;PE=0;MAPQ=0;CT=3to5;CIPOS=-6,6;CIEND=-6,6;SRMAPQ=60;INSLEN=0;HOMLEN=7;SR=9;SRQ=0.994505;CONSENS>
Pseudomolecule_01       256580  INV00000024     T       <INV>   368     PASS    PRECISE;SVTYPE=INV;SVMETHOD=EMBL.DELLYv0.9.1;END=7025740;PE=2;MAPQ=29;CT=5to5;CIPOS=-2,2;CIEND=-2,2;SRMAPQ=32;INSLEN=0;HOMLEN=1;SR=9;SRQ=0.982143;CONSE>
Pseudomolecule_01       266831  DEL00000025     A       <DEL>   1320    PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=276416;PE=2;MAPQ=60;CT=3to5;CIPOS=-10,10;CIEND=-10,10;SRMAPQ=60;INSLEN=0;HOMLEN=9;SR=20;SRQ=0.990244;C>
Pseudomolecule_01       282608  DEL00000026     T       <DEL>   1260    PASS    PRECISE;SVTYPE=DEL;SVMETHOD=EMBL.DELLYv0.9.1;END=296795;PE=2;MAPQ=60;CT=3to5;CIPOS=-7,7;CIEND=-7,7;SRMAPQ=60;INSLEN=0;HOMLEN=6;SR=19;SRQ=1;CONSENSUS=TG>

But the final step of calling variants using vg call is giving vcf files where the variations are same throughout difference samples (ANP1, ANP4 etc). The variant VCFs are having different file size but same variation data? Why is this so?

Edit: So, I ran vg call again changing a parameter (removing "-a, --genotype-snarls Genotype every snarl, including reference calls (use to compare multiple samples)") and got slightly different results. What do I infer from this? Which should I proceed with? Asking because I'm new to handling and interpreting VCF files.

Note: the 1st and 3rd images are from older command, 2nd and 4th are modified command. They are color coded according to their region (same region visualized from both files) enter image description here

enter image description here

enter image description here

enter image description here

pangenome giraffe vcf vg • 217 views
ADD COMMENT

Login before adding your answer.

Traffic: 1293 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6