I have an artificial ref file against which I have mapped artificial reads to understand some things about VCF, calls etc. ...
The reference file that looks like this:
>myseq_contig1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNCCTCGGAAACTCTTAATGGCACATAAACGACCGAATCAATGACTCACCTT
TGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCT
CAGACCACTTCGGACCGAGATGCTTCGCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
This ref sequence is 270 in length. The first 70 sites of the reference are "N". Then there's a sequence (140 * [ACGT]). Then again 60 "N".
This is the SAM file of the mapping:
left_1 0 myseq_contig1 1 60 100M * 0 0 ACGTACGACTGACTGAGCGGGCGCATCTATTGCATTCGATCGGAACGATCGATCGACAGCGACTAGCATCCCTCGGAAACTCTTAATGGCACATAAACGA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:30 AS:i:30 XS:i:0
left_2 0 myseq_contig1 1 60 100M * 0 0 ACGTACGACTGACTGAGCGGGCGAATCTATTGCATTCGATTGGAACGATCGATCGAGAGCGACTAGCATCCCTCGGAAACTCTTAATGGCACATAAACGA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:30 AS:i:30 XS:i:0
read1_101 0 myseq_contig1 71 60 100M * 0 0 CCTCGGAAACTCTTAATGGCACATAAACGACCGAATCAATGACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
read21_121 0 myseq_contig1 91 60 100M * 0 0 ACATAAACGACCGAATCAATGACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCTCAGACCACTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
read41_141 0 myseq_contig1 111 60 100M * 0 0 GACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCTCAGACCACTTCGGACCGAGATGCTTCGCAT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
right1 0 myseq_contig1 171 60 100M * 0 0 TCCAAGGTCTCAGACCACTTCGGACCGAGATGCTTCGCATCCGAGTATCTCGGCAAAGTTATAGGCGTGGGGGCTACGCTCCAAGGCGTTATTTACTTAT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:40 AS:i:40 XS:i:0
I have 3 reads that form a contig at the first non-"N" position in the reference, their read names are read1_101
, read21_121
and read41_141
.
The reads left_1
and left_2
both start at the first "N". They also both perfectly match the first 30 non-"N"-sites in the reference, i.e. they both span the "N"-region at the beginning and then reach into the non-"N"-sequence-part.
The last read (right1) starts at position 171, which is 40 positions before the trailing "N"-region. It reaches until the end, i.e. until the last "N" in the reference.
Now I am confused about the output of this command:
samtools view -h mapping.sam | samtools mpileup -vf ref.fasta - | bcftools call -Mc -Oz - | bcftools view | less -S
The output looks like this (cut-out because my post is too long):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT -
myseq_contig1 211 . N C 6.98265 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9912 GT:PL 1/1:36,3,0
myseq_contig1 212 . N C 4.77219 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9923 GT:PL 0/1:33,3,0
myseq_contig1 213 . N G 3.54557 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9935 GT:PL 0/1:31,3,0
myseq_contig1 214 . N A 3.01618 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9944 GT:PL 0/1:30,3,0
myseq_contig1 215 . N . 3.53256 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9956 GT:PL 0/0:29
myseq_contig1 216 . N . 4.11733 . DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.997 GT:PL 0/0:28
Now I see the following genotypes (last column, column header = "-") at the respective reference-sites:
211: 1/1
212: 0/1
213: 0/1
214: 0/1
211 is where the "N" region starts at the end of the reference. So here he calls the variant 1.
But from position 215 onwards he starts to call 0/0 (which is reference which is "N") until the end.
But positions 215 until end vs positions 211-214 share the same condition: reference is "N" and there is one read (namely "right1") that covers the position. Why does he make different decisions when the conditions are the same?