I don't understand bcftools call behaviour
0
0
Entering edit mode
9.3 years ago
Marvin ▴ 220

I have an artificial ref file against which I have mapped artificial reads to understand some things about VCF, calls etc. ...


The reference file that looks like this:

>myseq_contig1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNCCTCGGAAACTCTTAATGGCACATAAACGACCGAATCAATGACTCACCTT
TGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCT
CAGACCACTTCGGACCGAGATGCTTCGCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

This ref sequence is 270 in length. The first 70 sites of the reference are "N". Then there's a sequence (140 * [ACGT]). Then again 60 "N".


This is the SAM file of the mapping:

left_1 0 myseq_contig1 1 60 100M * 0 0 ACGTACGACTGACTGAGCGGGCGCATCTATTGCATTCGATCGGAACGATCGATCGACAGCGACTAGCATCCCTCGGAAACTCTTAATGGCACATAAACGA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:30 AS:i:30 XS:i:0
left_2 0 myseq_contig1 1 60 100M * 0 0 ACGTACGACTGACTGAGCGGGCGAATCTATTGCATTCGATTGGAACGATCGATCGAGAGCGACTAGCATCCCTCGGAAACTCTTAATGGCACATAAACGA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:30 AS:i:30 XS:i:0
read1_101 0 myseq_contig1 71 60 100M * 0 0 CCTCGGAAACTCTTAATGGCACATAAACGACCGAATCAATGACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAA JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
read21_121 0 myseq_contig1 91 60 100M * 0 0 ACATAAACGACCGAATCAATGACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCTCAGACCACTT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
read41_141 0 myseq_contig1 111 60 100M * 0 0 GACTCACCTTTGTCCGAGTCTCTAACCGTTTAAGGTGAATTAACATACCGGGGAGGTGAATCCAAGGTCTCAGACCACTTCGGACCGAGATGCTTCGCAT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:100 AS:i:100 XS:i:0
right1 0 myseq_contig1 171 60 100M * 0 0 TCCAAGGTCTCAGACCACTTCGGACCGAGATGCTTCGCATCCGAGTATCTCGGCAAAGTTATAGGCGTGGGGGCTACGCTCCAAGGCGTTATTTACTTAT JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 MD:Z:40 AS:i:40 XS:i:0

I have 3 reads that form a contig at the first non-"N" position in the reference, their read names are read1_101, read21_121 and read41_141.

The reads left_1 and left_2 both start at the first "N". They also both perfectly match the first 30 non-"N"-sites in the reference, i.e. they both span the "N"-region at the beginning and then reach into the non-"N"-sequence-part.

The last read (right1) starts at position 171, which is 40 positions before the trailing "N"-region. It reaches until the end, i.e. until the last "N" in the reference.


Now I am confused about the output of this command:

samtools view -h mapping.sam | samtools mpileup -vf ref.fasta - | bcftools call -Mc -Oz - | bcftools view | less -S

The output looks like this (cut-out because my post is too long):

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    -
myseq_contig1    211    .    N    C    6.98265    .    DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9912    GT:PL    1/1:36,3,0
myseq_contig1    212    .    N    C    4.77219    .    DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9923    GT:PL    0/1:33,3,0
myseq_contig1    213    .    N    G    3.54557    .    DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9935    GT:PL    0/1:31,3,0
myseq_contig1    214    .    N    A    3.01618    .    DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9944    GT:PL    0/1:30,3,0
myseq_contig1    215    .    N    .    3.53256    .    DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.9956    GT:PL    0/0:29
myseq_contig1    216    .    N    .    4.11733    .    DP=1;SGB=-0.379885;MQ0F=0;AF1=1;AC1=2;DP4=0,0,1,0;MQ=60;FQ=-29.997    GT:PL    0/0:28

Now I see the following genotypes (last column, column header = "-") at the respective reference-sites:

211: 1/1
212: 0/1
213: 0/1
214: 0/1

211 is where the "N" region starts at the end of the reference. So here he calls the variant 1.

But from position 215 onwards he starts to call 0/0 (which is reference which is "N") until the end.

But positions 215 until end vs positions 211-214 share the same condition: reference is "N" and there is one read (namely "right1") that covers the position. Why does he make different decisions when the conditions are the same?

call bcftools • 2.7k views
ADD COMMENT

Login before adding your answer.

Traffic: 1988 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6