Lots of alleles in the same place?
1
0
Entering edit mode
5.6 years ago
tbb21 ▴ 10

I have short (20 bp) sequencing reads for DNA that has been in e. coli undergoing different rates of SOS response (inducing mutations in the DNA). When I use bcftools to do variant calling of these reads, mapping them back to the ~10,000 original 20 bp long DNA sequences, I seem to get variants at particular positions but lots of identical variants at the same position. Ie. the mutations seem to all be occurring in the same positions and not distributed in any way.

What am I doing or interpreting wrong? Here is my alignment and variant calling code. In it I align with bowtie, then view and sort with SAM tools, then use bcftools mpileup and bcftools call before filtering and producing a .vcf file.

bowtie2 -p 10 --end-to-end --very-sensitive -3 1 -I 0 -X 200 -x <bowtie2 index file path> -1 <1st paired end read data> -2 <2nd paired read data> | samtools view -bSu - | samtools sort - | bcftools mpileup -d 4000000 -q 0 -Q 20 -Ou -f AM_FullLibrary.fa - --threads 10 | bcftools call -Ou -m -v | bcftools filter -Ov -s LowQual -e '%QUAL<20' > temp.vcf

Opening my VCF file in python using Pandas, I first remove anything that didnt pass the filter and then look at the DP4 information field and sum up the numbers in the 3rd and 4th entries which should correspond to non-reference alignments at this position right?

When I do this I will get that reference DNA example 'a' had a substitution at position 18 from A to T, and the INFO column has:

"DP=50;VDB=6.99472e-07;SGB=-0.670168;RPB=1;MQB=0.458935;BQB=0.000592503;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=15,0,10,0;MQ=18'"

So here I interpret this as there being that: out of all my sequencing data, there were 15 reads that passed the filter (first entry in DP4) that were reference aligned (no mutations) at this position. And then there were 10 reads that had the substitution at this position? But for this DNA example 'a' the only variant I am getting is at this position! Surely the mutations for example 'a' should be more uniformly distributed rather than all 10 occurring in the same position?

Any help or insight is most appreciated!

variants alleles snps bcftools • 1.3k views
ADD COMMENT
0
Entering edit mode

In addition, some of the DP4 entries look like: 0,0,8,0 does this mean that at this position none of the non mutant alignments were of a high enough quality to pass the filter and only variants were?

ADD REPLY
0
Entering edit mode
5.6 years ago

Correct, there are:

  • 15 'high quality' forward reads with the REF allele
  • 10 'high quality' forward reads with the ALT allele

The total position read depth (DP) is 50, meaning that 50 - (15 + 10) = 25 reads were not included during variant calling due to 'low quality'.

Here is the definition of DP4 from SAMtools:

Number of 1) forward ref alleles; 2) reverse ref; 3) forward non-ref; 4) reverse non-ref alleles, used in variant calling. Sum can be smaller than DP because low-quality bases are not counted.

I note that the mapping quality (MAPQ / MQ) for that position is quite low, with a value of 18, which equates to ~16 in 1000 chance of being mis-alignment.

Check your VCF header to see to what all of the other tags relate.

-------------------------------

0,0,8,0 implies (to me) a homozygous variant call. You should check the value of DP for this position, too, and view the alignments in your genome viewer of choice (e.g. IGV).

Kevin

ADD COMMENT

Login before adding your answer.

Traffic: 1512 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6