Hi All,
I have MiSeq PE250 reads for a viral vector sample. I trimmed the raw reads with Trim Galore (length >= 200 and Q >= 30). I aligned these reads using both BWA MEM and Bowtie2 --local, and found an AAA -> TTT variant in both alignment files at very different frequencies: 9% and 71%, respectively. I further analysed the reads containing the variant and found that the TTT bases in those reads had base call quality scores below 30.
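For reference, the two alignments would have been along these lines (a sketch only; vector.fa, the index name, and the output file names are placeholders):

```
bwa index vector.fa
bwa mem vector.fa trimmed.R1.fq trimmed.R2.fq > bwa.sam

bowtie2-build vector.fa vector_idx
bowtie2 --local -x vector_idx -1 trimmed.R1.fq -2 trimmed.R2.fq -S bowtie2.sam
```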
To check the effect of trimming on the variant call and on coverage, I further filtered the Trim Galore trimmed reads to remove every read that had even a single base call with quality below 30. I re-paired these reads with BBMap repair.sh and aligned them again using Bowtie2. In this alignment the above variant was found at <10% frequency, which is consistent with the variant frequency reported for BWA-MEM. Coverage was also affected: almost 6.7% of the bases had coverage below 10X for the super-trimmed reads vs 0.1% for the reads trimmed with Trim Galore only.
I used FreeBayes for variant calling with the same parameters (Min Coverage 10, Min Alternate Fraction 0.01, Min Alternate Count 4) for both the BWA and Bowtie2 aligned reads.
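The FreeBayes invocation would have looked something like this (a sketch only; vector.fa and the sorted, indexed BAM name are placeholders):

```
freebayes -f vector.fa \
    --min-coverage 10 \
    --min-alternate-fraction 0.01 \
    --min-alternate-count 4 \
    bwa.sorted.bam > bwa.vcf
```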
- Why was there such a large difference in variant frequency between Bowtie2 and BWA MEM?
- The coverage difference between the Trim Galore trimmed reads and the super-trimmed reads is vast. Should I still use the super-trimmed reads (Trim Galore + further filtering), given that they yield the correct variant frequency?
The FastQC results for the Trim Galore trimmed reads and the super-trimmed reads were as follows:
[FastQC screenshot: Super Trimmed Reads]
The FastQC per base sequence quality module fails for the super-trimmed reads, and the report indicates the wrong Illumina Phred score encoding version. Why did this happen?
Thanks!
That is a strange result. It looks like you took out all of the good data and left the bad data behind?? Trimming at Q30 or more is pretty severe. Can you not filter your SNPs afterwards if you want to be stringent?
Hi Genomax,
I checked again to confirm that only reads in which every base has quality greater than 30 were selected, so I am confused about the FastQC results as well. I understand the trimming is severe, but the less stringent trimming leads to a false variant call at a much higher frequency.
Can you please point me to any peer-reviewed papers that perform NGS analysis of viral vector samples, especially SNP/INDEL analysis? This would help me streamline the analysis process.
You probably filtered out so many reads with the super trimming that FastQC could not accurately detect the encoding. I am not even sure why you are trimming your reads twice; you are likely throwing away too much information.
You should trim once only, and it is fine to set cut-offs like the following:
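For example (an illustrative sketch only; Q20 and a 100 bp length floor are placeholder values, not a prescription):

```
trim_galore --paired --quality 20 --length 100 raw.R1.fq raw.R2.fq
```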
NB - BWA mem is designed for reads >70bp.

What's the command you used for trimming?
I first performed trimming with Trim Galore.
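A sketch of that step (placeholder input names; --quality and --length match the cut-offs stated above):

```
trim_galore --paired --quality 30 --length 200 raw.R1.fq raw.R2.fq
```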
Then I filtered these reads, applying the command from the following answer to trimmed.R1.fq and trimmed.R2.fq:
https://stackoverflow.com/questions/55459982/remove-reads-that-have-low-quality-base-calls/55461995#55461995
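The gist of that filter is a whole-read quality screen, sketched here (assumes Phred+33 encoding, where Q30 corresponds to the character '?'; placeholder file names, run once per mate file):

```
awk 'NR % 4 == 1 { h = $0 }   # header
     NR % 4 == 2 { s = $0 }   # sequence
     NR % 4 == 3 { p = $0 }   # separator
     NR % 4 == 0 {            # quality line: keep the read only if every base is >= Q30
         keep = 1
         for (i = 1; i <= length($0); i++)
             if (substr($0, i, 1) < "?") { keep = 0; break }
         if (keep) print h "\n" s "\n" p "\n" $0
     }' trimmed.R1.fq > supertrimmed.R1.fq
```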
I used repair.sh from BBMap to re-pair the filtered reads.
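Something like this (placeholder file names; outs= collects reads whose mate was removed):

```
repair.sh in1=supertrimmed.R1.fq in2=supertrimmed.R2.fq \
    out1=repaired.R1.fq out2=repaired.R2.fq outs=singletons.fq
```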
This filtering removed about a quarter of the Trim Galore trimmed reads.
I am not sure why you don't do all of this within the BBMap suite. You can trim and align while keeping all reads properly paired without ever leaving the suite, then filter your SNPs afterwards at whatever level you wish. Only an outline command is provided below; look at the bbduk and bbmap guides for additional assistance.
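A minimal sketch of that pipeline (placeholder names; trimq=20 and minlen=100 are example cut-offs, not a prescription):

```
# Trim once, keeping the pairs synchronized
bbduk.sh in1=raw.R1.fq in2=raw.R2.fq \
    out1=trimmed.R1.fq out2=trimmed.R2.fq \
    qtrim=rl trimq=20 minlen=100

# Align; the pairs stay together throughout
bbmap.sh ref=vector.fa in1=trimmed.R1.fq in2=trimmed.R2.fq out=mapped.sam
```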