Hi there,
I have some 16 S Amplicon seq data and sequenced roughly 460-480 bp amplicons using 2x300 MiSeq reads. Usually, I use flash2 to merge the reads after quality trimming. However, recently I stumbled across this protocol from Fred Mahé, which suggests to merge reads before trimming. This sort of made sense to me as trimming effects mostly the overlapping ends. However, the difference surprised me nevertheless, and in some cases doubled the number of reads that were merged.
Now I'm trying to understand some pieces of the puzzle in more detail - one is this example (end of R1 on top, R2 below):
AAAGAGGAATTGCTCCAGGAGGGGTCTGCGTCCGATTAGCTGGATGGTGGGGTAAGGGCCTACCATGGCGATGATCGGTAGCTGGTCGGAGAGGATGAGCAGCC
AGGAATTGCTCCAGGAGGGGTCGGCGTCCGATTAGCGGGATGGTGGGGTAAAGGCCTACCATGGCGATGATCGGTAGCTGGTCTGAGAGGATGATCAGCCACA
^
Where the respective quality string are (Illumina 1.8+ Phred+33)
EGGGGGGFFD?<FFGFGGGGGGGGGGFGGGCEDDCE5CFFGFF8FDFC?EEG@DF*6>3:)23>79FB?FFB:?4631?BFBFGF(-0435(.48:?0)-<C?0
########################################??>::4360*).02?C4@7:5(1?79FD83<D@=36GGGDGFF:;22)AGFGFF;EF>GAC=AE
^
Based on typical Phred+33 scores, the .
is better than *
at the exact position, so in untrimmed sequences I get an A
call at the position - when merging the trimmed reads, I removed the R2 end because of it's general low quality and hence retain the G
. This sequence section covers a hypervariable region and single nucleotide differences in the right position can lead to drastic changes in the classification.
Would you trust the higher quality single base call in a lower quality stretch more than the low quality base call in an otherwise higher quality region?
As I never done this kind of process, this will just be my personal advise :
I would take a look at the neighbour qualities of your base of interest (5 bases on each side)
R1 got a better quality mean overall but the machine is really not sure about this 'G' whereas it calls with high score the neighbouring bases. I would not trust this base for classification
R2 got heterogeous scores from good to very bad call, your 'A' base is 13 Q-score which is not that good 1 in 30 to be a bad call...
I won't believe either of the two bases for the classification. Anyhow I guess your classification is mostly based on base qualities so the soft will not used these bad bases.
hm, matches my gut feeling.
unfortunately not - it's based on the consensus sequence. I convert to fasta. Maybe I include a conversion to ambiguous nucleotides for low quality sequences before conversion to fasta
Transform base to N if both quality below a specific threshold could be an idea