Hello,
I guess this is more of a statistical question; if it violates the rules here, I can also post it somewhere else.
So I am doing amplicon-seq analysis with Illumina paired-end reads. The issue I have is that the reverse reads get really short (about 25% of the original length) when trimmed from the ends with a minimum phred score of 30, so the read pairs barely overlap any more.
When trimming with a minimum phred score of 25, the problem goes away.
Also, I am looking for SNPs.
What I am really asking is: how do I calculate a sufficient (whatever that means?) sequencing depth to distinguish SNPs from sequencing errors, given the change in the minimum phred score?
For example, a minimum phred score of 25 means the base-calling error rate is at most about 0.3% per base. What sequencing depth do I need so that a SNP is statistically distinguishable from a sequencing error?
Another solution would be to only keep SNPs where the alternative allele frequency is above 0.3% and say that a depth of 100 at that position should be all right (I think in a publication it won't be asked, but I just want to understand it for myself).
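To make it concrete, this is roughly the kind of calculation I have in mind (just a toy binomial error model I put together, not taken from any tool; the alpha cutoff and the e/3 assumption for a specific wrong base are my own choices):

```python
# Toy binomial model: how many alternative-allele reads at a given depth
# are unlikely to come from sequencing error alone?
from scipy.stats import binom

def phred_to_error(q):
    """Convert a phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)

def min_alt_count(depth, q, alpha=0.01):
    """Smallest alt-allele count at `depth` that is unlikely (p < alpha)
    to arise from errors alone, assuming independent errors and that a
    specific wrong base is called with probability e/3 (my assumption)."""
    e = phred_to_error(q) / 3
    for k in range(1, depth + 1):
        p = binom.sf(k - 1, depth, e)  # P(at least k errors among `depth` reads)
        if p < alpha:
            return k, p
    return None

print(phred_to_error(25))      # ~0.0032, i.e. the ~0.3% I mentioned above
print(min_alt_count(100, 25))  # alt reads needed at 100x before error alone is implausible
```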
Thanks a lot
P.S. I would be happy if you could share any literature about this :)
From a paired-end perspective, you generally do not want the two reads from the same DNA fragment to overlap, because you are just duplicating information at those positions and wasting reads. Overlap in a paired-end data set usually happens when the insert size is shorter than the combined length of the two reads; ideally the insert should be longer than that so this does not happen. For SNP discovery, 30x depth of coverage is a good target, but you can also settle for less, say 15x or 20x (don't go as low as 10x if you are working with a novel organism).
From a SNP discovery perspective, you can skip quality trimming (unless you have a very low mapping rate) and only remove adapters. To distinguish real SNPs (or single nucleotide variants/SNVs) from sequencing errors, variant callers take the base quality of each base into account (along with the mapping quality of the read itself) and give you a phred-scaled variant quality score plus a phred-scaled genotype quality score, which you should use to prioritise variants of interest. This is of course a superficial description, and variant filtration is a fairly complex task in itself. Note also that the caller will not use bases with a quality below its minimum base quality threshold (usually around 10 or 20, but it can be changed) when calling variants.
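As a very rough sketch of the idea (this is not how any particular caller is implemented, just the textbook likelihood model where each base contributes an error probability derived from its phred score):

```python
# Toy genotype likelihoods for a biallelic site, using per-base qualities.
# Reads are treated as independent; mapping quality is ignored here.
import math

def phred_to_error(q):
    return 10 ** (-q / 10)

def genotype_log10_likelihoods(bases, quals, ref, alt):
    """Log10 likelihoods of hom-ref, het, hom-alt for one position."""
    ll = {"hom_ref": 0.0, "het": 0.0, "hom_alt": 0.0}
    for b, q in zip(bases, quals):
        e = phred_to_error(q)
        p_ref = (1 - e) if b == ref else e / 3  # P(observed base | true allele = ref)
        p_alt = (1 - e) if b == alt else e / 3  # P(observed base | true allele = alt)
        ll["hom_ref"] += math.log10(p_ref)
        ll["het"]     += math.log10(0.5 * p_ref + 0.5 * p_alt)
        ll["hom_alt"] += math.log10(p_alt)
    return ll

# 10 reads at Q25: 8 support the reference, 2 support the alternative.
print(genotype_log10_likelihoods("AAAAAAAAGG", [25] * 10, ref="A", alt="G"))
```

The gap between the best and second-best likelihood is essentially what gets phred-scaled into the genotype quality, which is why higher base qualities and higher depth both make the call more confident.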