After aligning paired-end DNA reads to the human genome, what filters should generally be applied to the alignments?
I know that duplicates should be removed, as they are likely PCR artifacts (this is possible with SAMtools).
Can anyone outline other important criteria to filter on and perhaps suggest thresholds? I have heard that reads with more than one alignment are sometimes removed, but I am not sure whether that is overkill.
If it helps, I plan to align 100 bp PE Illumina reads to the human genome using BFAST with the 10 indexes recommended in the publication. My application is targeted sequencing of approximately 1000 genes, followed by SNP/indel analysis to look for association with a given phenotype.
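To make the question concrete, here is a minimal sketch of the kind of filter I have in mind, written against plain-text SAM records. The function names (`keep_alignment`, `filter_sam`), the MAPQ cutoff of 20, and the proper-pair requirement are my own assumptions for illustration, not recommendations from any publication. It also assumes duplicates have already been marked (FLAG bit 0x400) by an upstream tool, and that the aligner gives low MAPQ to ambiguously placed reads, which I have not verified for BFAST.

```python
# SAM FLAG bits, as defined in the SAM format specification.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
QC_FAIL = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800

# Any of these bits set means the record is dropped.
EXCLUDE_MASK = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE | SUPPLEMENTARY


def keep_alignment(sam_line, min_mapq=20, require_proper_pair=True):
    """Return True if a single SAM alignment record passes the filters.

    min_mapq=20 is an illustrative threshold only; MAPQ scales differ
    between aligners, so it should be tuned per tool and application.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])   # column 2: bitwise FLAG
    mapq = int(fields[4])   # column 5: mapping quality
    if flag & EXCLUDE_MASK:
        return False
    if require_proper_pair and not flag & PROPER_PAIR:
        return False
    return mapq >= min_mapq


def filter_sam(lines, **kwargs):
    """Yield header lines unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, **kwargs):
            yield line
```

With these settings the filter should correspond roughly to `samtools view -f 2 -F 3844 -q 20` (3844 = 0x4 + 0x100 + 0x200 + 0x400 + 0x800). Whether dropping low-MAPQ reads is too aggressive for targets in repetitive regions is exactly the part I am unsure about.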
Thanks in advance.
The choice of filters depends on the nature of your samples and the hypotheses you are testing. For example, if the regions of interest lie in repetitive sequence, then removing reads with multiple alignments makes no sense. You should frame your question in terms of the biological question rather than as a generic "how do I filter reads?"
Question has been edited to include my application.