Question

Filtering NGS Genomic Alignments

4

13.8 years ago

Travis ★ 2.8k

Following alignment of paired-end DNA reads to the human genome, I am wondering what sort of filters should generally be applied to the alignments.

I know that duplicates should be removed, as they are likely PCR artifacts (this is possible with SAMtools).

Can anyone outline other important criteria to filter on and perhaps suggest a filter threshold? I have heard mention of removing any reads with more than one alignment, but I am not sure whether this is overkill.

If it helps, I plan to align 100 bp PE Illumina reads to the human genome using BFAST and the 10 indexes recommended in the publication. My application will be targeted sequencing of approximately 1000 genes, followed by SNP/indel analysis to look for association with a given phenotype.

Thanks in advance.

next-gen sequencing alignment filter paired • 4.9k views

Updated 13.0 years ago by John St. John ★ 1.2k • written 13.8 years ago by Travis ★ 2.8k

3

The choice of filters depends on the nature of the samples and the hypotheses you are testing. For example, if the regions of interest lie in repetitive sequence, then removing reads with multiple alignments makes no sense. You should therefore frame your question in terms of the biological question rather than as a generic "how do I filter reads?"

13.8 years ago by Istvan Albert 102k

0

Question has been edited to include my application.

13.8 years ago by Travis ★ 2.8k

Ram · Answer 1 · 2011-07-07

You could always follow the GATK best practices for this kind of thing, or check the supplementary material of a paper such as the 1000 Genomes Project pilot.

One thing I would add, which I am not sure either source discusses, is to check for Illumina adapter sequences and trim them from your data. I don't know how much trouble adapter contamination causes for variant callers, but I have seen it sneak into some published genome databases. Some of Illumina's adapter sequences are posted online, but I haven't had any luck finding the multiplexed adapter sequences there. If you write to Illumina, though, they will send you a letter with all of the current sequences; it is then up to you to determine which ones could be in your reads and remove them. There are programs that do this, but I think they all work directly on the FASTQ reads rather than on the alignments to the genome.
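To make the trimming step concrete, here is a minimal Python sketch of 3' adapter removal on a single read. It is only an illustration under simplifying assumptions: it matches the adapter exactly (no mismatches or indels), the function name `trim_3prime_adapter` and the `min_overlap` parameter are made up for this example, and the adapter string is a placeholder rather than a real Illumina sequence. You would substitute whichever sequences apply to your library.

```python
# Placeholder adapter for illustration only; NOT a real Illumina sequence.
EXAMPLE_ADAPTER = "AGGTCCATCGTA"


def trim_3prime_adapter(seq, qual, adapter, min_overlap=5):
    """Trim an exact-match adapter from the 3' end of a read.

    First looks for the full adapter anywhere in the read, then for a
    prefix of the adapter (at least min_overlap bases) at the read's end.
    Returns the (possibly shortened) sequence and quality strings.
    """
    cut = seq.find(adapter)
    if cut == -1:
        # Try progressively shorter adapter prefixes against the read's tail.
        longest = min(len(adapter) - 1, len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                cut = len(seq) - k
                break
    if cut == -1:
        return seq, qual  # no adapter found; leave the read untouched
    return seq[:cut], qual[:cut]
```

A read that runs fully or partly into the adapter is cut at the point where the adapter starts, and its quality string is shortened to match. Real data contain sequencing errors, so an exact-match approach like this will miss some contaminated reads; it is meant to show the idea rather than replace a dedicated trimmer.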

Here is the 1000 Genomes Project pilot paper: http://dx.doi.org/doi:10.1038/nature09534

And here are Broad's GATK best practices recommendations: http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2
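As a rough illustration of the two filters raised in the question (dropping duplicates and dropping ambiguously placed reads), here is a small Python sketch that works on SAM-format text lines. The FLAG bit values come from the SAM specification; the function names and the `min_mapq=20` threshold are assumptions chosen for the example, not recommendations. How a multi-mapped read is scored in MAPQ depends on the aligner, so check what yours reports before relying on a MAPQ cutoff for this.

```python
# SAM FLAG bits, as defined in the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800

EXCLUDE_MASK = (
    FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_DUPLICATE | FLAG_SUPPLEMENTARY
)


def keep_alignment(sam_line, min_mapq=20):
    """Return True if a SAM record passes the flag and MAPQ filters.

    min_mapq=20 is an illustrative threshold only.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # column 2: bitwise FLAG
    mapq = int(fields[4])  # column 5: mapping quality
    if flag & EXCLUDE_MASK:
        return False
    return mapq >= min_mapq


def filter_sam(lines, min_mapq=20):
    """Yield header lines unchanged and only the records that pass."""
    for line in lines:
        if line.startswith("@") or keep_alignment(line, min_mapq):
            yield line
```

Note that the duplicate bit is only set if a duplicate-marking step has already been run on the alignments. Whether the MAPQ filter is appropriate depends on the biology, as pointed out in the comments above: for targets in repetitive regions it may discard exactly the reads you care about.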