I have genomic resequencing medical exome data (~4500) sequenced on an Ion Torrent. For alignment, besides TMAP, any thoughts on samtools, bwa-mem, bowtie2, Novoalign, or other alignment software? Thank you :).
I've never dealt with Ion data, but from what I've read, its main sequencing errors are indels (if not the main ones, they are at least common, unlike in Illumina data). So you have to use a mapper that allows for indels, maybe tweak the parameters to decrease the gap penalty, and most probably realign afterwards. There are some tools, such as PyroTools, and also protocols specific to 454 / Ion data. This paper compares mappers on Ion data, though against bacterial genomes (spoiler: there is no clear "overall best mapper"). Finally, BBMap seems like a good fit as well.
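To make "decrease the gap penalty" concrete, here is a minimal sketch that only builds a bwa mem command line with a lowered gap-open penalty. It assumes bwa is installed and the reference is already indexed; the file names and the value 4 are placeholders for illustration, not tuned recommendations. As far as I know, `-O` and `-E` are bwa mem's gap-open and gap-extension penalties (defaults 6 and 1).

```python
import shlex


def bwa_mem_command(reference, reads, gap_open=6, gap_extend=1, threads=4):
    """Build (but do not run) a bwa mem command as an argument list.

    gap_open and gap_extend map to bwa mem's -O and -E options.
    The defaults here mirror bwa mem's own defaults (6 and 1).
    """
    return [
        "bwa", "mem",
        "-t", str(threads),
        "-O", str(gap_open),
        "-E", str(gap_extend),
        reference,
        reads,
    ]


# Hypothetical file names; gap_open=4 is an illustrative value only.
cmd = bwa_mem_command("ref.fa", "sample.fastq", gap_open=4)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run` once you are happy with it; bwa mem writes SAM to standard output, so you would redirect that to a file. Compare mapping rates and downstream calls against the default penalties before settling on anything.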
I have extensive experience dealing with Ion Torrent data, and it is true that reads show a high rate of homopolymer errors, as suggested by h.mom. For some samples, 30% of reads (reference RNA-seq) require an indel to align against the reference genome. The Ion Proton system can be considered a fancy pH meter that detects the release of protons and decides which nucleotide has been added based on the number of protons released. In regions where the same nucleotide is repeated (for example AAAAA), it is sometimes hard for it to resolve the signal, and it over- or underestimates the real count. As a result, such reads need to be aligned with an insertion or a deletion, depending on whether the sequencer over- or underestimated the number of bases. I would avoid changing the scoring scheme of the alignment. Instead, you should increase the allowed edit distance, both because of the homopolymer errors and because the reads are longer (around 150 bp). You should also increase the maximum number of insertions or deletions allowed in a read, and increase the length of the biggest gap allowed. This has helped me.
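A small sketch of the homopolymer problem described above: it run-length encodes a read and a reference segment and reports where they differ only in the length of a homopolymer run. The function names and the toy sequences are made up for illustration; this is not how any aligner works internally, just a way to see why these reads need an indel to align.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Run-length encode a sequence, e.g. 'AAAC' -> [('A', 3), ('C', 1)]."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def homopolymer_only_difference(read, ref):
    """Compare a read and a reference segment run by run.

    Returns a list of (base, ref_length, read_length) for every run whose
    length differs, or None if the two sequences differ in more than
    homopolymer run lengths.
    """
    read_runs = homopolymer_runs(read)
    ref_runs = homopolymer_runs(ref)
    # The order of bases must match once runs are collapsed.
    if [b for b, _ in read_runs] != [b for b, _ in ref_runs]:
        return None
    return [
        (base, n_ref, n_read)
        for (base, n_read), (_, n_ref) in zip(read_runs, ref_runs)
        if n_read != n_ref
    ]


ref = "GATTAAAAACGT"
read = "GATTAAAACGT"  # the run of five A's was read as four
print(homopolymer_only_difference(read, ref))  # [('A', 5, 4)]
```

Here the read is one A short, so an aligner has to open a 1 bp deletion inside the A run to place it; with a tight indel or edit-distance limit the read would be dropped or soft-clipped instead.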
Please feel free to email me at ashutoshmits at gmail. We have mostly used it for RNA-seq, so the tricks I described above worked because our goal was to increase mapping efficiency. I am not sure how allowing more errors during alignment to increase the alignment rate would affect downstream variant calling results. In our case, we only use uniquely aligned reads for quantification of expression. The good part about longer reads is that even if you are a little liberal with the alignment, you may still be able to align reads uniquely. I think you will have to perform thorough filtering on your VCF files. I may or may not be making sense right now, but we can talk about it over email.
Bowtie2 or bwa-mem.
Thank you :)
My very first thought is SAMtools sucks at performing alignments.