Hey guys,
So, I have some contigs assembled from Illumina paired-end reads (with ABySS) that did not map to our reference genome sequence, which was supposed to be the only thing in our sample. About half the reads did not map, and we sequenced to a high depth. I want to find out which of these contigs are actually real.
My thought is to map the reads back to the contigs with bowtie2 and determine from the mapping data which contigs are best supported. I already looked at how many reads mapped to each contig, but I realized that didn't tell me enough. I would like to determine support for a contig based on how many read pairs mapped concordantly and with the correct insert size. How can I do this procedurally? What should the formula look like for generating a quantitative measure of support?
Open to other ideas, too.
Thanks!
Usually you can trust assemblers. They won't assemble contigs from nowhere. As Istvan said, searching against nt is a necessary step. Many sequences in nt are not included in the reference assembly. nt also helps to identify microbial contamination. If you are working on a model organism, also run RepeatMasker. At least for humans, these extra contigs tend to be diverged copies of repeats.
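On the read-pair support measure asked about in the question, here is one possible sketch rather than an established formula. It reads SAM text, counts each pair once through its first mate, and reports for every contig the fraction of mapped pairs that are flagged as properly paired (SAM flag 0x2) with an absolute template length inside an expected window, plus that count normalized per kb of contig. The function name `contig_support`, the two output scores, and the default window of 100 to 600 bp are my own assumptions for illustration; set the window from your library's actual insert-size distribution.

```python
from collections import defaultdict

# SAM flag bits (from the SAM specification)
PROPER_PAIR, UNMAPPED, FIRST_IN_PAIR = 0x2, 0x4, 0x40
SECONDARY, SUPPLEMENTARY = 0x100, 0x800


def contig_support(sam_lines, min_insert=100, max_insert=600):
    """Per-contig pair support from SAM text lines.

    min_insert / max_insert are hypothetical defaults, not recommendations.
    """
    lengths = {}                    # contig name -> length, from @SQ headers
    mapped = defaultdict(int)       # mapped first mates per contig
    concordant = defaultdict(int)   # proper pairs within the insert window

    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@SQ"):
                tags = dict(f.split(":", 1) for f in line.split("\t")[1:])
                lengths[tags["SN"]] = int(tags["LN"])
            continue

        fields = line.split("\t")
        flag, contig, tlen = int(fields[1]), fields[2], int(fields[8])

        # Skip unmapped, secondary and supplementary records, and count
        # only the first mate so that every pair is counted once.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        if not flag & FIRST_IN_PAIR:
            continue

        mapped[contig] += 1
        if flag & PROPER_PAIR and min_insert <= abs(tlen) <= max_insert:
            concordant[contig] += 1

    scores = {}
    for contig, length in lengths.items():
        n, c = mapped[contig], concordant[contig]
        scores[contig] = {
            "mapped_first_mates": n,
            "concordant_pairs": c,
            "concordant_fraction": c / n if n else 0.0,
            "concordant_pairs_per_kb": 1000 * c / length,
        }
    return scores
```

To use it, pass an open SAM file (or any iterable of SAM lines) from the bowtie2 run against the contigs, for example `contig_support(open("reads_vs_contigs.sam"))`, where the file name is a placeholder. bowtie2's `-I` and `-X` options set the minimum and maximum fragment length it treats as concordant, so it makes sense to keep them consistent with the window used here. Contigs with a low concordant fraction or very few concordant pairs per kb are the ones to look at more closely; where to put the cutoff is a judgment call for your data.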