I have paired-end Illumina reads from several BACs of an unsequenced plant species. After assembling them with SOAPdenovo, I realized that the total assembled size for each BAC ranged from approximately 1 Mb to 2 Mb (far too much). So I BLASTed some scaffolds and concluded that they sequenced the whole BACs (the scaffolds have 100% identity with E. coli).
I thought of two ways to handle this:
1. Align the reads against the E. coli genome and the BAC sequences downloaded from NCBI, then take only the unaligned reads for de novo assembly.
2. Assemble all reads de novo into contigs and scaffolds. Then put the scaffolds from all BACs together in one FASTA file and remove redundancy with some tool (e.g. CAP3; do you know of a better one?) to reduce the time needed for BLAST. Scaffolds with hits would be discarded.
It is not likely that you sequenced the whole clone unless you skipped the step of the BAC prep where you isolate the insert with a digest. Though it is odd to see an assembly of that size from a BAC, you probably just sequenced a lot of the clone as well. In my experience, it is very common to see lots of contamination from the clone in your reads, so there is nothing to worry about here. My advice is to obtain the clone sequence used to construct the BAC library. This can be used for screening.
The next thing to do is screen your reads. I would personally not use an aligner for this, because BLAST seems to be more sensitive. You can certainly screen reads with an alignment approach, but in my tests I always still ended up with large chunks of clone DNA in my assemblies.
I would strongly advise against assembling the raw reads and then trying to remove the contaminants. You will not simply end up with contigs from the clone, which would be easy to remove. Instead, there will be large clone-derived stretches of DNA in the middle of contigs and this won't be easy or straightforward to fix. Better to just screen your reads, then assemble.
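To make the screening step concrete, here is a minimal sketch of how you could pull out the IDs of contaminated reads from a BLAST run. It assumes you ran blastn with `-outfmt 6` (default tabular columns, so the query ID is column 1, percent identity column 3 and alignment length column 4) against a database built from the clone and E. coli sequences. The function name and the two thresholds are my own placeholders, not recommended values, so tune them for your data.

```python
import csv


def contaminant_read_ids(blast_tsv_lines, min_identity=95.0, min_aln_len=50):
    """Return the set of query IDs with at least one hit passing both thresholds.

    Expects BLAST tabular output (-outfmt 6) with the default column order:
    qseqid, sseqid, pident, length, ...
    The default thresholds are illustrative assumptions only.
    """
    flagged = set()
    for row in csv.reader(blast_tsv_lines, delimiter="\t"):
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        qseqid = row[0]
        pident = float(row[2])
        aln_len = int(row[3])
        if pident >= min_identity and aln_len >= min_aln_len:
            flagged.add(qseqid)
    return flagged
```

You would call it on an open file handle of the BLAST output, e.g. `contaminant_read_ids(open("hits.tsv"))`, where `hits.tsv` is a hypothetical file name, and then drop the flagged reads before assembly.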
written 10.0 years ago by SES, updated 2.8 years ago by Ram
You need to remove the reads coming from the BAC vector backbone and the E. coli genome. You can use a program like DeconSeq. Assemble only after removing these reads. I agree with @SES: assembling with the contamination still in the reads leads to misassemblies.
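Whichever tool flags the contaminant reads, with paired-end data it is usually safest to drop a pair when either mate is flagged, so that the two files stay in sync for the assembler. Here is a rough sketch of that removal step, not what DeconSeq does internally. It assumes plain 4-line FASTQ records, both files in the same read order, and a set of flagged read names without the `/1` or `/2` suffix. All function names are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def base_id(header):
    """Strip the leading '@', any trailing comment, and a /1 or /2 mate suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def filter_pairs(fq1, fq2, flagged, out1, out2):
    """Write only pairs where neither mate is flagged; return (kept, dropped)."""
    kept = dropped = 0
    for rec1, rec2 in zip(read_fastq(fq1), read_fastq(fq2)):
        if base_id(rec1[0]) in flagged or base_id(rec2[0]) in flagged:
            dropped += 1
            continue
        out1.write("\n".join(rec1) + "\n")
        out2.write("\n".join(rec2) + "\n")
        kept += 1
    return kept, dropped


if __name__ == "__main__":
    # Tiny in-memory demo with two made-up read pairs; pair "b" is flagged.
    r1 = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nTTTT\n+\nIIII\n")
    r2 = io.StringIO("@a/2\nGGGG\n+\nIIII\n@b/2\nCCCC\n+\nIIII\n")
    o1, o2 = io.StringIO(), io.StringIO()
    print(filter_pairs(r1, r2, {"b"}, o1, o2))
```

With real data you would pass open file handles instead of the in-memory strings, then assemble from the two filtered files.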
written 10.0 years ago by Shyam, updated 2.8 years ago by Ram