Hi,
I'm currently working on a de novo genome assembly project. Because the animal we study is so tiny, we had to pool DNA from multiple individuals. We sequenced this DNA using both PacBio and Illumina. I've assembled the PacBio long reads into contigs and want to do scaffolding and error correction using the short reads. But I have the following two concerns:
Can I use these short reads to correct the assembly? Since pooled DNA samples were used, how do error correction tools distinguish sequencing "errors" from "individual differences"?
If I can do error correction, which should be done first, error correction or scaffolding? I don't know what difference the order makes.
I have very little experience in this field. Sorry if the question is a bit basic. I'm totally stuck. Any help would be greatly appreciated.
Thank you for your answer! It's very clear. Now I can proceed. Thank you :)
I've got another question. Maybe I should open a new thread, but you already know the circumstances. Because the animals are so tiny, we used whole bodies for sequencing, so there is considerable sequence contamination in both the PacBio and the Illumina data. What are your suggestions about this? Because the estimated coverage of the PacBio data is more than 50X, I used the long reads to assemble contigs. I then want to do scaffolding with mate-pair data, using these contigs as the backbone. This is my plan, but I don't know how to deal with the contamination.
My general idea is as follows. First, use (all) raw PacBio data to assemble contigs, and mask the (possibly) contaminated regions, as identified with Kraken, with 'N'. Second, filter the Illumina mate-pair data for both quality and contamination. Third, (polish and) scaffold. Do you think this is a good approach? I really appreciate your replies; they are very helpful. Sorry for my poor English.
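For the masking step in this plan, a minimal Python sketch could look like the following. It assumes the contaminated regions are already available as 0-based, half-open (start, end) coordinates; how those coordinates would be derived from the classifier output is not covered here, and `mask_regions` is a hypothetical helper name, not something Kraken provides.

```python
def mask_regions(sequence, regions):
    """Replace every base inside the given regions with 'N'.

    `regions` is an iterable of (start, end) pairs, 0-based and half-open.
    This coordinate convention is an assumption made for the sketch.
    """
    masked = list(sequence)
    for start, end in regions:
        # Clamp to the sequence bounds so bad coordinates cannot raise.
        start = max(start, 0)
        end = min(end, len(masked))
        for i in range(start, end):
            masked[i] = "N"
    return "".join(masked)


example = mask_regions("ACGTACGTAC", [(2, 5)])
print(example)  # ACNNNCGTAC
```

Masking keeps the contig length and coordinates unchanged, which is the main reason to prefer it over cutting the region out.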
Yes, that is a valid approach.
Concerning the contamination: I'm not sure how well Kraken will work on long reads (or contigs). Regardless of the approach, I would indeed also try to remove the contamination (instead of masking it with Ns) at the contig level.
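Removing whole contigs, as opposed to masking, can be sketched as a plain FASTA filter. This is only an illustration: `drop_contigs` and the set of contaminated IDs are hypothetical, and the sketch assumes the contig ID is the first whitespace-delimited word of each header line.

```python
def drop_contigs(fasta_lines, contaminated_ids):
    """Yield FASTA lines, skipping records whose ID is in contaminated_ids."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the ID is the first word after '>'.
            words = line[1:].split()
            contig_id = words[0] if words else ""
            keep = contig_id not in contaminated_ids
        if keep:
            yield line


fasta = [">c1 len=4\n", "ACGT\n", ">c2\n", "GGCC\n"]
kept = list(drop_contigs(fasta, {"c2"}))
print("".join(kept), end="")
```

With a real assembly, `fasta_lines` would simply be the open file handle, and the output lines would be written to a new file.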
I found that more than 70% of the contigs were classified as contaminated, but most of them had only a small part (several k-mers) contaminated. Only a few sequences would be left if I removed whole contigs, so I want to mask the contaminated regions with 'N'. Maybe I should try removing the contamination from the raw PacBio data instead. I will try both. Thank you for your reply!
Unless your assembly is producing chimeric contigs on a massive scale, this should not happen. Usually (in "theory") the whole piece of DNA either is contamination or it is not.
I'm starting to suspect that your contamination criteria are too lenient and that you're ending up with lots of false positives.
Are the Illumina data from the same biological samples as the PacBio data? If so, you could check whether you see the same level of contamination in the Illumina data as in the PacBio contigs.
The latest version of Kraken was used with default parameters, and the assembly was generated by Canu. According to the Kraken results, about 75% of the contigs were classified as archaeal, bacterial, or viral. I was also a bit shocked by this. The contaminated regions accounted for only about 0.65% of the total length (8371218/1281768139).
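As a quick check of that percentage, the two numbers from the post can be divided directly. The variable names are only for illustration.

```python
contaminated_bp = 8_371_218       # bases reported as contaminated
assembly_bp = 1_281_768_139       # total assembly length

fraction = contaminated_bp / assembly_bp
print(f"{fraction:.2%}")  # 0.65%
```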
I will run Kraken on the raw PacBio data. However, the PacBio data and the Illumina data came from different batches of insects; each batch contained about a dozen bugs.
I'm not very familiar with the output of Kraken, but the 75% of contigs does not seem consistent with the 0.65% of the total length (unless these are all very small contigs?).
I'm not sure if it's possible, but from what you write I suggest you do some post-Kraken filtering and remove only those contigs that are marked as contamination over most of their length. It looks to me like there might be plenty of contigs that have only a small number of bases assigned as contamination. If Kraken reports only 50 bp on a contig of 100 kb, I would not consider that contig to be contamination. Remember that genuine eukaryotic contigs can also show some similarity to non-eukaryotic sequences.
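A rough sketch of such post-Kraken filtering in Python is shown below. It rests on several assumptions: the standard tab-separated Kraken output, where the fifth column lists the k-mer LCA assignments as space-separated `taxid:count` pairs (`0` for k-mers with no hit, `A` for k-mers containing ambiguous nucleotides); a database that contains only contaminant taxa, so any non-zero taxid counts as a contaminant hit; and an arbitrary 0.5 threshold. The function names are hypothetical.

```python
def contaminant_kmer_fraction(lca_field):
    """Fraction of k-mers assigned to any taxon (taxid other than 0)."""
    hits = total = 0
    for token in lca_field.split():
        taxid, _, count = token.partition(":")
        if taxid == "A" or not count.isdigit():
            continue  # skip ambiguous k-mers and anything not 'taxid:count'
        n = int(count)
        total += n
        if taxid != "0":
            hits += n
    return hits / total if total else 0.0


def flag_contigs(kraken_lines, min_fraction=0.5):
    """Return IDs of classified contigs whose k-mers are mostly contaminant."""
    flagged = []
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != "C":
            continue  # unclassified or malformed line
        if contaminant_kmer_fraction(fields[4]) >= min_fraction:
            flagged.append(fields[1])
    return flagged


lines = [
    "C\tcontig1\t562\t100000\t0:99950 562:20",  # classified, but tiny hit
    "C\tcontig2\t562\t5000\t562:4000 0:970",    # mostly contaminant k-mers
    "U\tcontig3\t0\t3000\t0:2970",              # unclassified
]
print(flag_contigs(lines))  # ['contig2']
```

In this toy input, contig1 is classified by Kraken yet only 20 of its 99970 k-mers hit anything, so it is kept; only contig2 is flagged. The threshold would need tuning on real data.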
Yes, I agree with you. Because I would need to set a threshold to filter out contaminated contigs, I now prefer to filter the raw data instead. Thank you very much~