I was provided with sequencing data for a single protist genome. I ran Jellyfish (-k 21 -m 35M) on the Trimmomatic-trimmed libraries to estimate the approximate genome size. However, the GenomeScope analysis failed, and I got these results: http://genomescope.org/analysis.php?code=u2FlyNR00NjbjMAkpSUG. I have three questions: 1) Does it make sense to continue working with this data? 2) What could cause such results? 3) Can you recommend an effective way to remove bacterial contamination when the specific sources of contamination are unknown? Unfortunately, a direct comparison against the NCBI nr database is extremely time-consuming and would take weeks.
I will be grateful for any help or advice.
1) Depending on your purposes, yes, it makes sense to continue with it.
2) You could try other k-mer sizes; depending on the sequencing depth and genome size, other k values might work better.
3) You could try Kraken or similar software to filter the reads. Otherwise, I would assemble everything and then filter the assembled contigs for contamination.
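As a rough way to compare k-mer sizes as suggested in point 2, one can check whether each k-mer histogram shows a coverage peak at all before handing it to GenomeScope. The sketch below assumes the two-column text format (multiplicity, number of distinct k-mers) that `jellyfish histo` writes. The function names are made up for illustration, and the peak-finding rule is a simple heuristic, not the model GenomeScope fits.

```python
def read_histo(lines):
    """Parse 'multiplicity count' lines into a dict {multiplicity: count}."""
    histo = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        histo[int(fields[0])] = int(fields[1])
    return histo


def find_main_peak(histo):
    """Return (valley, peak) multiplicities, or None if there is no peak.

    Heuristic (an assumption of this sketch): low-multiplicity k-mers are
    mostly sequencing errors, so the histogram first falls to a valley and
    then rises again to the coverage peak.
    """
    depths = sorted(histo)
    for prev, cur in zip(depths, depths[1:]):
        if histo[cur] > histo[prev]:
            valley = prev
            break
    else:
        # The histogram never rises: no visible coverage peak.
        return None
    candidates = [d for d in depths if d > valley]
    peak = max(candidates, key=lambda d: histo[d])
    return valley, peak


def estimate_genome_size(histo):
    """Estimate genome size as (k-mers above the valley) / (peak multiplicity)."""
    found = find_main_peak(histo)
    if found is None:
        return None
    valley, peak = found
    total_kmers = sum(d * n for d, n in histo.items() if d > valley)
    return total_kmers // peak


if __name__ == "__main__":
    # Toy histogram, not real data: error k-mers at low depth, peak at 5.
    toy = read_histo(["1 1000", "2 200", "3 50", "4 100", "5 200", "6 100"])
    print(find_main_peak(toy), estimate_genome_size(toy))
```

If `find_main_peak` returns `None` for every k you try, the histogram is falling monotonically, which is consistent with coverage being too low for a k-mer based estimate.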
If you are estimating genome size, I assume you want to use this data to assemble a genome. From what you have shown alone, it is not possible to say for sure, but I tend to believe the data is worth using. Did you perform other quality checks on the data?
I think poor sequencing quality or insufficient coverage could cause this. A genome with many repeats at different levels of similarity might also cause it, but this is just a wild guess.
I agree with lieven.sterck on both of his suggestions: Kraken could be used to filter the reads prior to assembly, but I think filtering after assembly is better. I like BlobTools for post-assembly filtering.
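To illustrate the idea behind post-assembly filtering, here is a minimal sketch that flags contigs whose GC content or read coverage departs strongly from the bulk of the assembly. This is not BlobTools and does not use its API. The function names, the thresholds, and the assumption that the target genome dominates the assembly (so the medians describe it) are all choices made for this illustration. Taxonomic assignment, which BlobTools adds on top, is left out.

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def flag_contigs(contigs, coverage, gc_tolerance=0.10, cov_fold=3.0):
    """Label each contig 'keep' or 'suspect'.

    contigs:  {name: sequence}
    coverage: {name: mean read depth}, e.g. computed from a read mapping
    A contig is 'suspect' if its GC differs from the median GC by more than
    gc_tolerance, or its coverage is more than cov_fold times above or
    below the median coverage. Thresholds are arbitrary defaults.
    """
    gc = {name: gc_fraction(seq) for name, seq in contigs.items()}
    ref_gc = median(gc.values())
    ref_cov = median(coverage[name] for name in contigs)
    labels = {}
    for name in contigs:
        gc_off = abs(gc[name] - ref_gc) > gc_tolerance
        cov = coverage[name]
        cov_off = cov > ref_cov * cov_fold or cov < ref_cov / cov_fold
        labels[name] = "suspect" if (gc_off or cov_off) else "keep"
    return labels


if __name__ == "__main__":
    # Toy contigs, not real data.
    contigs = {
        "a": "ATATATATGC",
        "b": "ATATGCATAT",
        "c": "ATTAGCATTA",
        "d": "GCGCGCGCGC",
    }
    coverage = {"a": 30, "b": 32, "c": 28, "d": 300}
    print(flag_contigs(contigs, coverage))
```

Contigs labelled `suspect` are only candidates for removal; they still need to be checked, for example by searching them against a sequence database, before being discarded.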
Dear colleagues, I am very grateful to you for the answers. Thank you so much for Kraken and BlobTools, this is exactly what I wanted to find.