Hi,
I am doing a de novo assembly and I have contamination with Saccharomyces. I cleaned the raw reads (paired reads 150nt) using BBsplit, minimum identity 0.95, but now that I have repeated the assembly I have checked and I still have several contigs which match perfectly or almost perfect with Saccharomyces. What should I do? Repeat the cleaning in the raw reads with lower minimum identity? Cleaning the merged reads? Cleaning the assembly? Thanks!
I just received the RNA, the people who took the tissue samples extracted the RNA with the same equipment and lab that other people working with bacteria, so I suppose it got contaminated there, but some samples are more contaminated than others, I didn't realize until I finished the assembly, did DESeq and Saccharomyces genes pop up as differentially expressed genes :)
Ah - if this is RNA-seq, that changes things. When you have introns and you are mapping RNA-seq reads to a genome, leave maxindel at default and probably minid at default, as well, or the reads spanning introns won't map.
Saccharomyces =/= bacteria (it is a yeast/eukaryote). So it is possible that you don't have any contamination in your data but are seeing conserved genes. I would be very surprised if the people who did the RNA were so careless that a completely different species contaminated your samples.
Here is a test: If you were to do the blast search against a different genome - I assume you used BLAST at NCBI - Enter this in the "Organism" field as test (Drosophila melanogaster (taxid:7227)) and see if you get hits to fruit fly. Saccharomyces may have been the first in the db so it showed up first.
"I would be very surprised if the people who did the RNA were so careless that a completely different species contaminated your samples".
I wouldn't :P . We've had carry over of DNA on the MiSeq before and I was having a discussion just this morning about the frequency of contaminated assemblies. Yeasts are natural commensals so could have come from the people/air/surfaces/equipment/lab - you name it...
Yeah we've since resolved it. It wasn't presenting an issue for our assemblies until we started working on aDNA samples and we're seeing suspect reads in Kraken. We now know that it can cause issues so are on the lookout for it.
Is the real genome you are working with very distinct from S. cerevisiae? I am going to guess that it is not because otherwise bbsplit should have worked as intended.
Can you provide the bbsplit command you used? What program did you use for the assembly? Also provide some additional information about several contigs. Is that 1%, 10% what fraction of the total? I will add "bbmap" tag to your post so @Brian will be sure to see this.
I'm not sure whether this would make a radical difference or not, but can you try the approach from the other side? Map all your reads to Saccharomyces, keeping the unmapped reads which should be fish-specific, and then see what you're left with?
Ah ok, I've done it 'manually' in the past with samtools and bwa when I was trying to assemble plasmids from illumina data of lab strains. Only managed to get it to assemble into 1 perfect contig after discarding fragments of the host gDNA in the miniprep by alignment. Never actually used the BBmap suite. Might have to investigate it further.
BBSplit is excellent at binning reads to multiple genome references. BBTools are easy to use and only thing you need is Java v.1.7 and above. There is no installation needed. Download, uncompress and use. Many bbtools are multi-threaded, efficiently written and handle data sets of almost any size.
That approach works fine when the two organisms have zero homology, but that is rarely the case in practice - particularly when you consider low-complexity sequence (e.g. ATATATATATAT...) which might be present in anything. It is typically better to not remove contamination at all prior to assembly, and just try to do so afterward on the contigs, than to remove it overly-aggressively by discarding all reads that map to the reference of the unmasked suspected contaminant - at least, when working with relatively short (100-150bp) reads.
Of course, the approach becomes safer the more distantly related the organisms are, and the smaller they are; so if you're working with bacteria from different phyla, it's much less likely to cause gaps in your assembly compared to two eukaryotes.
What is the difference in GC content between cerevisiae and your fish organism? If it is clearly divided, you can also divide the scaffolds after assembly depending on the GC content.
I did the same with 22 samples.
Then, I merged the filtered reads using FLASH and I assembled the merged (57-62%) and the non-mergable reads using the CLC Assembly cell. This lead me to up to 764,000 contigs, which I blasted against Sacharomyces cerivisae genome and I got between 300 and 600 hits (lowest e-value 0 ) per saccharomyces chromosome (16).
Hmm. Even at 600 that is 0.08% of total that show a hit to S. cerevisiae. They could be genes that are evolutionarily conserved (or things that bbsplit may have let through since you did not explicitly set how ambiguous mappings were to be handled). What types of hits are you seeing among those 600?
bbmerge.sh is probably a better option for merging instead of FLASH. Any reason why you chose FLASH over that?
If you have low-quality reads, they can still assemble even though they do not map to the reference at 0.95 identity.
But to be clear, in this case you're not exactly using BBSplit as designed; you're supposed to use it with multiple references. If you only have one reference, use BBMap. Also, for decontamination, if you want a high threshold like minid=0.95, it's useful to use something like "qtrim=rl trimq=15 untrim". This will quality-trim reads prior to mapping, then undo the trimming after mapping. As a result, for low quality reads that are not 95% identity to your contaminant due to errors, those will still get mapped. But I think minid=0.95 is pretty high for organisms in different kingdoms; I usually use 0.9. In summary, I suggest this command:
Ah - good question; I've never used CLC. With Spades, Ray, and Tadpole, assembling after running BBMerge yield a superior assembly, in my testing. With Soap Denovo and its successor Megahit, inferior. These results were repeatable across many libraries and organisms. Thus, I can only recommend that for CLC, the test is done both ways, since whether merging is beneficial varies by assembler. Note that in all cases you should assemble with both the merged and unmerged reads.
According to FastQC my reads are very good quality, but in any case, I'm going to try BBMap with stringent parameters (0.90) and let's see.
I have checked other samples from the same fish species that are not supposed to be contaminated and I have more less the same Saccharomyces hits, so...they might be very conserved genes....Isn't weird that if those reads are corresponding to very conserved genes the only hit that BLAST gives is Saccharomyces cerevisiae? I mean, if its very conserved it should give me lots of organisms as hits, right? BLAST just gives me Sacharomyces cerevisiae, the name of the strain and the chromosome, it does not give the the gene name...im confused. But thanks! I will try BBMap
Be sure to take this post into account. If you are mapping RNA-seq data to a genome, you need more relaxed mapping criteria to allow for the introns. I'd recommend default minid and default maxindel.
map your reads against your fish genome (bowtie2,...) then then keep aligned reads using samtools -F4 option.
You can also re-map the reads obtains against Saccharomyces to see if there is homology
Just curious. How did the fish data get contaminated with yeast in the first place?
I just received the RNA, the people who took the tissue samples extracted the RNA with the same equipment and lab that other people working with bacteria, so I suppose it got contaminated there, but some samples are more contaminated than others, I didn't realize until I finished the assembly, did DESeq and Saccharomyces genes pop up as differentially expressed genes :)
Ah - if this is RNA-seq, that changes things. When you have introns and you are mapping RNA-seq reads to a genome, leave maxindel at default and probably minid at default, as well, or the reads spanning introns won't map.
Saccharomyces =/= bacteria (it is a yeast/eukaryote). So it is possible that you don't have any contamination in your data but are seeing conserved genes. I would be very surprised if the people who did the RNA were so careless that a completely different species contaminated your samples.
Here is a test: If you were to do the blast search against a different genome - I assume you used BLAST at NCBI - Enter this in the "Organism" field as test (Drosophila melanogaster (taxid:7227)) and see if you get hits to fruit fly. Saccharomyces may have been the first in the db so it showed up first.
"I would be very surprised if the people who did the RNA were so careless that a completely different species contaminated your samples".
I wouldn't :P . We've had carry over of DNA on the MiSeq before and I was having a discussion just this morning about the frequency of contaminated assemblies. Yeasts are natural commensals so could have come from the people/air/surfaces/equipment/lab - you name it...
I recommend bleach wash for carryover contamination if you're not doing it already...
Yeah we've since resolved it. It wasn't presenting an issue for our assemblies until we started working on aDNA samples and we're seeing suspect reads in Kraken. We now know that it can cause issues so are on the lookout for it.
I'm not proposing this as an answer but its worth ruling out at least
"Current ecological understanding of fungal-like pathogens of fish: what lies beneath?"
https://www.ncbi.nlm.nih.gov/pubmed/24600442/
Is the real genome you are working with very distinct from S. cerevisiae? I am going to guess that it is not because otherwise
bbsplit
should have worked as intended.Yes, is a fish trancriptome
Can you provide the bbsplit command you used? What program did you use for the assembly? Also provide some additional information about
several contigs
. Is that 1%, 10% what fraction of the total? I will add "bbmap" tag to your post so @Brian will be sure to see this.I'm not sure whether this would make a radical difference or not, but can you try the approach from the other side? Map all your reads to Saccharomyces, keeping the unmapped reads which should be fish-specific, and then see what you're left with?
That is
sorta kinda
what @elena did when using BBsplit with just the yeast genome.Ah ok, I've done it 'manually' in the past with
samtools
andbwa
when I was trying to assemble plasmids from illumina data of lab strains. Only managed to get it to assemble into 1 perfect contig after discarding fragments of the host gDNA in the miniprep by alignment. Never actually used theBBmap
suite. Might have to investigate it further.BBSplit is excellent at binning reads to multiple genome references. BBTools are easy to use and only thing you need is Java v.1.7 and above. There is no installation needed. Download, uncompress and use. Many bbtools are multi-threaded, efficiently written and handle data sets of almost any size.
That approach works fine when the two organisms have zero homology, but that is rarely the case in practice - particularly when you consider low-complexity sequence (e.g. ATATATATATAT...) which might be present in anything. It is typically better to not remove contamination at all prior to assembly, and just try to do so afterward on the contigs, than to remove it overly-aggressively by discarding all reads that map to the reference of the unmasked suspected contaminant - at least, when working with relatively short (100-150bp) reads.
Of course, the approach becomes safer the more distantly related the organisms are, and the smaller they are; so if you're working with bacteria from different phyla, it's much less likely to cause gaps in your assembly compared to two eukaryotes.
What is the difference in GC content between cerevisiae and your fish organism? If it is clearly divided, you can also divide the scaffolds after assembly depending on the GC content.
Yes, this is the command I used for BBslit:
I did the same with 22 samples. Then, I merged the filtered reads using FLASH and I assembled the merged (57-62%) and the non-mergable reads using the CLC Assembly cell. This lead me to up to 764,000 contigs, which I blasted against Sacharomyces cerivisae genome and I got between 300 and 600 hits (lowest e-value 0 ) per saccharomyces chromosome (16).
Hmm. Even at 600 that is 0.08% of total that show a hit to S. cerevisiae. They could be genes that are evolutionarily conserved (or things that bbsplit may have let through since you did not explicitly set how ambiguous mappings were to be handled). What types of hits are you seeing among those 600?
bbmerge.sh
is probably a better option for merging instead of FLASH. Any reason why you chose FLASH over that?