Hello everyone!
I am new to bioinformatics and have never faced such a problem, but now I am working with dataset GSE172189, which appears to be contaminated by bacteria (only ~60% of reads align with Salmon, and the SRA Taxonomy Analysis says up to 25% of reads come from bacteria). I want to identify the contaminating species and then get rid of them. In this case, it seems I could take the most represented organisms from the SRA Taxonomy Analysis, download their genomes, and run BBSplit against them. But I am not sure that would be as effective as I want, and in the general case I would not know the contaminating species at all. How do you proceed in this situation? Is there a tool that can BLAST reads against multiple organisms and report frequency statistics?
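Something like this is what I had in mind, just as a rough sketch (the reference FASTA names are placeholders for whatever genomes the SRA Taxonomy Analysis points to, and the flags would need checking against the BBSplit documentation):

    # split reads between the target genome and the suspected contaminant genomes
    bbsplit.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        ref=human.fa,contaminant1.fa,contaminant2.fa \
        basename=binned_%_#.fastq.gz \
        refstats=refstats.txt \
        outu1=unbinned_R1.fastq.gz outu2=unbinned_R2.fastq.gz
    # refstats.txt should then show how many reads were assigned to each reference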
Thank you in advance!
You could use the reads that salmon was able to assign and ignore the rest. NCBI uses a tool called STAT (LINK) for the taxonomy results they show in SRA.
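For example, something along these lines could keep only the reads salmon assigned (a rough sketch, assuming paired-end FASTQs; the file and index names are placeholders, and the exact location of unmapped_names.txt may vary by salmon version):

    # run salmon and ask it to record which reads it could not assign
    salmon quant -i transcriptome_index -l A \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
        -p 8 --writeUnmappedNames -o salmon_out

    # keep only the read-name column from the unmapped list
    cut -d ' ' -f 1 salmon_out/aux_info/unmapped_names.txt > unmapped_ids.txt

    # drop those reads from the original FASTQs (include=f excludes the listed names)
    filterbyname.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out=kept_R1.fastq.gz out2=kept_R2.fastq.gz \
        names=unmapped_ids.txt include=f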
Thank you! According to the dataset description, the reads also seem to contain UMIs, which is why I am not sure that all unaligned reads come from contamination. My intention was to first get rid of the contaminated reads and then deal with the UMIs.
Which exact sample out of GSE172189 are you referring to? I checked a couple and they all seem to say 99% Euk. I also don't see any mention of UMI.
For example, GSM5243620 has ~25% Bact., and the GC distribution is weird. They describe UMIs in the article: "Next, the 3’ ends of first-strand cDNA fragments were ligated with a linker containing Illumina-compatible P5 sequences and Unique Molecular Identifiers."
If you must work with this data, then you could align it (do not use salmon) with an aligner like STAR/BBMap. Then recover the reads that mapped from the original dataset (depending on where the UMIs are, they will likely be soft-clipped by the aligner) and do what you need afterwards. Use filterbyname.sh to extract the mapped reads from the original data files.
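A rough sketch of that workflow (index and sample names are placeholders, and the STAR parameters would need tuning for your data):

    # align to the target genome with STAR instead of salmon
    STAR --runThreadN 8 --genomeDir star_index \
        --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
        --readFilesCommand zcat \
        --outSAMtype BAM Unsorted \
        --outFileNamePrefix sample_

    # collect the names of reads that actually mapped (-F 4 drops unmapped records)
    samtools view -F 4 sample_Aligned.out.bam | cut -f 1 | sort -u > mapped_ids.txt

    # pull those reads back out of the original FASTQs, UMIs and all
    filterbyname.sh in=sample_R1.fastq.gz in2=sample_R2.fastq.gz \
        out=mapped_R1.fastq.gz out2=mapped_R2.fastq.gz \
        names=mapped_ids.txt include=t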
Thank you for such a great explanation!
You could also use a tool like Kraken2 with bacterial and human databases to identify the likely source organisms of your data. I like to use these for QC when I work with publicly available raw datasets.
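For example (a minimal sketch; the database path is a placeholder for whatever standard or custom Kraken2 database you have built):

    # classify reads against a Kraken2 database to see which organisms are present
    kraken2 --db kraken2_db --threads 8 \
        --paired --gzip-compressed \
        --report sample.kreport --output sample.kraken \
        sample_R1.fastq.gz sample_R2.fastq.gz
    # sample.kreport then gives per-taxon read counts and percentages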
But I also agree with the other commenter, ignoring reads that don't map to your reference is a simple way of removing likely contamination (and probably a small proportion of unmapped target species reads).
Thank you for the suggestion!