I want to detect any non-human contaminants in my sequencing data (RNA/DNA). Is there a quick tool that can provide a rough estimate without actually aligning to the genome? I think DeconSeq is not working.
Thanks
You have several alternatives.
If you don't have a clue about the origin of the contamination, you have two more choices for discovering its source. One is short (but needs luck) and the other takes as long as mapping your reads to the human genome.
The first is to use Kraken. Kraken uses either a pre-configured database of sequences from a mixture of known organisms, or one you build yourself. You use Kraken to figure out the source of contamination, and then you can get rid of those reads with BBSplit. You need some luck, however, to pin down the organism the contamination comes from. Kraken runs very quickly with the provided database and is worth an attempt. But remember: once you discover the source of contamination, the only way to get rid of the reads is by mapping.
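Once Kraken has run, its report can be skimmed for the most abundant species. Here is a minimal Python sketch, assuming the tab-separated report layout written by kraken-report (percent of reads, reads in clade, reads assigned directly, rank code, taxid, indented name). The function name, the threshold and the toy report are made up for illustration, not real results.

```python
def top_species(report_text, min_percent=1.0):
    """Return (name, percent, reads) for species-level rows above a threshold."""
    hits = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent, clade_reads, _direct, rank, _taxid, name = fields[:6]
        # Rank code "S" marks species-level rows in the report.
        if rank == "S" and float(percent) >= min_percent:
            hits.append((name.strip(), float(percent), int(clade_reads)))
    # Most abundant species first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy report, invented for illustration only.
example_report = "\n".join([
    " 10.00\t100\t100\tU\t0\tunclassified",
    " 90.00\t900\t0\t-\t1\troot",
    " 80.00\t800\t800\tS\t9606\t      Homo sapiens",
    "  8.00\t80\t80\tS\t562\t      Escherichia coli",
    "  0.50\t5\t5\tS\t1280\t      Staphylococcus aureus",
])

for name, percent, reads in top_species(example_report):
    print(f"{name}\t{percent:.2f}%\t{reads} reads")
```

Any non-human species that shows up above the threshold is a candidate reference to hand to BBSplit.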
A longer alternative is to use blobology. It is not as fast as Kraken, since it relies on downloading the whole nt database from NCBI; following an included script, it assembles your reads and does a mapping step to discover the source of contamination.
That being said, it will be a lot better if you map your sequences to the human genome and get rid of the unmapped sequences.
See these posts:
How to remove contamination from the transcriptome assembly
A: Sequence Reads Unmapped To Human Genome
Contaminating Sequences And Genome Assembly
See these papers:
http://www.sciencedirect.com/science/article/pii/S0888754314001517
If you cannot reach it, see the link below:
http://www.sciencedirect.com.sci-hub.cc/science/article/pii/S0888754314001517
In addition to @Amitm's answer: BBMap contains BBSplit, which is designed to bin reads based on reference genomes. In your case, if you only care about human sequences, you can bin those away from the rest of the data.
You can sample a random set of reads from your data by using reformat.sh from the BBMap suite (replace NNN with the number of reads you want to sample):
$ reformat.sh in=reads.fq out=sampled.fq sample=NNN
Then use the sampled reads with BBSplit as a test, if you don't want to process the entire file.
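If the BBMap suite is not at hand, the same idea of drawing a fixed number of random reads can be sketched in plain Python with reservoir sampling. This is only an illustration and not how reformat.sh works internally; it assumes simple 4-line FASTQ records (no wrapped sequence lines), and the function names are hypothetical.

```python
import random


def read_fastq(lines):
    """Yield 4-line FASTQ records (header, sequence, '+', qualities)."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            yield tuple(record)
            record = []


def sample_reads(records, n, seed=0):
    """Reservoir sampling: keep n records chosen uniformly at random."""
    rng = random.Random(seed)
    reservoir = []
    for i, rec in enumerate(records):
        if i < n:
            reservoir.append(rec)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = rec  # replace with probability n / (i + 1)
    return reservoir


# Toy input: ten tiny reads, invented for illustration.
toy_lines = []
for idx in range(10):
    toy_lines += [f"@read{idx}\n", "ACGTACGT\n", "+\n", "IIIIIIII\n"]

for rec in sample_reads(read_fastq(toy_lines), n=3):
    print(rec[0])
```

Because the reads are streamed one at a time, the whole file never has to fit in memory; for real data you would pass an open file handle instead of `toy_lines`.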
Hi, you might want to look at BBDuk (from the BBMap suite). Here is a helpful SEQanswers thread on its use cases: http://seqanswers.com/forums/showthread.php?t=42776
It uses k-mer based filtering to pull out possible contaminants.
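To make the k-mer idea concrete, here is a minimal Python sketch of k-mer based screening. It is not BBDuk's implementation, just the general technique: index every k-mer of a suspected contaminant reference (both strands), then flag any read that shares at least one k-mer with it. The sequences, k=8 and the min_hits parameter are assumptions chosen for the toy example.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_index(references, k):
    """Collect k-mers from both strands of every reference sequence."""
    index = set()
    for ref in references:
        ref = ref.upper()
        index |= kmers(ref, k)
        index |= kmers(revcomp(ref), k)
    return index


def is_contaminant(read, index, k, min_hits=1):
    """Flag a read that shares at least min_hits k-mers with the index."""
    hits = sum(1 for kmer in kmers(read, k) if kmer in index)
    return hits >= min_hits


# Toy data, invented for illustration.
K = 8
contaminant_ref = "ACGTTGCAAGGCTTACCGATAGGCTA"
index = build_index([contaminant_ref], K)
reads = [
    contaminant_ref[5:21],           # forward-strand fragment
    revcomp(contaminant_ref[2:18]),  # reverse-strand fragment
    "T" * 16,                        # unrelated read
]
flags = [is_contaminant(read, index, K) for read in reads]
print(f"{sum(flags)} of {len(reads)} reads flagged as possible contaminants")
```

Since no alignment is involved, the fraction of flagged reads gives a quick rough estimate of contamination, which is what the original question asked for.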