I have RNA-seq data from algal cultures that are not truly axenic, so some bacterial or fungal contamination is to be expected. I assembled my reads with Trinity and would now like to estimate the origin of each individual contig. Ideally, I would like to visualize where the contamination comes from on the Tree of Life, in order to:
- check that the origin of the contamination makes sense (i.e. that I see what I expect to see for algal cultures)
- remove contaminants and "clean" the assembly
Is there any tool that could do this for me?
I started by automatically extracting the "best" BLAST hit for each contig, but I am getting a large variety of hits and I am not sure how to summarize them or how to place them phylogenetically.
Thanks for your help.
NCBI has a new pre-made BLAST database, ref_prok_rep (representative prokaryotic genomes). Since you have assembled sequences, you could run a quick BLAST against it to pick off any low-hanging fruit in terms of identification.
Why not start by filtering out reads that can be mapped to known bacterial/fungal species?
Can you point me to such a list/database?
I'm a big fan of Kraken for contamination screening. The program assigns a taxid to each read, and with a little legwork you can filter on that. If you run the kraken-translate tool, you should get the full taxonomy for each read and can filter using keywords: e.g. get a list of reads with the word "bacteria" in their kraken-translate entry, then remove all of those from your read set.
https://ccb.jhu.edu/software/kraken/MANUAL.html#output-format
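A rough sketch of that keyword filter in Python, in case it helps. It assumes the kraken-translate output is tab-separated with the sequence ID in the first column and the taxonomy string in the second; the function names, the keyword list, and the example records are all made up for illustration, and it filters FASTA (e.g. contigs) rather than FASTQ. Keyword matching is crude, so check what you are discarding before trusting it.

```python
def flagged_ids(translate_lines, keywords=("bacteria", "fungi")):
    """Return IDs whose taxonomy string contains any keyword (case-insensitive).

    Assumes each line looks like: <sequence ID><TAB><taxonomy string>.
    """
    flagged = set()
    for line in translate_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        seq_id, _, lineage = line.partition("\t")
        lineage = lineage.lower()
        if any(keyword in lineage for keyword in keywords):
            flagged.add(seq_id)
    return flagged


def drop_flagged(fasta_lines, flagged):
    """Yield FASTA lines, skipping every record whose ID is in `flagged`."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id not in flagged
        if keep:
            yield line


# Made-up example data, not real Kraken output.
translate = [
    "contig1\troot;cellular organisms;Bacteria;Proteobacteria",
    "contig2\troot;cellular organisms;Eukaryota;Viridiplantae;Chlorophyta",
    "contig3\troot;cellular organisms;Eukaryota;Opisthokonta;Fungi",
]
fasta = [
    ">contig1 len=4", "ACGT",
    ">contig2 len=4", "GGCC",
    ">contig3 len=4", "TTAA",
    ">contig4 len=4", "CCCC",  # no taxonomy entry, so it is kept
]

flagged = flagged_ids(translate)
cleaned = list(drop_flagged(fasta, flagged))
print(sorted(flagged))
print(cleaned)
```

With real files you would pass open file handles instead of the lists. Sequences with no entry in the translate output are kept, so anything left unclassified still needs a separate look.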
I am honestly not sure which is better: clean before assembly or after assembly.