I have some bulk RNA-seq data from human samples that have clearly been contaminated by non-human sources: I'm observing dismal alignment to the human genome even after generously adjusting thresholds.
I want to diagnose the source of contamination at a broad level (i.e. where are these reads coming from?).
Originally I was going to BLAST the reads, but it takes way too long and is overkill for my question (I'm wondering which non-human sources are present, not which genes).
Are there any basic packages people know of that report the species/genera present in RNA-seq data and that I can bake into my existing QC pipeline?
Thanks!
Use fastqscreen (FastQ Screen). It will screen for contamination from model organisms, human, mouse, rat and vectors by default. If you can guess the contaminant's source organism, index its genome, place the indices in an appropriate location, and edit the config; fastqscreen will then screen your FASTQ files against those genomes too.
BLAST is definitely not the best tool for the job. There are a few alternatives covered here: Faster BLAST alternative
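To make the fastqscreen setup above a bit more concrete, here is a minimal Python sketch that writes a config listing extra genomes and builds the command line for a QC pipeline. It assumes Bowtie2 indices already exist; the index paths, the labels, and the helper names `build_config` and `build_command` are hypothetical and only for illustration, so check them against your own install and the FastQ Screen documentation.

```python
import shlex


def build_config(databases, threads=8):
    """Return the text of a FastQ Screen config file.

    `databases` maps a display label to a Bowtie2 index basename
    (the path without the .1.bt2 etc. suffixes). Paths are hypothetical.
    """
    lines = [f"THREADS\t{threads}"]
    for label, index_basename in databases.items():
        # One DATABASE line per genome to screen against.
        lines.append(f"DATABASE\t{label}\t{index_basename}")
    return "\n".join(lines) + "\n"


def build_command(fastq, conf_path, subset=100000, outdir="fastq_screen_out"):
    """Return the fastq_screen invocation as an argument list."""
    return [
        "fastq_screen",
        "--conf", str(conf_path),
        "--aligner", "bowtie2",
        "--subset", str(subset),  # screen only a subset of reads for speed
        "--outdir", str(outdir),
        str(fastq),
    ]


if __name__ == "__main__":
    # Hypothetical index locations; replace with your own.
    config_text = build_config({
        "Human": "/refs/human/GRCh38",
        "Ecoli": "/refs/ecoli/K12",
        "Mycoplasma": "/refs/mycoplasma/M_hyorhinis",
    })
    with open("fastq_screen.conf", "w") as handle:
        handle.write(config_text)
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command("sample_R1.fastq.gz", "fastq_screen.conf")))
```

The command is only printed here; in a pipeline you could pass the same list to `subprocess.run` once the paths are confirmed.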
BLAST is more sensitive than all these alternatives.
Yes, but you can't realistically BLAST thousands or millions of reads. The goal here is to "diagnose the source of contamination at a broad level".
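If someone still wants BLAST-level sensitivity, one way around the volume problem is to look at only a small random sample of reads. The sketch below is an illustration under that assumption, not something proposed in this thread: it uses reservoir sampling to pick `n` records from a FASTQ file in a single pass, and the function name `sample_fastq` is hypothetical.

```python
import gzip
import random


def sample_fastq(path, n, seed=0):
    """Return up to n randomly chosen FASTQ records (each a list of 4 lines).

    Uses reservoir sampling, so the file is read once and only n records
    are held in memory. Assumes standard 4-line FASTQ records.
    """
    rng = random.Random(seed)
    opener = gzip.open if str(path).endswith(".gz") else open
    reservoir = []
    with opener(path, "rt") as handle:
        index = 0
        while True:
            record = [handle.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            if index < n:
                reservoir.append(record)
            else:
                # Replace an existing entry with probability n / (index + 1).
                slot = rng.randint(0, index)
                if slot < n:
                    reservoir[slot] = record
            index += 1
    return reservoir


if __name__ == "__main__":
    # Hypothetical input file; write the sample out as FASTA for a BLAST query.
    for header, sequence, _, _ in sample_fastq("sample_R1.fastq.gz", 1000):
        print(">" + header[1:].strip())
        print(sequence.strip())
```

A few hundred to a few thousand reads is usually enough to see which broad groups of organisms dominate, which matches the "broad level" goal of the question.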