Hi!
I have received a series of human poly-A RNA-seq samples (single-end, 75 bp) that display suspicious mapping statistics. These samples were mapped with STAR, and roughly 30-50% of the reads are reported as "unmapped: reads too short". Previous samples prepared with the same method had only 5-10%.
Despite the sharp drop in uniquely mapped reads, the sequencing itself worked well (many genes detected, reads mapping to exons, splicing visible, ...).
After careful inspection of the reads, I am starting to suspect bacterial contamination because:
- Many of the BLASTed reads are a perfect match to E. coli or other prokaryotes.
- These are not ribosomal reads (evaluated with BBDuk).
- They do not appear to contain the primers / adapter sequences used in the library preparation.
- If I map these reads to a hybrid E. coli 16S - h38 genome, I get 10-100 times more reads mapping to the E. coli part in these new samples than in the old ones.
I would like to estimate the proportion of reads coming from prokaryotes (E. coli?) in these samples. As I am not familiar with the metagenomics field, I was wondering if someone could recommend a procedure to do so.
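For a rough first estimate, the hybrid mapping mentioned above could be summarized per reference from the output of `samtools idxstats` (tab-separated: reference name, length, mapped reads, unmapped reads). Below is a minimal sketch, assuming the E. coli sequences in the hybrid reference share a name prefix; the prefix `ecoli_` and the toy numbers are invented for illustration. Note that it gives the fraction of mapped reads only, since unmapped reads may not be present in the BAM file.

```python
def contaminant_fraction(idxstats_text, prefix="ecoli_"):
    """Fraction of mapped reads assigned to references whose name starts with `prefix`.

    `idxstats_text` is the text printed by `samtools idxstats`.
    `prefix` is a hypothetical naming convention for the contaminant contigs.
    """
    total_mapped = 0
    contaminant_mapped = 0
    for line in idxstats_text.splitlines():
        fields = line.split("\t")
        # Skip malformed lines and the final "*" line (reads without a reference).
        if len(fields) < 4 or fields[0] == "*":
            continue
        mapped = int(fields[2])
        total_mapped += mapped
        if fields[0].startswith(prefix):
            contaminant_mapped += mapped
    return contaminant_mapped / total_mapped if total_mapped else 0.0


# Toy input with invented counts, only to show the expected layout.
example = (
    "chr1\t248956422\t900\t0\n"
    "ecoli_K12\t4641652\t100\t0\n"
    "*\t0\t0\t50\n"
)
print(contaminant_fraction(example))
```

Comparing this fraction between the old and new samples would put a number on the 10-100 fold difference seen above.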
I am also open to other suggestions regarding the possible issues with these samples.
Thank you in advance!
Try FastQ Screen (fastq_screen). Index the E. coli genome and edit the configuration file so that it points to the index. FastQ Screen then prints out the contamination levels. Please increase the number of reads to be analyzed.
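The steps above can be sketched as a small Python helper that writes the configuration and assembles the command line. This is only a sketch under assumptions: the database names, index paths and file names are placeholders, Bowtie2 is assumed as the aligner, and `--subset 0` is used on the understanding that it makes FastQ Screen process all reads instead of a subsample. Please check the options against the documentation of your installed version.

```python
def make_config(databases, threads=4):
    """Build the text of a FastQ Screen configuration file.

    `databases` maps a label to the basename of an aligner index
    (all paths used here are placeholders, not real locations).
    """
    lines = [f"THREADS\t{threads}"]
    for name, index_basename in databases.items():
        lines.append(f"DATABASE\t{name}\t{index_basename}")
    return "\n".join(lines) + "\n"


def make_command(conf_path, fastq_path, subset=0):
    """Assemble the fastq_screen call as an argument list (nothing is executed here)."""
    return [
        "fastq_screen",
        "--conf", conf_path,
        "--aligner", "bowtie2",
        "--subset", str(subset),  # 0 is assumed to mean "use all reads"
        fastq_path,
    ]


config_text = make_config({
    "Human": "/path/to/indexes/GRCh38",      # placeholder path
    "Ecoli": "/path/to/indexes/Ecoli_K12",   # placeholder path
})
command = make_command("fastq_screen.conf", "sample.fastq.gz")

print(config_text)
print(" ".join(command))
```

The configuration text would be saved as `fastq_screen.conf`, and the printed command run in a shell or via `subprocess.run(command, check=True)`. Adding more prokaryotic genomes as extra `DATABASE` entries would help to check whether the contamination is limited to E. coli.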