4.6 years ago
Anand Rao
For my pipeline that pre-processes RNA-Seq reads prior to reference genome mapping, I have assessed contamination from various sequence sources using FastQ_Screen (from Babraham Bioinformatics, the group that brought us the widely used FastQC).
I have pasted below the FastQ_Screen results for RAW READS (before any of the pre-processing steps) and FINAL PROCESSED READS (after all of my pre-processing steps have been completed).
Based on the two tables below, could you please comment on the following:
- Are contaminant levels in the final processed reads low enough to proceed with mapping to the ref. genome?
- Is it safe to assume that the persistent low-level hits to cat, dog, mouse and human will simply end up in the unmapped category when I map to my plant target genome, rather than resulting in inaccurate mapping?
- Were contaminant levels in the raw reads already low enough, indicating that this library was a decent sample to start with?
- Do the differences between the raw reads and the fully processed reads suggest over-processing?
- Is there any other observation that stands out that I have not thought to ask about?
Thank you!
FASTQ_Screen for RAW READS
Genome / Reference #Reads_processed #Unmapped %Unmapped #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome %Multiple_hits_one_genome #One_hit_multiple_genomes %One_hit_multiple_genomes Multiple_hits_multiple_genomes %Multiple_hits_multiple_genomes
adapters 14198683 14189047 99.94 18 0 1 0 1833 0.01 7784 0.05
PhiX 14198683 14198683 100 0 0 0 0 0 0 0 0
lambda 14198683 14198683 100 0 0 0 0 0 0 0 0
UniVec 14198683 14183620 99.89 27 0 38 0 1496 0.01 13502 0.1
Bacterial_masked 14198683 14101866 99.32 200 0 62423 0.44 541 0 33653 0.24
Bact_Symbiont 14198683 14175648 99.84 2 0 110 0 18 0 22905 0.16
Mitoch 14198683 14127750 99.5 0 0 0 0 68693 0.48 2240 0.02
rRNA 14198683 12192293 85.87 0 0 0 0 380549 2.68 1625841 11.45
Target_Ref_genome 14198683 277861 1.96 8511350 59.94 3272369 23.05 50369 0.35 2086734 14.7
Cat_masked 14198683 14071938 99.12 484 0 126 0 74413 0.52 51722 0.36
Dog_masked 14198683 14085865 99.21 697 0 209 0 76317 0.54 35595 0.25
Mouse_masked 14198683 13967382 98.38 450 0 155 0 90013 0.63 140683 0.99
Human_masked 14198683 14121230 99.46 377 0 75 0 48239 0.34 28762 0.2
FASTQ_Screen for FINAL PROCESSED READS
Genome / Reference #Reads_processed #Unmapped %Unmapped #One_hit_one_genome %One_hit_one_genome #Multiple_hits_one_genome %Multiple_hits_one_genome #One_hit_multiple_genomes %One_hit_multiple_genomes Multiple_hits_multiple_genomes %Multiple_hits_multiple_genomes
adapters 11269161 11269161 100 0 0 0 0 0 0 0 0
PhiX 11269161 11269161 100 0 0 0 0 0 0 0 0
lambda 11269161 11269161 100 0 0 0 0 0 0 0 0
UniVec 11269161 11268923 100 0 0 0 0 139 0 99 0
Bacterial_masked 11269161 11252080 99.85 58 0 13305 0.12 262 0 3456 0.03
Bact_Symbiont 11269161 11266803 99.98 1 0 0 0 23 0 2334 0.02
Mitoch 11269161 11230197 99.65 0 0 0 0 38149 0.34 815 0.01
rRNA 11269161 11263548 99.95 0 0 0 0 1047 0.01 4566 0.04
Target_Ref_genome 11269161 101482 0.9 7978575 70.8 3115013 27.64 23426 0.21 50665 0.45
Cat_masked 11269161 11253212 99.86 8 0 9 0 4986 0.04 10946 0.1
Dog_masked 11269161 11251633 99.85 23 0 20 0 6081 0.05 11404 0.1
Mouse_masked 11269161 11251045 99.84 20 0 11 0 5271 0.05 12814 0.11
Human_masked 11269161 11256012 99.88 14 0 3 0 4471 0.04 8661 0.08
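For anyone who wants to compare the two tables numerically, here is a minimal Python sketch. It only re-uses counts copied from the tables above; the function names `pct_hit` and `pct_retained` are made up for illustration and are not part of FastQ_Screen. It shows how many reads survived pre-processing and how the fraction of reads with any hit to a given reference changed.

```python
# Counts copied from the two FastQ_Screen tables above.
# genome -> (#Reads_processed, #Unmapped)
RAW = {
    "rRNA": (14198683, 12192293),
    "Target_Ref_genome": (14198683, 277861),
    "Mouse_masked": (14198683, 13967382),
}
PROCESSED = {
    "rRNA": (11269161, 11263548),
    "Target_Ref_genome": (11269161, 101482),
    "Mouse_masked": (11269161, 11251045),
}


def pct_hit(reads_processed, unmapped):
    """Percentage of screened reads with at least one hit to the reference."""
    return 100.0 * (reads_processed - unmapped) / reads_processed


def pct_retained(raw, processed, genome="Target_Ref_genome"):
    """Percentage of raw reads still present after pre-processing."""
    return 100.0 * processed[genome][0] / raw[genome][0]


print(f"reads retained: {pct_retained(RAW, PROCESSED):.2f}%")
for genome in RAW:
    before = pct_hit(*RAW[genome])
    after = pct_hit(*PROCESSED[genome])
    print(f"{genome}: {before:.2f}% of raw reads hit, {after:.2f}% of processed reads hit")
```

Extending the two dictionaries with the remaining rows gives the same comparison for every reference in the screen.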
Why are you doing this, if I may ask? Generally, one goes genome fishing only if the data is NOT aligning to the expected genome at a high enough rate (it will never be 100%). Since you are aligning short reads, some background level of alignment is likely to happen by chance.
So what is an acceptable background level for a contaminant? 1%, 0.1%, 0.001%? Especially when the reference sequences being checked have been masked for sequences found in the target genome?
You seem to be approaching this from a different angle than many. If I have a reasonably high fraction of reads that align to the right genome, then I generally do not worry about what got left behind.
I think these contamination levels are OK. If I read this correctly, most of the contaminant reads map to multiple genomes? In that case, you're literally dealing with < 1% contamination. Furthermore, since your target is a plant, you should be safe to just map everything, and the mammalian reads should go to unmapped. To go one step further towards safety, you could also extract the contaminant reads, map them against your plant reference and see what happens.
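A rough Python sketch of the extraction step, under some assumptions: you already have a set of read IDs flagged as contaminant hits (how you get them depends on your screening setup and is not shown), and your FASTQ uses the plain four-lines-per-record layout. The helper names `read_fastq` and `select_reads` are hypothetical.

```python
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from FASTQ lines.

    Assumes the simple 4-lines-per-record layout (no wrapped sequences).
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield tuple(record)


def select_reads(lines, wanted_ids):
    """Keep only records whose read ID is in wanted_ids.

    The read ID is taken as the header text after '@', up to the first whitespace.
    """
    for header, seq, plus, qual in read_fastq(lines):
        read_id = header[1:].split()[0]
        if read_id in wanted_ids:
            yield header, seq, plus, qual


# Tiny in-memory demo; with real data, pass an open file handle instead of a list.
demo = [
    "@read1 1:N:0", "ACGT", "+", "IIII",
    "@read2 1:N:0", "GGCC", "+", "IIII",
]
kept = list(select_reads(demo, {"read2"}))
for record in kept:
    print("\n".join(record))
```

Writing the selected records to a new FASTQ file and running it through your usual aligner against the plant reference would then show whether those reads also map to the target.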
That is correct; going by the tables, most are reads in the 'One_hit_multiple_genomes' or 'Multiple_hits_multiple_genomes' categories.
My guess is that these reads are likely to contain / map to highly repetitive sequences common across both the plant and animal kingdoms...
It should be easy to take the 'contaminant reads' and map them to my plant ref. genome - thanks for that suggestion. Cheers!
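To put a number on the "most hits are shared between genomes" point, here is a small Python sketch using the four mammalian rows of the FINAL PROCESSED READS table. The counts are copied from that table; the function name `specific_share` is just an illustrative choice. It computes what percentage of the hits to each mammalian reference fall in the two one-genome-only categories.

```python
# From the FINAL PROCESSED READS table:
# genome -> (#One_hit_one_genome, #Multiple_hits_one_genome,
#            #One_hit_multiple_genomes, Multiple_hits_multiple_genomes)
PROCESSED_HITS = {
    "Cat_masked": (8, 9, 4986, 10946),
    "Dog_masked": (23, 20, 6081, 11404),
    "Mouse_masked": (20, 11, 5271, 12814),
    "Human_masked": (14, 3, 4471, 8661),
}


def specific_share(one_one, multi_one, one_multi, multi_multi):
    """Percentage of a reference's hits that hit no other screened genome."""
    specific = one_one + multi_one
    total = specific + one_multi + multi_multi
    return 100.0 * specific / total


for genome, counts in PROCESSED_HITS.items():
    print(f"{genome}: {specific_share(*counts):.2f}% of hits are genome-specific")
```

For all four references the genome-specific share comes out well under 1% of the hits, which fits the idea that these reads carry sequence shared across several of the screened genomes.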