7.2 years ago
Sachin
Hi, we have sequenced some Drosophila DNA. I ran FastQC and the results were good except for the duplicate sequences section. I did not trim the data before alignment. With two datasets we are getting very different mapping rates. Very low with one:
54378379 reads; of these:
54378379 (100.00%) were unpaired; of these:
51703307 (95.08%) aligned 0 times
1724019 (3.17%) aligned exactly 1 time
951053 (1.75%) aligned >1 times
4.92% overall alignment rate
Better with the other:
64029342 reads; of these:
64029342 (100.00%) were unpaired; of these:
16392556 (25.60%) aligned 0 times
40232444 (62.83%) aligned exactly 1 time
7404342 (11.56%) aligned >1 times
74.40% overall alignment rate
What could the reason for this be?
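For readers comparing the two summaries, here is a minimal sketch of how the overall alignment rate follows from the three per-category counts that bowtie2 reports for unpaired reads. The function name `overall_alignment_rate` is made up for illustration; the numbers are copied from the two summaries above.

```python
def overall_alignment_rate(aligned_0, aligned_1, aligned_multi):
    """Return (total reads, percent of reads aligned at least once)."""
    total = aligned_0 + aligned_1 + aligned_multi
    return total, 100.0 * (aligned_1 + aligned_multi) / total

# Counts taken from the two bowtie2 summaries in this post.
low_total, low_rate = overall_alignment_rate(51703307, 1724019, 951053)
high_total, high_rate = overall_alignment_rate(16392556, 40232444, 7404342)

print(low_total, f"{low_rate:.2f}%")    # 54378379 4.92%
print(high_total, f"{high_rate:.2f}%")  # 64029342 74.40%
```

In both cases the three categories add up to the count on the "were unpaired" line, so the summaries are internally consistent and the difference is in the reads themselves.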
One other trick you can try:
Will show you the top 10 most common unmapped reads. The command will take some time to finish, but those sequences might be more useful for blasting than randomly chosen unmapped reads.
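The command itself is not preserved in this copy of the thread. As a rough sketch of the same idea (not the original command), the Python below counts the most common read sequences among unmapped records in a SAM file, assuming a plain-text SAM such as the `File1.sam` produced by the bowtie2 command in this thread. The function name `top_unmapped_sequences` is hypothetical. It relies only on the SAM layout: the FLAG is column 2, bit 0x4 means "read unmapped", and the sequence is column 10.

```python
from collections import Counter

def top_unmapped_sequences(sam_lines, n=10):
    """Return the n most common sequences among unmapped SAM records."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):      # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:          # not a complete alignment record
            continue
        if int(fields[1]) & 0x4:      # FLAG bit 0x4: read is unmapped
            counts[fields[9]] += 1    # column 10 holds the read sequence
    return counts.most_common(n)

# Tiny in-memory example; for a real file, pass an open file handle instead,
# e.g. with open("File1.sam") as handle: top_unmapped_sequences(handle)
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2L\t100\t42\t4M\t*\t0\t0\tTTTT\tIIII",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII",
]
print(top_unmapped_sequences(example))  # [('ACGT', 2), ('GGCC', 1)]
```

With tens of millions of unmapped reads the counter can use a lot of memory, so it may be worth running this on a subset of the file first.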
Did you use the correct reference genome for the first dataset? Can you please share the command used in the analysis?
Yes. I used the same genome for both datasets. Here is the command I used: bowtie2 -p 12 -x /bowtie2index/dm6 -U File1.fq -S File1.sam
Any time there is an unexpectedly low % mapping, you need to take a small random selection of reads and BLAST them at NCBI. If you have a problem with contamination of some sort, it will quickly become apparent.
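One way to pull such a random selection is sketched below, assuming an uncompressed FASTQ with exactly four lines per record (as in `File1.fq`). It uses reservoir sampling, so the file is read once and only `k` reads are held in memory, and it returns FASTA text that can be pasted into the NCBI BLAST web form. The function name `sample_fastq_as_fasta` and its parameters are illustrative, not from any library. To sample only reads that failed to align, bowtie2's `--un` option can write the unaligned unpaired reads to a separate file first.

```python
import random

def sample_fastq_as_fasta(fastq_lines, k=20, seed=0):
    """Reservoir-sample k reads from 4-line FASTQ records; return FASTA text."""
    rng = random.Random(seed)         # fixed seed so the sample is reproducible
    reservoir = []
    record = []
    n_seen = 0
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:          # header, sequence, '+', qualities
            n_seen += 1
            name, seq = record[0][1:], record[1]
            if len(reservoir) < k:
                reservoir.append((name, seq))
            else:
                j = rng.randrange(n_seen)
                if j < k:             # keep the new read with probability k/n_seen
                    reservoir[j] = (name, seq)
            record = []
    return "".join(f">{name}\n{seq}\n" for name, seq in reservoir)

# Small synthetic example; for a real file, pass an open file handle instead.
demo = []
for i in range(100):
    demo += [f"@read{i}", "ACGT", "+", "IIII"]
fasta = sample_fastq_as_fasta(demo, k=5)
print(fasta)
```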
I did this recently with a mapping rate of 40%. The data was supposed to be mouse data, but BLAST matched a subset to mouse and also to human. After raising the issue with the sequencing company, they confirmed the sample was contaminated with human DNA (don't even get me started on why they didn't check for this before sending us results!).
Great suggestion genomax!
The alignment for the first sample is pretty shocking (i.e. poor). It's as if the DNA was from a different genus. In fact, I have aligned human DNA to a mouse genome in the past and achieved better alignment.
I'm getting similar mapping results against dm6 too: some samples are lower than 5% and some are higher than 70%. Did you solve your mapping problem? Any suggestions? Thank you!
What data has been sequenced: genome or transcriptome?
Also, what are you mapping to: genome or transcriptome?
For mapping RNA-seq data onto a genome, a splice-aware aligner such as HISAT, TopHat or STAR is recommended.
I am having poor alignment rates too (less than 30% with a congeneric species!). Is this what you are supposed to get, considering that my data is GBS (short reads) with a max length of 90 bp (but mostly shorter)?
Thanks for the help
Probably not. Have you checked some of the non-mapping reads via BLAST to see what they are?