Dear all,
I downloaded a single-end FASTQ file from an article published in 2012. When I mapped the reads to the reference genome with bowtie2, the mapping rate was less than 1%; almost none of the reads mapped. The same FASTQ file has been used by others in another paper, and the original paper reports an alignment rate above 95%. It seems that something is wrong with my alignment method. Could you help me find out what it is?
My script is very simple:
bowtie2 -x bowtie2_np7_index -U SRR094109_1.fastq -S test.sam
Looking forward to hearing from you, thank you!
Aifu.
This question has been asked multiple times. Any time you have unexplained low alignment rates, you should take a sample of the reads and BLAST them at NCBI to make sure you have the right sample and there is no random contamination.
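A minimal sketch of the sampling step described above, in Python. It assumes a plain, uncompressed FASTQ file with exactly four lines per record; the function name `sample_fastq_to_fasta` and the output file name are hypothetical and chosen only for illustration. It draws a uniform random sample of reads (reservoir sampling, so the whole file never has to fit in memory) and writes them as FASTA, which can be pasted into the NCBI BLAST web form.

```python
import random


def sample_fastq_to_fasta(fastq_path, fasta_path, n=10, seed=0):
    """Write n randomly chosen reads from a FASTQ file as FASTA.

    Assumes an uncompressed FASTQ with exactly 4 lines per record.
    Returns the number of reads written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(fastq_path) as fq:
        i = 0
        while True:
            header = fq.readline()
            if not header:
                break  # end of file
            seq = fq.readline().strip()
            fq.readline()  # '+' separator line, ignored
            fq.readline()  # quality line, not needed for BLAST
            record = (header.strip()[1:], seq)  # drop the leading '@'
            if i < n:
                reservoir.append(record)
            else:
                # Replace an existing entry with probability n / (i + 1)
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = record
            i += 1
    with open(fasta_path, "w") as fa:
        for name, seq in reservoir:
            fa.write(f">{name}\n{seq}\n")
    return len(reservoir)


# Example call (file names are assumptions):
# sample_fastq_to_fasta("SRR094109_1.fastq", "sample_for_blast.fasta", n=10)
```

Changing `seed` gives a different random set, which is useful if you want to look at several independent samples.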
Yes, I BLASTed 5 random reads at NCBI, but did not get any similar genome sequence. Each read is 36 nt. I feel a little upset about this situation.
Moreover, for an SRA file from another study, whose reads are 35 nt long, the alignment is quite normal. I wonder whether there is some other mistake in my script.
I had a look at three sets of 10-15 random reads from SRR094109, and nothing seems to be showing up at NCBI BLAST with megablast or discontiguous megablast. But with plain blastn you do get results back, and they mostly go to plants. Even then, only about 21 out of 36 bp seem to match in most reads, so this seems to be a particularly bad dataset, which may require explicit scanning and trimming before you align.
Thanks for your help, genomax! I also aligned the reads to the rice reference genome, and only about 21 out of 36 bp matched. I ran FastQC on the FASTQ file and it looks quite OK, with no adapters present. The original paper says they used MAQ with a 1 bp mismatch allowed, which is strict.
I've sent an email to the author for help.
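For the trimming suggested above, here is a rough Python sketch that hard-trims a fixed number of bases from either end of every read before alignment. The thread does not say which part of each read is the matching ~21 bp, so the trim amounts are assumptions you would need to tune (for example, by checking where the BLAST hits start and end). The function name `trim_fastq` and the file names are hypothetical. bowtie2 also has its own `-5/--trim5` and `-3/--trim3` options, which do the same kind of fixed trimming at alignment time.

```python
def trim_fastq(in_path, out_path, trim5=0, trim3=0, min_len=1):
    """Remove trim5 bases from the 5' end and trim3 bases from the 3' end.

    Assumes an uncompressed FASTQ with exactly 4 lines per record.
    Reads shorter than min_len after trimming are dropped.
    Returns the number of reads written.
    """
    kept = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            header = fin.readline()
            if not header:
                break  # end of file
            seq = fin.readline().rstrip("\n")
            fin.readline()  # '+' separator line
            qual = fin.readline().rstrip("\n")
            end = max(len(seq) - trim3, 0)
            # Trim sequence and quality string identically
            seq, qual = seq[trim5:end], qual[trim5:end]
            if len(seq) < min_len:
                continue
            fout.write(f"{header.rstrip()}\n{seq}\n+\n{qual}\n")
            kept += 1
    return kept


# Example call (values are assumptions, not taken from the dataset):
# trim_fastq("SRR094109_1.fastq", "SRR094109_1.trimmed.fastq", trim3=15)
```

The trimmed file can then be passed to bowtie2 with `-U` in place of the original FASTQ.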