I have a problem. I am trying to align paired-end data (whole transcriptome) of restrictive heart cardiomyopathy using TopHat, and I get the summary below. This is the command I ran:
tophat2 -o trialtophatRestrictive3 -p 20 -G /data2/hg19/genes.gtf /data2/hg19/genome SRR2138385_1.fastq SRR2138385_2.fastq
Left reads:
    Input     : 56776839
    Mapped    : 54092930 (95.3% of input)
    of these  :  3069097 ( 5.7%) have multiple alignments (4267 have >20)
Right reads:
    Input     : 56776839
    Mapped    :   420422 ( 0.7% of input)
    of these  :    21837 ( 5.2%) have multiple alignments (32 have >20)
48.0% overall read mapping rate.
Aligned pairs : 411231
    of these  :  21368 ( 5.2%) have multiple alignments
                  8215 ( 2.0%) are discordant alignments
The left reads map at 95.3%, while the right reads map at only 0.7%. What should I do?
I ran FastQC before TopHat, and the reports for the forward and reverse reads were fine.
I appreciate your help
Thanks a lot
Sarah
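As a sanity check on the numbers in the summary, here is a small sketch that recomputes the percentages from the raw counts. It assumes the overall rate is simply the pooled mapped reads over the pooled input reads, which appears consistent with the reported 48.0%; the `percent` helper is only for illustration.

```python
# Read counts copied from the TopHat summary in the question.
left_input, left_mapped = 56776839, 54092930
right_input, right_mapped = 56776839, 420422
aligned_pairs = 411231


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


left_rate = percent(left_mapped, left_input)
right_rate = percent(right_mapped, right_input)
# Assumption: the overall rate pools both mates (mapped / input).
overall_rate = percent(left_mapped + right_mapped, left_input + right_input)
# Fraction of input pairs where both mates aligned.
pair_rate = percent(aligned_pairs, left_input)

print(left_rate, right_rate, overall_rate, pair_rate)
```

So the 48.0% figure is essentially the left reads alone, and fewer than 1% of the input pairs have both mates aligned.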
What's the length distribution of the left and right reads according to the FastQC report?
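If the FastQC report is not at hand, the length distribution can also be tallied directly from the FASTQ files. This is only a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); the function name `length_distribution` is made up for illustration.

```python
from collections import Counter


def length_distribution(lines):
    """Count reads by sequence length.

    `lines` is any iterable of FASTQ lines (for example an open file).
    Assumes strict four-line records: header, sequence, '+', quality.
    """
    lengths = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 == 1:  # second line of each record is the sequence
            lengths[len(line.rstrip("\r\n"))] += 1
    return lengths
```

Usage would look something like `with open("SRR2138385_2.fastq") as fh: print(sorted(length_distribution(fh).items()))`, run once per mate file so the two distributions can be compared.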
Hey Noolean, thank you so much for your reply, I appreciate your effort!
The data was on ArrayExpress, already split into forward and reverse files: http://www.ebi.ac.uk/ena/data/view/SRS1019180
I only ran FastQC on each file (forward and reverse) and got good quality reports:
Only per base sequence content, per sequence GC content, sequence duplication levels and Kmers were marked with an x. The rest were marked as correct.
The length distribution of the left and right reads according to the FastQC report is exactly the same:
from 98 to 100 with a peak at 99.
What do you suggest I should do?
Thank you so much,
Sarah
Check error rates. R2 reads are usually lower in quality than R1, so trimming for quality might help.
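One way to look at error rates without extra tools is to convert the Phred quality characters into expected error probabilities (p = 10^(-Q/10)) and average them per read position. The sketch below assumes Phred+33 encoded qualities and plain four-line FASTQ records; `mean_error_by_position` is a hypothetical helper name, not part of any existing tool.

```python
def mean_error_by_position(lines, phred_offset=33):
    """Return the mean expected error probability at each read position.

    `lines` is any iterable of FASTQ lines (for example an open file).
    Assumption: qualities are Phred+33 encoded, four lines per record.
    """
    totals = []  # summed error probability per position
    counts = []  # number of reads covering each position
    for line_number, line in enumerate(lines):
        if line_number % 4 != 3:  # fourth line of each record is the quality
            continue
        for pos, char in enumerate(line.rstrip("\r\n")):
            if pos == len(totals):
                totals.append(0.0)
                counts.append(0)
            quality = ord(char) - phred_offset
            totals[pos] += 10 ** (-quality / 10)
            counts[pos] += 1
    return [total / count for total, count in zip(totals, counts)]
```

Running this on both `SRR2138385_1.fastq` and `SRR2138385_2.fastq` and comparing the two lists position by position would show whether R2 degrades noticeably, especially toward the 3' end.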
Hey apelin20, thank you so much for your reply, I appreciate your effort!
The data was on ArrayExpress, already split into forward and reverse files: http://www.ebi.ac.uk/ena/data/view/SRS1019180
I only ran FastQC on each file (forward and reverse) and got good quality reports:
Only per base sequence content, per sequence GC content, sequence duplication levels and Kmers were marked with an x. The rest were marked as correct.
What do you mean by error rates? Do you think I need trimming? There are no signs of adapters or primers in the overrepresented sequences.
Thank you,
Sarah