Question

The left reads map 95% while the right reads map only 0.7% using tophat

0

9.4 years ago

s_halawa • 0

I have a problem. I am trying to align paired-end whole-transcriptome data from restrictive cardiomyopathy samples using TopHat, and I get the summary below. This is the command I ran:

tophat2 -o trialtophatRestrictive3 -p 20 -G /data2/hg19/genes.gtf /data2/hg19/genome SRR2138385_1.fastq SRR2138385_2.fastq

Left reads:
    Input     : 56776839
    Mapped    : 54092930 (95.3% of input)
      of these: 3069097 ( 5.7%) have multiple alignments (4267 have >20)
Right reads:
    Input     : 56776839
    Mapped    : 420422 ( 0.7% of input)
      of these: 21837 ( 5.2%) have multiple alignments (32 have >20)
48.0% overall read mapping rate.

Aligned pairs: 411231
    of these: 21368 ( 5.2%) have multiple alignments
              8215 ( 2.0%) are discordant alignments
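
As a sanity check on these numbers, the 48.0% overall rate is what you get by pooling the left and right reads, so it mostly reflects the left file and hides how poorly the right file maps. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the formula is an assumption that happens to reproduce the reported figure, not something taken from TopHat's source.

```python
def overall_mapping_rate(left_input, left_mapped, right_input, right_mapped):
    """Percentage of all input reads (left + right pooled) that mapped."""
    return 100.0 * (left_mapped + right_mapped) / (left_input + right_input)


# Counts copied from the alignment summary in the question.
rate = overall_mapping_rate(56776839, 54092930, 56776839, 420422)
print(f"{rate:.1f}% overall read mapping rate")
```

Because only about 0.7% of right reads map, the number of aligned pairs (411231) is capped by the right file no matter how well the left file does.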

The left reads map at 95% while the right reads map at only 0.7%. What should I do?

I ran FastQC before TopHat, and the reports for the forward and reverse reads both looked fine.

I appreciate your help

Thanks a lot

Sarah

next-gen RNA-Seq alignment • 3.4k views

ADD COMMENT • link updated 2.1 years ago by Ram 45k • written 9.4 years ago by s_halawa • 0

1

What is the length distribution of the left and right reads according to the FastQC reports?

ADD REPLY • link 9.4 years ago by Biomonika (Noolean) 3.2k

0

Hey Noolean, thank you so much for your reply, I appreciate your effort!

The data were on ArrayExpress, already split into forward and reverse files: http://www.ebi.ac.uk/ena/data/view/SRS1019180

I only ran FastQC on each file (forward and reverse) and got good quality reports:

Only per base sequence content, per sequence GC content, sequence duplication levels and Kmers were marked with an x. The rest were marked as correct.

The length distribution of the left and right reads according to the FastQC reports is exactly the same:

from 98 to 100 with a peak at 99.

What do you suggest I should do?

Thank you so much,

Sarah

ADD REPLY • link 9.4 years ago by s_halawa • 0

1

Check the error rates. R2 reads are usually lower in quality than R1, so trimming for quality might help.
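
One rough way to compare the two files directly is to compute the mean Phred quality at each read position for R1 and for R2 and look at where they diverge. The sketch below is only an illustration: it assumes plain four-line FASTQ records and Phred+33 encoding, and the function names are hypothetical. It is not a substitute for the FastQC per-base quality plot.

```python
import gzip
from itertools import islice


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz files."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def mean_quality_by_position(path, max_reads=100000, offset=33):
    """Mean Phred score per read position over the first max_reads records.

    Assumes four lines per record and Phred+33 quality encoding.
    """
    sums, counts = [], []
    with open_fastq(path) as handle:
        for i, line in enumerate(islice(handle, 4 * max_reads)):
            if i % 4 != 3:  # only the 4th line of each record holds qualities
                continue
            for pos, ch in enumerate(line.rstrip("\n")):
                if pos == len(sums):  # first time we see a read this long
                    sums.append(0)
                    counts.append(0)
                sums[pos] += ord(ch) - offset
                counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]
```

Calling `mean_quality_by_position("SRR2138385_1.fastq")` and the same for `SRR2138385_2.fastq` gives two lists that can be compared position by position. Similar profiles would suggest that quality alone does not explain a 95% versus 0.7% mapping gap.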

ADD REPLY • link 9.4 years ago by apelin20 ▴ 490

0

Hey apelin20, thank you so much for your reply, I appreciate your effort!

The data were on ArrayExpress, already split into forward and reverse files: http://www.ebi.ac.uk/ena/data/view/SRS1019180

I only ran FastQC on each file (forward and reverse) and got good quality reports:

Only per base sequence content, per sequence GC content, sequence duplication levels and Kmers were marked with an x. The rest were marked as correct.

What do you mean by error rates? Do you think I need trimming? There are no signs of adapters or primers in the overrepresented sequences.

Thank you,

Sarah

ADD REPLY • link 9.4 years ago by s_halawa • 0

score 3 · Answer 1 · 2016-01-04

3

9.4 years ago

Antonio R. Franco ★ 5.2k

I wonder whether your FASTQ files ended up out of sync after trimming for quality. It is quite possible that a read in one file was removed entirely by the trimming, leaving its mate orphaned in the other file. If so, the number of "properly paired mapped" reads will drop dramatically, because I think bowtie2 expects synchronized FASTQ files.

Does TopHat's summary report statistics for properly paired reads only?

Why don't you simply try mapping the second file alone? That will give you a hint.

ADDED NOTE: I have noticed that FASTQ extraction from SRA archives with fastq-dump works much better with the legacy --split-3 option. In some cases (not always) a third FASTQ file is created that contains all the orphan reads. With --split-files there is no guarantee that your FASTQ files are synchronized.
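
A quick way to test the synchronization idea is to walk both files in parallel and compare read names record by record. The following is a minimal sketch, not a validated tool: it assumes uncompressed four-line FASTQ records and that mates share the first whitespace-delimited token of the header, optionally followed by /1 or /2. The function names are hypothetical.

```python
from itertools import islice, zip_longest


def read_name(header):
    """Read name without '@', trailing description, or a /1 or /2 suffix."""
    fields = header.strip().lstrip("@").split()
    name = fields[0] if fields else ""
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def headers(path):
    """Yield every 4th line (the header line) of an uncompressed FASTQ file."""
    with open(path) as handle:
        for line in islice(handle, 0, None, 4):
            yield line


def count_mismatched_pairs(left_path, right_path):
    """Return (records_compared, mismatches).

    A record counts as a mismatch when the names differ or when one file
    runs out of records before the other.
    """
    compared = mismatches = 0
    for left, right in zip_longest(headers(left_path), headers(right_path)):
        compared += 1
        if left is None or right is None or read_name(left) != read_name(right):
            mismatches += 1
    return compared, mismatches
```

If `count_mismatched_pairs("SRR2138385_1.fastq", "SRR2138385_2.fastq")` reports zero mismatches, the files are in the same order and the problem lies elsewhere. A large mismatch count starting early in the files would point to a dropped or reordered read.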

ADD COMMENT • link 9.4 years ago by Antonio R. Franco ★ 5.2k

0

Hey Antonio, thank you so much for your reply, I appreciate your effort!

The data were on ArrayExpress, already split into forward and reverse files: http://www.ebi.ac.uk/ena/data/view/SRS1019180

I only ran FastQC on each file (forward and reverse) and got good quality reports:

Only per base sequence content, per sequence GC content, sequence duplication levels and Kmers were marked with an x. The rest were marked as correct.

So I ran TopHat right away without any filtering and got the aforementioned alignment summary.

What do you suggest I should do?

Thank you,

Sarah

ADD REPLY • link 9.4 years ago by s_halawa • 0

1

Run TopHat with only the poorly mapping file.
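
For reference, a single-end run reuses the options from the original command and passes just one FASTQ file after the index. The sketch below only builds and prints that command line. The output directory name `tophat_right_only` and the helper function are made up for illustration, and the index and GTF paths are copied from the question.

```python
import shlex


def tophat_single_end_command(reads, out_dir,
                              index="/data2/hg19/genome",
                              gtf="/data2/hg19/genes.gtf",
                              threads=20):
    """Build a tophat2 argument list for one FASTQ file (single-end mode)."""
    return ["tophat2", "-o", out_dir, "-p", str(threads),
            "-G", gtf, index, reads]


# Map only the right (poorly mapping) file; out_dir is a hypothetical name.
cmd = tophat_single_end_command("SRR2138385_2.fastq", "tophat_right_only")
print(shlex.join(cmd))
```

If the right reads map well on their own, the pairing or ordering of the two files is the likely culprit. If they still map at under 1%, the problem is in the right file itself.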

ADD REPLY • link 9.4 years ago by Antonio R. Franco ★ 5.2k