Question

Tophat mapping less reads (and less properly paired) when more real -r is specified?

0

Entering edit mode

9.7 years ago

manekineko ▴ 150

First, I have run tophat with -r 50, however I have calculated than the mean inner distance is around 0

mean insert (195) - 2xread_size(100) = mean distance between mate pairs (-5).

I got with -5:

40915584 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
40915584 + 0 mapped (100.00%:-nan%)
40915584 + 0 paired in sequencing
20381317 + 0 read1
20534267 + 0 read2
21446582 + 0 properly paired (52.42%:-nan%)
33030608 + 0 with itself and mate mapped
7884976 + 0 singletons (19.27%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
52783519 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates

and with 50:

52783519 + 0 mapped (100.00%:-nan%)
52783519 + 0 paired in sequencing
26328045 + 0 read1
26455474 + 0 read2
33127790 + 0 properly paired (62.76%:-nan%)
42606738 + 0 with itself and mate mapped
10176781 + 0 singletons (19.28%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The library is RNA-seq mapped to hg19, the total reads in FASTQ R1 are 78751541, so I'm confused.

tophat RNA-Seq • 2.6k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 9.7 years ago by manekineko ▴ 150

0

Entering edit mode

What are your read lengths? 100? 250? 300? Pair end or no?

ADD REPLY • link 9.7 years ago by jnf3769 ▴ 40

0

Entering edit mode

paired-end 100nt

ADD REPLY • link 9.7 years ago by manekineko ▴ 150

0

Entering edit mode

This isn't likely a sufficient answer to your question, but when I use tophat (for rough analysis), I tend to use -r with 1/3 the read length. Giving tophat too high a number for -r has always caused me problems in the past. When I went to the documentation, it gives the example of entering -r 200 for 300bp (paired end) reads.

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 9.7 years ago by jnf3769 ▴ 40

0

Entering edit mode

Why you have different numbers of starting reads (40915584 vs 52783519) for both cases?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.7 years ago by Ashutosh Pandey 12k

0

Entering edit mode

These I think are not starting reads but a mapped ones.

ADD REPLY • link 9.7 years ago by manekineko ▴ 150

0

Entering edit mode

My bad. This is an output from samtools flagstat. I thought this is a log file from Tophat2. I definitely need some coffee. Can't say whats going on right now but I would definitely go with results from the second case.

ADD REPLY • link updated 2.4 years ago by Ram 45k • written 9.7 years ago by Ashutosh Pandey 12k

Ram · Answer 1 · 2015-07-28

0

Entering edit mode

9.7 years ago

Ashutosh Pandey 12k

Take any of your bam file and run the following command:

samtools view Input.bam | cut -f 9 | head -30000 | sort -n | awk '{ a[i++]=$1; } END { print a[int(i/2)]; }'

This command will calculate an approximate value for the median of the insert sizes. Now subtract 2 * read length from the median value to get the mean insert distance.

Read this post for more explanation: How Are Paired-End Read Insert Sizes Inferred For Reporting In A Sam File?

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 9.7 years ago by Ashutosh Pandey 12k

0

Entering edit mode

I have already calculated the median/mean insert size with Picard and calculate the distance to be -5 - my question is my tophat result is worse when I input the exact calculated distance than the run before with the default value of 50?!

ADD REPLY • link 9.7 years ago by manekineko ▴ 150

Ram · Answer 2 · 2015-07-28

Why don't u see the bio analyser profile for the fragment length distribution and infer the insert length? Or the best way would be to map both the r1 and r2 reads separately and calculate the insert length from the resulting two bam files. This will overcome the alignment bias introduced by specifying certain mean insert length.

The default mean insert length of 50 may be due to the fact that the standard Illumina rna-seq libraries prepared for 100bp paired end run will have a fragment length of 250bp before they go into sequencer. So, when 100bp read from both the ends, the inner distance left will be 50bp. This will work for most of the Illumina 100bp PE data. But if your data is different, you need to calculate the insert size or ask people who have prepared the libraries.