After mapping my paired-end RNA-seq reads with TopHat using three different option sets (given below), I obtained the mapping summary shown in the table below.
1) Without reference annotation:
tophat -p 8 -r 50 -o "output" "indexed_genome_file" R1.fastq R2.fastq
2) With reference annotation:
tophat -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq
3) With reference annotation disabling novel junctions:
tophat --no-novel-juncs -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq
Mapping reads to genome with TopHat

                        With reference annotation           With reference annotation,          Without reference annotation
                                                            disabling novel junctions
Left reads
  Input                 71926313                            71926313                            71926313
  Mapped                62199375 (86.5% of input)           60663645 (84.3% of input)           61835864 (86.0% of input)
  Multiple alignment    10352306 (16.6%; 477254 have >20)   11540865 (19.0%; 508565 have >20)   15034571 (24.3%; 665149 have >20)
Right reads
  Input                 71926313                            71926313                            71926313
  Mapped                62071170 (86.3% of input)           60575371 (84.2% of input)           61694450 (85.8% of input)
  Multiple alignment    10352990 (16.7%; 477253 have >20)   11553883 (19.1%; 508573 have >20)   15030545 (24.4%; 665010 have >20)
Overall mapping rate    86.40%                              84.30%                              85.90%
Aligned pairs           57244789                            55041529                            56591033
  Multiple alignment    9527350 (16.6%)                     10609333 (19.3%)                    13776265 (24.3%)
  Discordant alignment  4048217 (7.1%)                      4044795 (7.3%)                      3755391 (6.6%)
Concordant alignment    74.00%                              70.90%                              73.50%
No. of                  144075                              97906                               140296
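As a sanity check on the table, the two summary percentages can be recomputed from the raw counts. This is a minimal sketch, assuming the overall mapping rate is (mapped left + mapped right) / (2 x input pairs) and the concordant rate is (aligned pairs - discordant pairs) / input pairs. Those formulas are my reading of the numbers rather than something stated in the post, and the names `RUNS`, `overall_mapping_rate`, `concordant_rate` and `summarize` are made up for illustration.

```python
# Counts copied from the mapping summary table above.
INPUT_PAIRS = 71926313  # same input for all three runs

RUNS = {
    "with annotation": dict(
        left=62199375, right=62071170, pairs=57244789, discordant=4048217),
    "with annotation, no novel junctions": dict(
        left=60663645, right=60575371, pairs=55041529, discordant=4044795),
    "without annotation": dict(
        left=61835864, right=61694450, pairs=56591033, discordant=3755391),
}


def overall_mapping_rate(left, right, input_pairs=INPUT_PAIRS):
    """Percentage of all input reads (both mates) that were mapped.

    Assumed formula: (left + right) / (2 * input_pairs).
    """
    return 100.0 * (left + right) / (2 * input_pairs)


def concordant_rate(pairs, discordant, input_pairs=INPUT_PAIRS):
    """Percentage of input pairs aligned concordantly.

    Assumed formula: (aligned pairs - discordant pairs) / input_pairs.
    """
    return 100.0 * (pairs - discordant) / input_pairs


def summarize(runs):
    """Return {run name: (overall %, concordant %)}, rounded to one decimal."""
    return {
        name: (
            round(overall_mapping_rate(c["left"], c["right"]), 1),
            round(concordant_rate(c["pairs"], c["discordant"]), 1),
        )
        for name, c in runs.items()
    }


if __name__ == "__main__":
    for name, (overall, concordant) in summarize(RUNS).items():
        print(f"{name}: overall {overall}%, concordant {concordant}%")
```

Under these assumptions the recomputed values agree with the table to one decimal place, and the three runs differ by only about two percentage points in overall mapping rate.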
Accordingly, I thought "with reference annotation" was the best one. But when I viewed the BAM file together with the junctions, I found a lot of high-depth junctions between very distantly located genes. My genes of interest are duplicate genes. I guess that pre-filtering multi-mapped reads, along with some other arguments, will further improve the mapping, so I thought of running the mapping with the following options:
tophat \
-p 8 \
-G genes.gtf \
-o SRX528281_tophat_RABM_Prefilter \
--no-mixed \
--no-discordant \
--max-multihits 2 \
--prefilter-multihits \
--read-realign-edit-dist 0 \
Genome \
R1.fastq R2.fastq
Is my approach correct? Will the included options improve the mapping without excluding important information? Any suggestions will be highly appreciated.
I think you are just making it complicated. If you have a GTF file, just use it. If you are not interested in novel transcripts, disable novel junctions (--no-novel-juncs).
Anyway, quantitative changes like these exist even if you run the tool with the same set of options multiple times.
Some of the duplicate genes are highly similar in sequence. So I think that if I do not filter the multiple hits, some false-positive novel junctions will be reported as significant. Although I have a GTF file, the genes I am interested in are mostly predicted. So I am trying to use the RNA-seq results to re-annotate the genes and also to check whether the genes have isoforms.
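One way to look at the suspicious junctions directly, before changing the mapping options, is to list the high-depth, long-span entries in TopHat's junctions.bed. The sketch below assumes the usual BED12 layout of that file: column 5 holds the number of supporting reads, and the two blocks are the flanking overhangs, so the intron runs from start + first block size to end - last block size. The thresholds `min_depth=50` and `min_span=50000` are arbitrary placeholders, not recommended values, and the function names are made up for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Junction(NamedTuple):
    chrom: str
    intron_start: int  # 0-based start of the spliced-out region
    intron_end: int    # end of the spliced-out region
    depth: int         # reads supporting the junction (BED score column)
    strand: str


def parse_junctions(lines: Iterable[str]) -> Iterator[Junction]:
    """Parse BED12 junction lines, skipping track headers and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue
        f = line.rstrip("\n").split("\t")
        start, end = int(f[1]), int(f[2])
        # Block sizes are the left and right overhangs flanking the intron.
        block_sizes = [int(x) for x in f[10].rstrip(",").split(",")]
        yield Junction(
            chrom=f[0],
            intron_start=start + block_sizes[0],
            intron_end=end - block_sizes[-1],
            depth=int(f[4]),
            strand=f[5],
        )


def suspicious_junctions(
    junctions: Iterable[Junction],
    min_depth: int = 50,      # placeholder threshold
    min_span: int = 50_000,   # placeholder threshold, in bases
) -> List[Junction]:
    """Keep junctions that are both well supported and unusually long."""
    return [
        j for j in junctions
        if j.depth >= min_depth and (j.intron_end - j.intron_start) >= min_span
    ]


if __name__ == "__main__":
    # The path is an assumption: TopHat writes junctions.bed in its -o directory.
    with open("tophat_RABM/junctions.bed") as handle:
        for j in suspicious_junctions(parse_junctions(handle)):
            print(j.chrom, j.intron_start, j.intron_end, j.depth, j.strand, sep="\t")
```

Comparing this list between the runs with and without the multihit filtering would show whether the long junctions between duplicate genes actually disappear, rather than relying on the overall mapping rate alone.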