Question

Selecting best RNA-seq mapping with TopHat

10.7 years ago

mjoyraj ▴ 80

I mapped my paired-end (PE) RNA-seq reads with TopHat using three different sets of options (given below) and obtained the mapping summary shown in the table that follows.

1) Without reference annotation:

tophat -p 8 -r 50 -o "output" "indexed_genome_file" R1.fastq R2.fastq

2) With reference annotation:

tophat -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq

3) With reference annotation disabling novel junctions:

tophat --no-novel-juncs -p 8 -G "genes.gtf" -o "tophat_RABM" "Genome" R1.fastq R2.fastq

Mapping reads to genome with TopHat

                            With reference              With reference              Without
                            annotation                  annotation disabling        reference
                                                        novel junctions             annotation

Left reads
  Input                     71926313                    71926313                    71926313
  Mapped                    62199375 (86.5% of input)   60663645 (84.3% of input)   61835864 (86.0% of input)
    Multiple alignment      10352306 (16.6%)            11540865 (19.0%)            15034571 (24.3%)
                            (477254 have >20)           (508565 have >20)           (665149 have >20)

Right reads
  Input                     71926313                    71926313                    71926313
  Mapped                    62071170 (86.3% of input)   60575371 (84.2% of input)   61694450 (85.8% of input)
    Multiple alignment      10352990 (16.7%)            11553883 (19.1%)            15030545 (24.4%)
                            (477253 have >20)           (508573 have >20)           (665010 have >20)

Overall mapping rate        86.40%                      84.30%                      85.90%
  Aligned pairs             57244789                    55041529                    56591033
    Multiple alignment      9527350 (16.6%)             10609333 (19.3%)            13776265 (24.3%)
    Discordant alignment    4048217 (7.1%)              4044795 (7.3%)              3755391 (6.6%)
    Concordant alignment    74.00%                      70.90%                      73.50%

No. of                      144075                      97906                       140296
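
The percentages in the table can be re-derived from the raw counts, which is a quick sanity check when comparing runs. Below is a minimal sketch using the "with reference annotation" column. The helper name `pct` is made up for illustration, and the formulas (overall rate = mapped left plus right reads over twice the input pairs; concordant rate = aligned minus discordant pairs over input pairs) are an assumption that happens to reproduce the reported values.

```shell
#!/usr/bin/env bash
# pct NUM DEN: print NUM as a percentage of DEN with one decimal place.
# (Hypothetical helper, not part of TopHat.)
pct() {
  LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f\n", 100 * n / d }'
}

# Counts copied from the "with reference annotation" column of the table.
input=71926313                       # input read pairs
left=62199375; right=62071170        # mapped left and right reads
pairs=57244789; discordant=4048217   # aligned pairs and discordant pairs

pct "$((left + right))" "$((2 * input))"   # overall mapping rate: 86.4
pct "$((pairs - discordant))" "$input"     # concordant pair rate: 74.0
```

Swapping in the counts from the other two columns gives the corresponding rates for those runs.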

Based on these numbers, I thought the run "with reference annotation" was the best one. But when I viewed the BAM file together with the junctions, I found a lot of high-depth junctions between very distantly located genes. My genes of interest are duplicate genes. I guess that pre-filtering multihits, along with some other arguments, will further improve the mapping, so I thought of running the mapping with the following options:

tophat \
  -p 8 \
  -G genes.gtf \
  -o SRX528281_tophat_RABM_Prefilter \
  --no-mixed \
  --no-discordant \
  --max-multihits 2 \
  --prefilter-multihits \
  --read-realign-edit-dist 0 \
  Genome \
  R1.fastq R2.fastq

Is my approach correct? Will the options I have included improve the mapping without excluding important information? Any suggestions will be highly appreciated.
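
One way to look at the long-range, high-depth junctions mentioned above is to filter TopHat's `junctions.bed` by intron span and depth. This is only a sketch: the function name `long_junctions` is hypothetical, and it assumes the usual two-block BED12 layout in which column 5 holds the junction depth and column 11 holds the two overhang lengths.

```shell
#!/usr/bin/env bash
# long_junctions BED MIN_SPAN MIN_DEPTH
# Print junctions whose intron is at least MIN_SPAN bp long and whose
# depth (BED score, column 5) is at least MIN_DEPTH.
# Output columns: chrom, intron start, intron end, name, depth, strand, span.
long_junctions() {
  awk -F'\t' -v OFS='\t' -v span="$2" -v depth="$3" '
    /^track/ { next }                   # skip the BED track header line
    {
      split($11, size, ",")             # blockSizes: left and right overhangs
      intron_start = $2 + size[1]       # first intronic base (0-based)
      intron_end   = $3 - size[2]       # end of the intron (exclusive)
      len = intron_end - intron_start
      if (len >= span + 0 && $5 + 0 >= depth + 0)
        print $1, intron_start, intron_end, $4, $5, $6, len
    }' "$1"
}
```

For example, `long_junctions tophat_RABM/junctions.bed 100000 20` would list junctions spanning at least 100 kb with a depth of 20 or more; both thresholds are arbitrary placeholders and need to be chosen for the organism. If most of the suspicious junctions exceed a plausible intron size, TopHat's `-I/--max-intron-length` option may be worth considering alongside the multihit filters.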

alignment RNA-Seq • 4.2k views

updated 3.5 years ago by Ram 45k • written 10.7 years ago by mjoyraj ▴ 80

I think you are just making it complicated. If you have a GTF file, just use it. If you are not interested in novel transcripts, disable novel junction discovery (--no-novel-juncs).

Anyway, quantitative differences like these exist even if you run the tool with the same set of options multiple times.

10.7 years ago by GouthamAtla 12k

Some of the duplicate genes have highly similar sequences. So I think that if I do not filter the multiple hits, some false-positive novel junctions will appear with apparent significance. Although I have a GTF file, the genes I am interested in are mostly predicted. So from the RNA-seq results I am trying to re-annotate these genes and also to check whether they have isoforms.
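
An alternative to restricting multihits at mapping time is to map normally and then keep only uniquely aligned reads afterwards, so the multi-mapped reads stay available for comparison. A minimal sketch, assuming the alignments carry the SAM `NH` tag (number of reported alignments for the read); the function name `unique_hits` is made up for illustration.

```shell
#!/usr/bin/env bash
# unique_hits: read SAM text on stdin, pass header lines through,
# and keep only alignment records tagged NH:i:1 (one reported alignment).
unique_hits() {
  awk -F'\t' '
    /^@/ { print; next }                 # SAM header lines
    {
      for (i = 12; i <= NF; i++)         # optional tags start at column 12
        if ($i == "NH:i:1") { print; break }
    }'
}
```

With samtools installed, it could be used as `samtools view -h accepted_hits.bam | unique_hits | samtools view -b -o unique_hits.bam -`. Note that for highly similar duplicate genes this removes every read that cannot be placed uniquely, so coverage over the shared regions will drop.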

10.7 years ago by mjoyraj ▴ 80