Question

Mapping with Tophat

0

Entering edit mode

8.4 years ago

snp87 ▴ 80

I'm so sorry this might be a very basic question, but can I get some help with the command to map paired end RNAseq data to the mouse genome. I downloaded the reference mouse genome (fa file) and annotation file (gtf file). I indexed the reference genome with bowtie. There were several output files produced (ebwtl format) but is it still the fa file that I use?

This is the command I was trying to use:

tophat -p 4 -G Mus_musculus.GRCm38.84.gtf -o Mus_musculus.GRCm38.dna.toplevel.fa /T/rnaseq_data/Snp-1_S3_R1_001.fastq.gz /T/rnaseq_data/Snp-1_S3_R1_001.fastq.gz

I made a separate directory to save the mapping data. I have the reference genome and annotation files in a separate folder and the RNAseq data in another. Do I have to specify where the reference genome and annotation files are like I do for the RNAseq data? And 4 is the number of cores on my machine right?

Thanks a lot!

RNA-Seq Tophat • 4.7k views

ADD COMMENT • link updated 8.4 years ago by Brian Bushnell 20k • written 8.4 years ago by snp87 ▴ 80

0

Entering edit mode

Use the name (as it appears before the .ebwt extensions) as the index "basename" (genome_index_base). The order you provide the options on the command line is important.

tophat [options]* <genome_index_base> PE_reads_1.fq.gz PE_reads_2.fq.gz

If your files/annotations are in different directories then provide full/relative paths to them in your command.

4 is the number of cores you are asking tophat to use for the mapping.

Do not use the name of your fasta file as the output directory (-o option).

ADD REPLY • link 8.4 years ago by GenoMax 147k

0

Entering edit mode

Thanks so much for the quick reply. There were 6 ebwtl files (X.1.ebwtl, X.2.ebwtl, X.3.ebwtl, X.4.ebwtl, X.rev.1.ebwtl, X.rev.2.ebwtl) so as the genome_index_base do I just add X with no file extension? And if the files are located in /~/T/references should the command be /~/T/references/X?

ADD REPLY • link 8.4 years ago by snp87 ▴ 80

0

Entering edit mode

That is correct. ~~(ignore: need to add padding to pass character lmt)~~

ADD REPLY • link 8.4 years ago by GenoMax 147k

0

Entering edit mode

I'm getting an error: cannot find transcript file /~/T/references/Mus_musculus.GRCm38.dna.toplevel I know that's the correct path of the file. Do you know if there could be anything else that might be the problem?

And I'm not sure what you mean by need to add padding to pass character lmt. Sorry, I'm new to unix so that might be why

ADD REPLY • link 8.4 years ago by snp87 ▴ 80

1

Entering edit mode

I don't think " /~/T/references/Mus_musculus.GRCm38.dna.toplevel" is the correct path. Path should not start with "/~".

Before you run any commands, run head (head file) on every input file to make sure it is there and looks normal.

ADD REPLY • link 8.4 years ago by igor 13k

0

Entering edit mode

Can you post the command you are using?

Biostars needs minimum 20 characters in each post so I had to add some extra characters. Nothing to do with unix.

ADD REPLY • link 8.4 years ago by GenoMax 147k

score 0 · Answer 1 · 2016-06-25

0

Entering edit mode

8.4 years ago

Jeffin Rockey ★ 1.3k

Could you please do as follows:

Use bowtie2

execute : bowtie2-build -f Mus_musculus.GRCm38.dna.toplevel.fa Mus_musculus.GRCm38.dna.toplevel

It would create index files like Mus_musculus.GRCm38.dna.toplevel.1.bt2 ,Mus_musculus.GRCm38.dna.toplevel.rev.1.bt2 etc. (Please notice that the .1.bt2,.rev.1.bt2 etc got suffixed to the second argument that you gave in the command and that itself is the index base name that you would need to give while executing tophat as below .You wont be required to use the actual .fa file as such anymore.)

execute: tophat2 -p 4 -G Mus_musculus.GRCm38.84.gtf Mus_musculus.GRCm38.dna.toplevel data-R1.fastq data-R2.fastq

(You will have the output bam files in folder named tophat_out by default.)

ADD COMMENT • link 8.4 years ago by Jeffin Rockey ★ 1.3k

0

Entering edit mode

Thank you so much, the instructions you gave worked really well. I had used bowtie instead of bowtie2 which resulted in the ebwtl files instead of the bt files.

ADD REPLY • link 8.4 years ago by snp87 ▴ 80

0

Entering edit mode

The alignment only took 1.5 hours though and I've heard it was a lot longer for a lot of other people (3-5 days). For example someone who was using a machine with 12 processors mentioned it took 1 hour for alignment. Based on the alignment rate it seems to be working fine.

Left reads: Input : 4676814 Mapped : 4370224 (93.4% of input) of these: 427856 ( 9.8%) have multiple alignments (16828 have >20) Right reads: Input : 4676814 Mapped : 4250001 (90.9% of input) of these: 418848 ( 9.9%) have multiple alignments (16767 have >20) 92.2% overall read mapping rate.

Aligned pairs: 4086718 of these: 403666 ( 9.9%) have multiple alignments 170501 ( 4.2%) are discordant alignments 83.7% concordant pair alignment rate.

Do you think it's normal that it's aligning so fast?

ADD REPLY • link 8.4 years ago by snp87 ▴ 80

0

Entering edit mode

Yes. Just keep going with your analysis as long as steps are completing without errors.

ADD REPLY • link 8.4 years ago by GenoMax 147k

score 0 · Answer 2 · 2016-06-25

0

Entering edit mode

8.4 years ago

Brian Bushnell 20k

TopHat is an obsolete program and even its developers do not recommend that anyone use it. Rather, they suggest RSEM. I have never tried RSEM, but I doubt it could be worse than TopHat. Personally, I recommend BBMap for RNA-seq, but I'm biased since I developed it. For vertebrates I suggest adding the flags "intronlen=10 maxindel=200000".

This is basically a public-service announcement, not an advertisement. I personally think BBMap is the best splice-aware RNA-seq aligner out there, but there are many others as well. You can use any of them. But please, for the sake of science, don't use TopHat, or anything in the Tuxedo pipeline.

ADD COMMENT • link 8.4 years ago by Brian Bushnell 20k

0

Entering edit mode

HISAT2 is the successor of TopHat2, not RSEM. I wish they wouldn't have declared TopHat2 obsolete. :( I've been using it for years, and it works fine, other than being slow.

I've been trying our other programs, but I really miss TopHat2. I never found any faults with its alignments, and was quite happy to wait a few hours for the results. I just now feel that any analyses results from TopHat2 will be questioned since its developers have declared HISAT2 to be more accurate.

I haven't tried out BBMAP yet, but I really should. I do feel that BBMAP is missing a publication that explains the underlying algorithms, and compares the performance of BBMAP to other programs.

STAR is fast but has huge memory requirements, and a dizzying and confusing array of options. I've been trying out STAR, since it now seems to be the most popular splice-aware aligner.

There has been a few question marks regarding the number of pseudogene counts outputted by HISAT2. I'm not quite sure if this is a real problem yet, so I haven't used HISAT2 yet. Current state-of-the-art for RNA-Seq gene / transcript expression quantification OR just the tools of preference

ADD REPLY • link 8.4 years ago by ablanchetcohen ★ 1.2k

0

Entering edit mode

If you are concerned with memory requirements or complicated options, just skip TopHat/HISAT/STAR entirely. There is a whole new class of RNA-seq aligners that popped up recently that make the whole process a lot quicker and easier without sacrificing accuracy. Try Kallisto: https://pachterlab.github.io/kallisto/starting