I'm so sorry this might be a very basic question, but can I get some help with the command to map paired end RNAseq data to the mouse genome. I downloaded the reference mouse genome (fa file) and annotation file (gtf file). I indexed the reference genome with bowtie. There were several output files produced (ebwtl format) but is it still the fa file that I use?
I made a separate directory to save the mapping data. I have the reference genome and annotation files in a separate folder and the RNAseq data in another. Do I have to specify where the reference genome and annotation files are like I do for the RNAseq data? And 4 is the number of cores on my machine right?
Use the name (as it appears before the .ebwt extensions) as the index "basename" (genome_index_base). The order you provide the options on the command line is important.
Thanks so much for the quick reply. There were 6 ebwtl files (X.1.ebwtl, X.2.ebwtl, X.3.ebwtl, X.4.ebwtl, X.rev.1.ebwtl, X.rev.2.ebwtl) so as the genome_index_base do I just add X with no file extension? And if the files are located in /~/T/references should the command be /~/T/references/X?
I'm getting an error: cannot find transcript file /~/T/references/Mus_musculus.GRCm38.dna.toplevel
I know that's the correct path of the file. Do you know if there could be anything else that might be the problem?
And I'm not sure what you mean by need to add padding to pass character lmt. Sorry, I'm new to unix so that might be why
It would create index files like Mus_musculus.GRCm38.dna.toplevel.1.bt2 ,Mus_musculus.GRCm38.dna.toplevel.rev.1.bt2 etc.
(Please notice that the .1.bt2,.rev.1.bt2 etc got suffixed to the second argument that you gave in the command and that itself is the index base name that you would need to give while executing tophat as below .You wont be required to use the actual .fa file as such anymore.)
Thank you so much, the instructions you gave worked really well. I had used bowtie instead of bowtie2 which resulted in the ebwtl files instead of the bt files.
The alignment only took 1.5 hours though and I've heard it was a lot longer for a lot of other people (3-5 days). For example someone who was using a machine with 12 processors mentioned it took 1 hour for alignment. Based on the alignment rate it seems to be working fine.
Left reads:
Input : 4676814
Mapped : 4370224 (93.4% of input)
of these: 427856 ( 9.8%) have multiple alignments (16828 have >20)
Right reads:
Input : 4676814
Mapped : 4250001 (90.9% of input)
of these: 418848 ( 9.9%) have multiple alignments (16767 have >20)
92.2% overall read mapping rate.
Aligned pairs: 4086718
of these: 403666 ( 9.9%) have multiple alignments
170501 ( 4.2%) are discordant alignments
83.7% concordant pair alignment rate.
Do you think it's normal that it's aligning so fast?
TopHat is an obsolete program and even its developers do not recommend that anyone use it. Rather, they suggest RSEM. I have never tried RSEM, but I doubt it could be worse than TopHat. Personally, I recommend BBMap for RNA-seq, but I'm biased since I developed it. For vertebrates I suggest adding the flags "intronlen=10 maxindel=200000".
This is basically a public-service announcement, not an advertisement. I personally think BBMap is the best splice-aware RNA-seq aligner out there, but there are many others as well. You can use any of them. But please, for the sake of science, don't use TopHat, or anything in the Tuxedo pipeline.
HISAT2 is the successor of TopHat2, not RSEM.
I wish they wouldn't have declared TopHat2 obsolete. :(
I've been using it for years, and it works fine, other than being slow.
I've been trying our other programs, but I really miss TopHat2.
I never found any faults with its alignments, and was quite happy to wait a few hours for the results.
I just now feel that any analyses results from TopHat2 will be questioned since its developers have declared HISAT2 to be more accurate.
I haven't tried out BBMAP yet, but I really should.
I do feel that BBMAP is missing a publication that explains the underlying algorithms, and compares the performance of BBMAP to other programs.
STAR is fast but has huge memory requirements, and a dizzying and confusing array of options.
I've been trying out STAR, since it now seems to be the most popular splice-aware aligner.
If you are concerned with memory requirements or complicated options, just skip TopHat/HISAT/STAR entirely. There is a whole new class of RNA-seq aligners that popped up recently that make the whole process a lot quicker and easier without sacrificing accuracy. Try Kallisto:
https://pachterlab.github.io/kallisto/starting
I was hoping to use STAR for the alignment but the huge memory requirements were posing problems so switched to Tophat instead. I wasn't aware that Tophat was obsolete now. Will look in to other options including HISAT2 and RSEM mentioned - thanks!
Use the name (as it appears before the .ebwt extensions) as the index "basename" (genome_index_base). The order you provide the options on the command line is important.
If your files/annotations are in different directories then provide full/relative paths to them in your command.
4 is the number of cores you are asking tophat to use for the mapping.
Do not use the name of your fasta file as the output directory (
-o
option).Thanks so much for the quick reply. There were 6 ebwtl files (X.1.ebwtl, X.2.ebwtl, X.3.ebwtl, X.4.ebwtl, X.rev.1.ebwtl, X.rev.2.ebwtl) so as the genome_index_base do I just add X with no file extension? And if the files are located in /~/T/references should the command be /~/T/references/X?
That is correct.
(ignore: need to add padding to pass character lmt)I'm getting an error: cannot find transcript file /~/T/references/Mus_musculus.GRCm38.dna.toplevel I know that's the correct path of the file. Do you know if there could be anything else that might be the problem?
And I'm not sure what you mean by need to add padding to pass character lmt. Sorry, I'm new to unix so that might be why
I don't think " /~/T/references/Mus_musculus.GRCm38.dna.toplevel" is the correct path. Path should not start with "/~".
Before you run any commands, run head (
head file
) on every input file to make sure it is there and looks normal.Can you post the command you are using?
Biostars needs minimum 20 characters in each post so I had to add some extra characters. Nothing to do with unix.