First, I am brand new to this forum and brand new to RNAseq; I searched the forums for this, but didn't find another question similar enough to answer it.
I have 2 control files and 2 treatment files (RNA sequencing). The files are old enough that they are unstranded and not paired-end (hence all 4 are distinct samples).
I trimmed the files with Trimmomatic, and was going to perform alignment with TopHat2 next. Our cluster has all the software installed (Bowtie2, samtools, etc.).
I downloaded and unzipped the UCSC hg18 Bowtie2 indexes from here: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
So, the questions.
- Do I need to run all the files through one at a time?
- If I run them through TH2 separately, do I have to specify a different output folder for each of the 4 submissions?
I ran into the thread issue with TH2 last night. I specified -p 8 (8 threads), and the submission crapped out 1 hour in:
Searching for junctions via segment mapping [FAILED]
Error: segment-based junction search failed with err =1
Error: could not get read# 9850246 from stream!
I then specified -p 1 and it ran, but took 6 hours. If someone knows a good sbatch parameter list to prevent this, I would greatly appreciate it.
Lastly, I got one warning:
Checking for reference FASTA file
Warning: Could not find FASTA file /locationofbowtieindexes/hg18.fa
Do I need to put a genome.fa file there from here (http://support.illumina.com/sequencing/sequencing_software/igenome.html, hg38 link under Homo sapiens)?
This is my current submission script:
#!/bin/sh
SAMPLE_ID=trim_Mitchell_P2D-F2.fastq
GENE_REF=filepathtobowtie2index
P=1 # number of threads; 8 triggered the junction-search error above
tophat2 -o tophat_out -p $P $GENE_REF pathtoRNAsequencefastafile/$SAMPLE_ID
Thanks so much for your help and sorry for my noobness!
Thank you b.nota. Question number 3: I had threads = 8 and got the error. I looked for others who had this issue, and they suggested using a single thread to solve it. It then ran without any problems, but took ~6 hours.
Thanks for reminding me that hg18 is not hg38. Can you shed any light on the warning that I got, and what TH2 is looking for? I assumed it was a full reference genome (not Bowtie2-indexed), but I wasn't sure. If that is the case, does it need to be in the same directory as my indexes?
Lastly, it sounds like you are saying that this does need to be 4 runs separately, but that I could loop them in sequence with a script. I wanted to make sure that TH2 couldn't align all 4 at once somehow more efficiently before I submitted 4 separate jobs.
Thanks again.
I am not an expert in TopHat, just another user, but I always use the maximum number of threads available on my machine. I never get errors like yours...
There should be a genome.fa or hg38.fa file in your index folder, or a link to it. If you downloaded the index from the iGenomes website, it should be in there.
I always use a loop when I want to map my fq files all at once (in one go that is). You can try to map them in parallel, I never tried that.
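For what it's worth, such a loop could look roughly like the sketch below. It reuses the placeholder paths from the submission script above; the `trim_*.fastq` file pattern, the `out_dir_for` helper and the output folder naming are just assumptions for illustration. Each sample gets its own output folder, since every TopHat2 run writes its results under a fixed set of file names inside `-o`. As far as I know, TopHat2 does accept several read files as a comma-separated list, but it then treats them as one sample and produces one combined alignment, which is not what you want for control vs. treatment. If you submit this through sbatch, reading the thread count from `SLURM_CPUS_PER_TASK` at least keeps `-p` in line with what the job was actually allocated; I can't say whether that has anything to do with the error you saw.

```shell
#!/bin/sh
#SBATCH --cpus-per-task=8   # assumption: adjust to what your cluster allows

GENE_REF=filepathtobowtie2index          # placeholder, as in the script above
FASTQ_DIR=pathtoRNAsequencefastafile     # placeholder, as in the script above

# Use the CPU count SLURM allocated; fall back to 1 outside of SLURM.
P=${SLURM_CPUS_PER_TASK:-1}

# Hypothetical helper: derive a per-sample output folder from the fastq name.
out_dir_for() {
    echo "tophat_out_$(basename "$1" .fastq)"
}

# One TopHat2 run per sample, each with its own output folder.
for fq in "$FASTQ_DIR"/trim_*.fastq; do
    [ -e "$fq" ] || continue             # skip if the pattern matched nothing
    tophat2 -o "$(out_dir_for "$fq")" -p "$P" "$GENE_REF" "$fq"
done
```

With the file from the original script, the output would end up in `tophat_out_trim_Mitchell_P2D-F2`.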
You could also run bowtie2-inspect on your index and save the output as $index.fa.
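A minimal sketch of that command, assuming `index` holds the basename of the Bowtie2 index (the placeholder path from the submission script above, without the `.1.bt2` etc. suffixes) and that `bowtie2-inspect` is on your PATH:

```shell
index=filepathtobowtie2index   # placeholder: basename of the Bowtie2 index
fasta_out="$index.fa"          # TopHat2 looks for <index basename>.fa

# By default bowtie2-inspect writes the indexed sequences as FASTA to stdout.
if command -v bowtie2-inspect >/dev/null 2>&1; then
    bowtie2-inspect "$index" > "$fasta_out"
else
    echo "bowtie2-inspect not found on PATH" >&2
fi
```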
This way, you get exactly the same naming of chromosomes as in your index, which is not guaranteed if you download a fasta file.
Additionally, TopHat2 gives better alignment rates if you provide the transcriptome index (see the corresponding section here).
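A rough sketch of how that can be done with the `-G` and `--transcriptome-index` options. The annotation file `genes.gtf` and the prefix `transcriptome_index/known` are hypothetical names, and the GTF has to be for the same build (hg18 here) with the same chromosome names as the Bowtie2 index. The first call, without reads, only builds the transcriptome index; later runs reuse it.

```shell
GENE_REF=filepathtobowtie2index        # placeholder, as in the script above
GTF=genes.gtf                          # hypothetical annotation for the same build
TX_INDEX=transcriptome_index/known     # hypothetical prefix for the index files

if [ -f "$GTF" ] && command -v tophat2 >/dev/null 2>&1; then
    # Build the transcriptome index once (no reads given).
    tophat2 -G "$GTF" --transcriptome-index="$TX_INDEX" "$GENE_REF"

    # Reuse it for each sample.
    tophat2 -o tophat_out -p 8 --transcriptome-index="$TX_INDEX" \
        "$GENE_REF" pathtoRNAsequencefastafile/trim_Mitchell_P2D-F2.fastq
else
    echo "Need an existing GTF file and tophat2 on PATH" >&2
fi
```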