I ran exactly the same tophat command on exactly the same files on two different systems, just to compare their performance.
tophat -p xx --library-type fr-firststrand -o CT16_frfs -r 150 --mate-std-dev 75 --no-mixed --transcriptome-index ./bwtindex/Transcriptome2 ./bwtindex/Dre_nuclear_2 R1_P.fastq R2_P.fastq
What I see is that the output file sizes are different!!
Details of the run
Run1
- System: Workstation Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
- RAM: 24GB
- OS: Fedora-20 64bit kernel 3.14.4-200
- Used 7 cores.
Output file sizes:
5654399668 accepted_hits.bam
565 align_summary.txt
6790723 deletions.bed
5987915 insertions.bed
19286867 junctions.bed
186 prep_reads.info
1923179692 unmapped.bam
Run2
- System: HPC with a single unit Intel(R) Xeon(R) CPU E7- 8837 @ 2.67GHz
- RAM: 1000GB
- OS: EL6 64bit kernel 2.6.32-279.5.1
- Used 30 cores
Output file sizes:
6617365952 accepted_hits.bam
567 align_summary.txt
6804941 deletions.bed
6000226 insertions.bed
19329598 junctions.bed
186 prep_reads.info
2446203410 unmapped.bam
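To put numbers on the discrepancy, here is a minimal Python sketch that compares the two listings file by file. The byte counts are copied from the listings above; the function name `size_differences` is just an illustrative choice. Keep in mind that BAM files are compressed, so byte size is only a rough proxy for content; the read counts in align_summary.txt are a more direct comparison.

```python
# File sizes in bytes, copied from the Run1 and Run2 listings above.
run1 = {
    "accepted_hits.bam": 5654399668,
    "align_summary.txt": 565,
    "deletions.bed": 6790723,
    "insertions.bed": 5987915,
    "junctions.bed": 19286867,
    "prep_reads.info": 186,
    "unmapped.bam": 1923179692,
}
run2 = {
    "accepted_hits.bam": 6617365952,
    "align_summary.txt": 567,
    "deletions.bed": 6804941,
    "insertions.bed": 6000226,
    "junctions.bed": 19329598,
    "prep_reads.info": 186,
    "unmapped.bam": 2446203410,
}


def size_differences(a, b):
    """Return {file name: (byte difference b - a, percent change relative to a)}."""
    return {
        name: (b[name] - a[name], 100.0 * (b[name] - a[name]) / a[name])
        for name in a
    }


diffs = size_differences(run1, run2)

# Print the files with the largest relative change first.
for name, (delta, pct) in sorted(diffs.items(), key=lambda kv: -abs(kv[1][1])):
    print(f"{name:20s} {delta:>12d} bytes  {pct:+7.2f}%")
```

The two BAM files differ by roughly 17% and 27%, while the BED files differ by well under 1%, and prep_reads.info is identical in size.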
Finally, when I run cuffdiff after cufflinks and cuffmerge, the list of significantly differentially expressed genes is different, though not grossly: Run2 has more genes (~236) than Run1 (~207). I still haven't checked the differences that might arise during the cufflinks run (I would have to run cufflinks on the Run1 files on the HPC to see that).
I don't want to stop the analysis over this and would proceed with the larger file, but I would like to know the reason for this discrepancy.
Oh, if I use one core it will take an eternity to finish. The fastq file is huge.
You can try it on a small subset. Since you have already done the alignment, try to identify a gene where the two files differ in read counts. Then extract all the reads aligned to that gene from both files and rerun the whole analysis with only those reads.
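One way to pick such a gene, sketched in Python. This assumes you already have per-gene read counts for each run as two-column, tab-separated text (gene ID, count) from whatever counting tool you prefer; the gene names and counts below are made-up toy values, and `read_counts` and `rank_by_difference` are hypothetical helper names.

```python
import csv
import io


def read_counts(handle):
    """Parse two-column, tab-separated text: gene ID, read count."""
    return {row[0]: int(row[1]) for row in csv.reader(handle, delimiter="\t") if row}


def rank_by_difference(counts_a, counts_b):
    """Return (gene, count_a, count_b) for genes whose counts differ,
    largest absolute difference first. A gene missing from one table counts as 0."""
    genes = set(counts_a) | set(counts_b)
    rows = [(g, counts_a.get(g, 0), counts_b.get(g, 0)) for g in genes]
    rows = [r for r in rows if r[1] != r[2]]
    return sorted(rows, key=lambda r: (-abs(r[1] - r[2]), r[0]))


# Made-up toy tables standing in for the real per-gene counts of each run.
run1_text = "geneA\t100\ngeneB\t50\ngeneC\t7\n"
run2_text = "geneA\t100\ngeneB\t65\ngeneC\t9\ngeneD\t3\n"

ranked = rank_by_difference(
    read_counts(io.StringIO(run1_text)),
    read_counts(io.StringIO(run2_text)),
)

for gene, c1, c2 in ranked:
    print(f"{gene}\trun1={c1}\trun2={c2}\tdiff={c2 - c1}")
```

With real data you would replace the `io.StringIO` objects with open file handles. The gene at the top of the list is a reasonable candidate for the small test case.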
Or, more simply, if you are using human, supply only the chrY reference (or any part of the reference that is small enough to run quickly with one thread).