TopHat mapping error
1
5.6 years ago • gokberk ▴ 90

Hi all,

I've been trying to analyze old SOLiD-seq data with TopHat-1.4.1/Bowtie-1.0.1/samtools-0.1.19. However, when I run ./tophat bowtie_index/macaca_fascicularis_5.0_genome fastq/E08/SRR2930200.fastq, I receive the following output:

[Fri Apr  5 17:48:47 2019] Beginning TopHat run (v1.4.1)
-----------------------------------------------
[Fri Apr  5 17:48:47 2019] Preparing output location ./tophat_out/
[Fri Apr  5 17:48:47 2019] Checking for Bowtie index files
[Fri Apr  5 17:48:47 2019] Checking for reference FASTA file
[Fri Apr  5 17:48:47 2019] Checking for Bowtie
    Bowtie version:          1.0.1.0
[Fri Apr  5 17:48:47 2019] Checking for Samtools
    Samtools Version: 0.1.19
[Fri Apr  5 17:48:47 2019] Generating SAM header for ../bowtie_index/macaca_fascicularis_5.0_genome
    format:      fastq
    quality scale:   phred33 (default)
[Fri Apr  5 17:48:49 2019] Preparing reads
    left reads: min. length=51, count=756267
[Fri Apr  5 17:49:05 2019] Mapping left_kept_reads against macaca_fascicularis_5.0_genome with Bowtie 

gzip: stdout: Broken pipe
[Fri Apr  5 17:49:07 2019] Processing bowtie hits
Warning: junction database is empty!
[Fri Apr  5 17:50:55 2019] Processing bowtie hits
    [FAILED]
Error executing: /home/goekberk/tophat-1.4.1.Linux_x86_64/bam_merge ./tophat_out/tmp/left_kept_reads.candidates_and_unspl.bam ./tophat_out/tmp/left_kept_reads.candidates.bam ./tophat_out/tmp/left_kept_reads.unspl.bam

Here are what log files say:

bowtie.left_kept_reads.fixmap.log:

Reads file contained a pattern with more than 1024 quality values. Please truncate reads and quality values and and re-run Bowtie terminate called after throwing an instance of 'int'

long_spanning_read.log:

long_spanning_reads v1.4.1 (exported)
--------------------------------------------
Opening ./tophat_out/tmp/left_kept_reads.bwtout.z for reading
Loading reference sequences...
        reference sequences loaded.
Loading spliced hits...done
Loading junctions...done
Loading deletions...done

prep_reads.log:

prep_reads v1.4.1 (exported)
---------------------------
0 out of 756267 reads have been filtered out

sam_merge.log:

Warning: no input BAM records found.
GList error (GList.hh:970):Invalid list index: 0

Since I'm not familiar with RNA-seq data analysis, I'm not sure how to fix this issue. Any help is appreciated.

Cheers!

bowtie tophat rna-seq • 2.3k views
ADD COMMENT
3

Are you really sure you need to use TopHat? And such an old version?

Did you look at all the quality score characters used in your fastq? It looks like TopHat might not be handling colorspace FASTQ files correctly.
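
If it helps, here is a quick, untested way to eyeball that from the shell (the path is just the one from your command). The first command lists every distinct quality character in the first ~100k records; the second flags records whose quality string does not match the sequence length or exceeds the 1024 characters Bowtie complains about.

    # List every distinct quality character in the first ~100k records:
    head -n 400000 fastq/E08/SRR2930200.fastq | awk 'NR % 4 == 0' | fold -w1 | sort -u | tr -d '\n'; echo

    # Flag records whose sequence and quality lengths disagree, or whose
    # quality string is longer than 1024 characters:
    awk 'NR % 4 == 2 { slen = length($0) }
         NR % 4 == 0 && (length($0) != slen || length($0) > 1024) {
             print "record", NR/4, "seq len", slen, "qual len", length($0)
         }' fastq/E08/SRR2930200.fastq | head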

ADD REPLY
1

Go for HISAT2; it is faster and much better than TopHat v1.
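
Something along these lines should work (a sketch only; check hisat2 --help for your version, and note that HISAT2 maps base-space reads, so it will not read colorspace data directly):

    # Build the HISAT2 index once (the index prefix is just a placeholder name):
    hisat2-build macaca_fascicularis_5.0_genome.fa macaca_fascicularis_5.0_hisat2

    # Single-end alignment of the run, SAM written to the file given with -S:
    hisat2 -p 8 -x macaca_fascicularis_5.0_hisat2 -U fastq/E08/SRR2930200.fastq -S SRR2930200.sam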

ADD REPLY
0

Yeah, you might use HISAT2 or STAR instead of TopHat.

ADD REPLY
5
5.6 years ago • h.mon 35k

I will try again:

As you have SOLiD reads, you need a colorspace aligner; you should probably use Subread, which is, as far as I know, the only currently maintained aligner that supports colorspace mapping. Converting colorspace to basespace is a bad idea; see Convert colorspace fastq to basespace fastq and the references therein.
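
Roughly, the colorspace workflow with Subread looks like the sketch below. The flags are from memory, so double-check the subread-buildindex and subread-align usage output (or the Subread users guide) before relying on them.

    # Build a colorspace index (-c) from the reference FASTA:
    subread-buildindex -c -o macaca_fascicularis_5.0_cs_index macaca_fascicularis_5.0_genome.fa

    # Map the SOLiD reads against it; -t 0 marks the input as RNA-seq data
    # and -T sets the number of threads:
    subread-align -t 0 -T 8 -i macaca_fascicularis_5.0_cs_index \
        -r fastq/E08/SRR2930200.fastq -o SRR2930200.bam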

ADD COMMENT
0

Hi h.mon, thanks for your response. I saw this last time, but when I checked the GEO page of the dataset I'm trying to analyze, I saw that people had previously used these old versions of TopHat and Bowtie to analyze it. So I thought I should go for those versions to be safe (I should also mention that I skipped adapter trimming and went directly to mapping). In any case, I'll try Subread as well, thanks!

ADD REPLY
0

So, I downloaded Subread-1.6.4 and compiled it on my server, and I have another question. I've been trying to generate a genome index with the ./subread-buildindex -c -F -o macaca_fascicularis_5.0_index ../../bowtie_index/macaca_fascicularis_5.0_genome.fa command and received the fancy output below:

        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
      v1.6.4

//================================= setting ==================================\\
||                                                                            ||
||                Index name : macaca_fascicularis_5.0_index                  ||
||               Index space : color space                                    ||
||                    Memory : 8000 Mbytes                                    ||
||          Repeat threshold : 100 repeats                                    ||
||              Gapped index : no                                             ||
||                                                                            ||
||               Input files : 1 file in total                                ||
||                             o macaca_fascicularis_5.0_genome.fa            ||
||                                                                            ||
\\============================================================================//

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||

However, it has been stuck at this point for about two hours now, so I was wondering if something is wrong. What is the approximate time for generating a genome index with Subread? The genome assembly I'm indexing is 3 GB.

ADD REPLY
1

As long as the process is consuming CPU cycles, you need to be patient. Index creation can take a while on big genomes.
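
A quick way to check, assuming the process is still called subread-buildindex (adjust the name if not):

    # One-off snapshot of elapsed time, CPU time, CPU%, memory use and state:
    ps -C subread-buildindex -o pid,etime,time,%cpu,%mem,rss,stat

    # Or watch it interactively:
    top -p "$(pgrep -d, -f subread-buildindex)"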

ADD REPLY
0

Thanks a lot genomax, I've just checked it; it wasn't using any CPU, so I stopped and restarted it.

This is how it looks now. Do you think it looks okay-ish, or is it using an abnormal amount of memory? I'm asking because the VIRT and RES numbers looked pretty scary to me. Is this the reason why the job dies after a while?

ADD REPLY
0

Okay, it is proceeding now; hopefully it will manage to generate the index.

    [same banner and settings as in the previous output]

//================================= Running ==================================\\
||                                                                            ||
|| Check the integrity of provided reference sequences ...                    ||
|| No format issues were found                                                ||
|| Scan uninformative subreads in reference sequences ...                     ||
|| 601617 uninformative subreads were found.                                  ||
|| These subreads were excluded from index building.                          ||
|| Build the index...                                                         ||

Thanks!

ADD REPLY
0

How much memory do you have on this machine? It is not unusual to need ~30 GB of RAM for genomes the size of the human genome. If your job runs out of memory, you should see some indication of that.
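
You can check what the machine has with, for example:

    # Total / used / free memory in gigabytes:
    free -g

    # Or straight from the kernel:
    grep MemTotal /proc/meminfo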

ADD REPLY
0

I guess the total memory is 528361056 K (about 504 GB) on the server I'm working with.

ADD REPLY
1

Then you should be all set. Allow time for the indexing to complete. Check the logs to make sure there were no errors.
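
Once it finishes, listing the output files is an easy sanity check (the exact file names vary between Subread versions, but they all share the prefix you chose):

    ls -lh macaca_fascicularis_5.0_index*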

ADD REPLY
