7.0 years ago
williamsbrian5064
I am getting this error when I try to run TopHat on some sequencing data. I was wondering if anyone had any solutions to the problem?
./tophat -p 1 -G dmel-all-r6.18.gtf -o test.bam dmel_genome_6.18 read_1.fastq read_2.fastq
[2017-11-14 14:59:57] Beginning TopHat run (v2.1.0)
-----------------------------------------------
[2017-11-14 14:59:57] Checking for Bowtie
Bowtie 2 not found, checking for older version..
Bowtie version: 1.1.2.0
[2017-11-14 14:59:57] Checking for Bowtie index files (genome)..
[2017-11-14 14:59:57] Checking for reference FASTA file
Warning: Could not find FASTA file dmel_genome_6.18.fa
[2017-11-14 14:59:57] Reconstituting reference FASTA file from Bowtie index
Executing: /Users/kmmeurs/Desktop/Programs/tophat-2.1.0.OSX_x86_64/bowtie-inspect dmel_genome_6.18 > test.bam/tmp/dmel_genome_6.18.fa
[2017-11-14 15:00:07] Generating SAM header for dmel_genome_6.18
[2017-11-14 15:00:07] Reading known junctions from GTF file
[2017-11-14 15:00:12] Preparing reads
[FAILED]
Error running 'prep_reads'
Error: qual length (95) differs from seq length (125) for fastq record !
Here are the first few lines of one of the fastq files as well:
docsmb17:tophat-2.1.0.OSX_x86_64 kmmeurs$ head A31P_MYBPC3_Female_1_week_CTATAC_L007_R1_C9MM3ANXX.fastq -C
==> A31P_MYBPC3_Female_1_week_CTATAC_L007_R1_C9MM3ANXX.fastq <==
@HISEQ:249:C9MM3ANXX:7:1101:1733:2241 1:N:0:CTATAC
CGACAATCTTGCATGGCCGCGACTTCAGCNNNNNNNNNNNGTTTTTGCGCAATGCCGAACATTGCATGGGATAGGTCGTCGATGCGCCGGAATCCGTGGTCTCGAAATGATCGTCCAACTCAGCC
+
A=3BBGGGGGGGGGGGGGGGGDGGGGGGF###########==<EFGGEGG@GGGEDGGGGGGGCFCGGGD0ECBFGDGGGGGFGGGBGGG@AGG@CGGDEEB@D/6.C8EDEGGGD<EGGGGGGG
@HISEQ:249:C9MM3ANXX:7:1101:1803:2233 1:N:0:CTATAC
CTTAAAATAATTAATGTGTGTATTNNNNNNNNNNNNNNNNNNCACACACTAGAAATATACTTTGCCATCCATTAGGTGAAGGCCTAATCCAAGGCCTCCCTACCATGGATTGGCACAGATAAATT
+
CCCCCGGGGGGGGGGGGGGGGGGG##################===FGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGFGEFGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGGGGGGGGGDGGG
@HISEQ:249:C9MM3ANXX:7:1101:1772:2234 1:N:0:CTATAC
TTCTCCTCCTCGGAGTCGCTGTAAANNNNNNNNNNNNNNNNTGACGGCTTTTGTTTACAATCCACCTTCTTTTTAATTTCTTCCTCATTGTAACCCGGAGGTGGAACGGGGGTAAGAGAGCGCCT
I saw another post similar to this but I couldn't figure out what they did to fix the problem (https://www.biostars.org/p/110412/). Any help would be fantastic! Thanks!!
The error indicates that something is wrong with your fastq file.
You should also know that the old 'Tuxedo' pipeline of TopHat and Cufflinks is no longer the advisable tool for RNA-seq analysis. The software is deprecated or in low maintenance, and should be replaced by HISAT2, StringTie and Ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are other alternatives as well, including alignment with STAR or BBMap, or pseudo-alignment using Salmon.
Is there any way to fix the fastq file? Thanks for the advice by the way! I would have been struggling with "Tuxedo" nonsense for days.
You'll first have to figure out which file is corrupt, and then why. Do you have the original data available? Which steps were taken before this attempted alignment?
I'm not entirely sure about that one. I am helping someone out on a project. They ran samples on an Illumina HiSeq so I'm assuming they got a large file that was then demultiplexed. It looks like the barcodes have been trimmed as well. The files were transferred to my external hard drive and I then transferred the files to my computer.
I could try getting the data again from my colleague?
That's worth trying indeed.
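Before re-running anything on a fresh copy, it may be worth confirming that the file on your computer is byte-for-byte identical to the one your colleague has, since a transfer over an external drive can silently truncate a file. Here is a minimal sketch in Python, assuming you can run it on both machines (or that your colleague can send you the checksum). The function names `file_checksum` and `same_content` are made up for this example.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, read in chunks so large FASTQ files fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str, algorithm: str = "md5") -> bool:
    """True when both files have the same checksum (hypothetical helper)."""
    return file_checksum(path_a, algorithm) == file_checksum(path_b, algorithm)
```

If the checksum of your copy differs from the checksum computed on the original, the copy was damaged somewhere along the way and should be transferred again.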
You were right about the file being corrupt. I took it out of the command line and TopHat started working. That is nice to know when I try running HISAT2. Thanks for all the help!
I tried the HISAT, StringTie, and Ballgown method today but I got a bit stuck at the R portion of it. I can't find much about the method really. I was wondering if you had any links?
The paper contains a lot of R code, is that helpful? Or did you already check that?
I tried their R script, got to step 9 and got blocked. They even have a troubleshooting section that identifies the same error that I'm getting (the Ballgown function results in an error that the first column of pData does not match the names of the folders containing the ballgown data). I couldn't get past it... I felt like RStudio was a bit more corporative which could have given me a bit more problems?
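For what it's worth, the mismatch that error describes can be looked at outside of R. The following is only a rough sketch in Python, not part of the protocol: it assumes the phenotype table is a CSV file with a header row and the sample IDs in the first column, and that each sample has its own sub-folder inside one data directory. The file layout and the function names are assumptions for illustration.

```python
import csv
import os
from typing import Dict, List


def read_sample_ids(pheno_csv: str) -> List[str]:
    """Return the first-column values of the phenotype CSV, skipping the header row."""
    with open(pheno_csv, newline="") as handle:
        rows = list(csv.reader(handle))
    return [row[0] for row in rows[1:] if row]


def list_sample_dirs(data_dir: str) -> List[str]:
    """Return the names of the sub-folders found directly inside data_dir."""
    return sorted(
        name for name in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, name))
    )


def compare_ids(pheno_csv: str, data_dir: str) -> Dict[str, List[str]]:
    """Report IDs without a folder and folders without an ID (order is not checked)."""
    ids = set(read_sample_ids(pheno_csv))
    dirs = set(list_sample_dirs(data_dir))
    return {
        "ids_without_folder": sorted(ids - dirs),
        "folders_without_id": sorted(dirs - ids),
    }
```

If both lists come back empty, the names agree and the remaining suspect is the order of the rows, which this sketch does not check.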
I would suggest opening a separate question, containing your problem, the code you used and the errors you get. Please be as complete as possible.
Try validateFiles from Kent Utilities to find the broken fastq record.
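If installing an extra tool is a hassle, the same kind of check can be sketched in a few lines of Python. This is not what validateFiles does internally, just a minimal illustration of the check TopHat is complaining about. It assumes the usual four-lines-per-record FASTQ layout (no wrapped sequences); `find_bad_records` is a name invented for this example.

```python
import gzip
from typing import Iterator, TextIO, Tuple


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_bad_records(path: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (line number of the record's header, header, problem) for suspicious records.

    Assumes exactly four lines per record; wrapped FASTQ is not handled.
    """
    with open_fastq(path) as handle:
        line_no = 0
        while True:
            header = handle.readline()
            if not header:
                break  # end of file
            seq = handle.readline()
            plus = handle.readline()
            qual = handle.readline()
            start = line_no + 1
            line_no += 4
            header, seq, plus, qual = (
                s.rstrip("\r\n") for s in (header, seq, plus, qual)
            )
            if not (header or seq or plus or qual):
                continue  # ignore stray blank lines, e.g. at the end of the file
            if not header.startswith("@"):
                yield start, header, "header does not start with '@'"
            elif not plus.startswith("+"):
                yield start, header, "third line does not start with '+'"
            elif len(seq) != len(qual):
                yield start, header, (
                    f"qual length ({len(qual)}) differs from seq length ({len(seq)})"
                )
```

Looping over `find_bad_records("read_1.fastq")` and printing each tuple shows where the trouble starts. Once a record is truncated the four-line framing can shift, so everything after the first hit may be reported too; the first reported record is the one to inspect.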
Does it have to do with the index file? I had to generate my own?
Hi
I am also getting a similar error when running TopHat:
Error: qual length (114) differs from seq length (126) for fastq record !
Please suggest some solution. Any help is much appreciated.
Thanks
Please do not use the SUBMIT ANSWER window unless you are providing an answer to the original question.
It looks like your fastq file has at least one record which seems to be malformed (where the number of bases and Q scores don't match). I suggest that you run fastQValidator.