trouble parsing problem filterbytile
4.1 years ago

Hello, I have a problem with filterbytile. I am trying to analyze the data from GSE52778 with filterbytile.sh, and I get a "Trouble parsing header" error.

When I try:

filterbytile.sh in1=SRR1039508_1.fastq in2=SRR1039508_2.fastq out1=SRR1039508_1_filtre.fastq out2=SRR1039508_2_filtre.fastq

I obtain:

java -ea -Xmx6678m -Xms6678m -cp /XXXX/XXXX/bin/bbmap/current/ hiseq.AnalyzeFlowCell in1=SRR1039508_1.fastq in2=SRR1039508_2.fastq out1=SRR1039508_1_filtre.fastq out2=SRR1039508_2_filtre.fastq
Executing hiseq.AnalyzeFlowCell [in1=SRR1039508_1.fastq, in2=SRR1039508_2.fastq, out1=SRR1039508_1_filtre.fastq, out2=SRR1039508_2_filtre.fastq]

Set INTERLEAVED to false
Loading kmers:      205.432 seconds.
Filling tiles:      Trouble parsing header SRR1039508.1.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
java.lang.AssertionError: SRR1039508.1.1 HWI-ST177:290:C0TECACXX:1:1101:1225:2130 length=63
    at hiseq.IlluminaHeaderParser.parseInt(IlluminaHeaderParser.java:149)
    at hiseq.IlluminaHeaderParser.parseCoordinates(IlluminaHeaderParser.java:71)
    at hiseq.IlluminaHeaderParser.parse(IlluminaHeaderParser.java:55)
    at hiseq.FlowCell.getMicroTile(FlowCell.java:144)
    at hiseq.AnalyzeFlowCell.fillTilesInner(AnalyzeFlowCell.java:641)
    at hiseq.AnalyzeFlowCell.fillTiles(AnalyzeFlowCell.java:380)
    at hiseq.AnalyzeFlowCell.process(AnalyzeFlowCell.java:316)
    at hiseq.AnalyzeFlowCell.main(AnalyzeFlowCell.java:51)

thanks for your help

C


I did it (fastq-dump -F XXXXXXXXX) and I obtain the same thing:

Loading kmers: 7.484 seconds.

**Filling tiles:    Trouble parsing header HWI-ST177:290:C0TECACXX:1:1101:1225:2130**

java.lang.StringIndexOutOfBoundsException: String index out of range: 40
    at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:47)
    at java.base/java.lang.String.charAt(String.java:693)
    at hiseq.IlluminaHeaderParser.goBackSeveralColons(IlluminaHeaderParser.java:133)
    at hiseq.IlluminaHeaderParser.parseCoordinates(IlluminaHeaderParser.java:70)
    at hiseq.IlluminaHeaderParser.parse(IlluminaHeaderParser.java:55)
    at hiseq.FlowCell.getMicroTile(FlowCell.java:144)
    at hiseq.AnalyzeFlowCell.fillTilesInner(AnalyzeFlowCell.java:641)
    at hiseq.AnalyzeFlowCell.fillTiles(AnalyzeFlowCell.java:380)
    at hiseq.AnalyzeFlowCell.process(AnalyzeFlowCell.java:316)
    at hiseq.AnalyzeFlowCell.main(AnalyzeFlowCell.java:51)
4.1 years ago
GenoMax 147k

It appears that this data was submitted to SRA with the index information stripped from the headers. To get around that, you should first use reformat.sh to add 1: and 2: to the FASTQ headers.

You can do that with:

 reformat.sh addcolon=t in1=SRR1039508_1.fastq in2=SRR1039508_2.fastq out1=test1.fastq out2=test2.fastq

That will give you

@HWI-ST177:290:C0TECACXX:1:1101:2225:2087 1:
AACAAGAAGAGTTCTCTGAAAGGCAATGAGAAAGAGAAGGAGAAACAACAGCGGGAGAAGGAT
+
HJJJJJJJJJJFIIIJJJJJJJJIJJJJHJJJJJJJJJJJJJJJIJJJJIJJHHFDDDDDDDD

You can then run filterbytile.sh with these intermediate files.

filterbytile.sh in1=test1.fastq in2=test2.fastq out1=final_R1.fastq.gz out2=final_R2.fastq.gz
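Before running either step, it can help to look at what the first header in each file actually looks like. The following is a rough sketch only: `classify_header` is a hypothetical helper name, and its three patterns are assumptions based on the headers shown in this thread, not on what the BBTools header parser really requires.

```shell
# classify_header: hypothetical helper (not part of BBTools).
# Reads a FASTQ file on stdin and reports on its first header line only.
# The patterns are assumptions drawn from the headers seen in this thread.
classify_header() {
    head -n 1 | awk '
        # Accession prefix such as "@SRR1039508.1 " or "@SRR1039508.1.1 "
        /^@[SED]RR[0-9]+(\.[0-9]+)+ / { print "sra-prefix"; exit }
        # Colon-separated name followed by " 1:" or " 2:"
        /^@[^ ]+(:[0-9]+)+ [12]:/     { print "illumina-with-read-number"; exit }
        # Colon-separated name with nothing after it
        /^@[^ ]+(:[0-9]+)+$/          { print "illumina-no-read-number"; exit }
        # Anything else
                                      { print "unknown" }
    '
}
```

For example, `classify_header < SRR1039508_1.fastq` should print `sra-prefix` for the file in the original post, and `classify_header < test1.fastq` should print `illumina-with-read-number` once reformat.sh has added the read number.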

Maybe this info will be helpful for those using SRA files. For a header like

@SRR1030982.1 HWI-ST1143-137:4:1101:1069:2548 length=96

the following shell command corrects it so that it is accepted by filterbytile.sh:

sed -r 's/^@SRR[0-9]{7}\.[0-9]* /@/' x.fastq | sed '/@/s/ length.*$/ 1:/' > xrepair.fastq

and as a result we have

@HWI-ST1143-137:4:1101:1069:2548 1:

This is for single-end data, so it needs a slight modification for paired-end data (append 2: instead of 1: for the second read of each pair). It is quite possibly not the optimal command, but it works.
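A paired-end variant could look roughly like the sketch below. It swaps sed for awk so that only the header line of each record is touched, and it takes the read number as an argument. `fix_sra_headers` is a hypothetical helper name, and the sketch assumes plain four-line FASTQ records (no wrapped sequence lines) and headers of the `@SRR... name length=N` form shown above.

```shell
# fix_sra_headers: hypothetical helper, not part of any toolkit.
# $1 is the read number to append (1 or 2); FASTQ is read from stdin.
# Assumes 4-line records, so only lines 1, 5, 9, ... are treated as headers.
# Quality lines that happen to begin with '@' are therefore left alone.
fix_sra_headers() {
    awk -v mate="$1" '
        NR % 4 == 1 {
            sub(/^@[SED]RR[0-9]+(\.[0-9]+)+ /, "@")  # drop the accession prefix
            sub(/ length=[0-9]+$/, "")               # drop the trailing length field
            $0 = $0 " " mate ":"                     # append the read number
        }
        { print }
    '
}
```

It would be run once per file, for example `fix_sra_headers 1 < x_1.fastq > x_1.repair.fastq` and `fix_sra_headers 2 < x_2.fastq > x_2.repair.fastq` (file names are placeholders).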

Thanks GenoMax for inspiration


For the example you posted (SRR1030982), using the -F option when you dump the reads should restore the Illumina-format headers. No editing should be required.


Oh, I thought that changing the header was a hardcoded 'feature' of fasterq-dump from the SRA Toolkit. Thanks! I'll remember that next time.


As a follow-up, I want to add that the -F option would not help.

The manual states:

Usage:
  fasterq-dump <path> [options]
  fasterq-dump <accession> [options]

Options:
  -F|--format                      format (special, fastq, default=fastq)

So we can choose fastq or a different format, and fastq is the default. There is no other option related to the header. In fact, the whole concept of SRA files looks overly complicated and unusable to me.


The -F option belongs to the fastq-dump program (not fasterq-dump). I should have stated that explicitly above.
