Modification of fastq header
7.6 years ago
seta ★ 1.9k

Hi all,

I'm trying to use the pal_finder script, but it returns the error "Non-valid paired end read", even though my fastq files are paired. I think the problem is related to the fastq header. The header of the example data (one of the PE reads) looks like this:

@ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/1

and the header of one of my fastq PE reads is here:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1

Could you please help me modify the headers of my fastq reads so that they match the format of the example data?

Thanks in advance

header fastq read modification • 4.7k views
ADD COMMENT
7.6 years ago
Charles Plessy ★ 2.9k

In pal_finder's source code, one can see that:

Read names from the same pair are required to be identical,

sub validPEread {
    my $title1 = shift;
    my $title2 = shift;
[cut parts for brevity]
    return 0 unless ($title1 eq $title2);

And failures are reported by an error message where /1 and /2 are added to the read names.

if (not validPEread ($title1, $title2, $seq1, $seq2, $qual1, $qual2) ) {
    print "Non-valid paired end read:\n$title1/1:$seq1:$qual1\n$title2/2:$seq2:$qual2\n";
    exit(1);
}

In your reply to shenwei356, you show an error message with read names ending in /1/1 and /2/2. Since pal_finder appends one /1 or /2 itself when printing the error, the names must already end in /1 and /2 in your source files, so they differ. Differing that way is totally valid, but pal_finder was last updated 5 years ago and does not expect this naming convention. Perhaps a command such as sed '/^[@+]/s/\/[12]$//' will help you remove the trailing parts of the read names that make them differ.
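A slightly more conservative variant of that idea, as a sketch only: the function below removes a trailing /1 or /2 from the first line of each four-line record and nothing else, so a quality line that happens to start with @ is never touched. The function name strip_mate_suffix and the file names in the usage comment are made up for illustration, and it assumes plain four-line fastq records (no wrapped sequence or quality lines).

```shell
# Remove a trailing /1 or /2 from read names only.
# Assumption: every record is exactly 4 lines, so headers are lines 1, 5, 9, ...
strip_mate_suffix() {
    awk 'NR % 4 == 1 { sub(/\/[12]$/, "") } { print }' "$@"
}

# Hypothetical usage (file names are placeholders):
#   strip_mate_suffix old_1.fq > new_1.fq
#   gzip -d -c old_2.fq.gz | strip_mate_suffix | gzip -c > new_2.fq.gz
```

Whether pal_finder accepts the resulting names still has to be checked on a small subset first.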

ADD COMMENT

Thanks, you're right that read names from the same pair must be identical. I manually removed /1 and /2 from a very small data set and the script worked. However, the headers in the script's example data look like this:

@ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/1 and @ILLUMINA-545855_0049_FC61RLR:2:1:8899:1514#0/2

But when I changed the end of my headers from /1 or /2 to #0/1 or #0/2, as in the example data, the same error appeared. So I have to remove /1 and /2 from my headers, right? Could you please let me know how I can do that?

Thank you

ADD REPLY
7.6 years ago
gzip -d -c old_1.fq.gz | sed 's/ /-/g' | gzip -c > new_1.fq.gz

Replace gzip with pigz if you have pigz, which is much faster.

ADD COMMENT

Thank you for your reply. I tried both with this modification and with the original headers. I don't know why the header changes while the script runs; that is probably the problem. The actual error is:

Non-valid paired end read:
SRR707811.1-FCD0CDRABXX:4:1101:1290:2174/1/1:TCAGCATCAGTGACAGAGGGCCAGCAGAACGAGCAGTGACAAGACAGGTGGGGCCTGGCTCCCCCCCCGCCAGCTCCANNNNNNCCCCTTGCTGCATCTG:eeeddeeeeeeeeecddd\dbdWVdddWcdc_c`bd_dcdeeeaec^c^^cdbddL_]^^ddddUdddVdBBBBBBBBBBBBBBBBBBBBee\eedecee
SRR707811.1-FCD0CDRABXX:4:1101:1290:2174/2/2:TAACTCCTCCTGGGAAAATAATCCTGTTGGAGTTGGGGGCTCTTCCCAGTTGTCTGGTTAGTTGGCCCAGGAAGGGGCAG:dae\ddddcd\ddddefdWffegffdefbd`bZ\c`O_]ZX]]L]b]acbcbZccd\Tdf_]]SbZ^__^fae^BBBBBB

As you can see, the end of the header changed to 2174/1/1 or 2174/2/2. Why was /1 or /2 added? Could you please help me with this issue? This problem did not occur with the example data.

ADD REPLY

There is probably a space in the fastq header. Try:

sed '/^@SRR/ s/ /_/g' infile >outfile
ADD REPLY

Your solution didn't work; the same error appeared. Any suggestions, please?

ADD REPLY

Can you provide the first four lines of the fastq file, both for the forward and reverse?
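Something along these lines would show them and also give a quick check of the names. This is only a sketch: first_read_name and names_match are hypothetical helpers, the file names are placeholders, and the comparison assumes that pal_finder tests the two read names for plain string equality, as in the validPEread excerpt quoted in the answer above.

```shell
# Name of the first read in a fastq file, without the leading "@"
# and without a trailing /1 or /2.
first_read_name() {
    head -n 1 "$1" | sed -e 's/^@//' -e 's|/[12]$||'
}

# Succeeds if the first reads of both files have the same name
# once the mate suffix is ignored.
names_match() {
    [ "$(first_read_name "$1")" = "$(first_read_name "$2")" ]
}

# Hypothetical usage (file names are placeholders):
#   head -n 4 reads_1.fastq reads_2.fastq
#   names_match reads_1.fastq reads_2.fastq && echo "names agree" || echo "names differ"
```

It only looks at the first record of each file, so it is a sanity check rather than a full validation of the pairing.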

ADD REPLY

Yes, it's here for forward:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1
TCAGCATCAGTGACAGAGGGCCAGCAGAACGAGCAGTGACAAGACAGGTGGGGCCTGGCTCCCCCCCCGCCAGCTCCANNNNNNCCCCTTGCTGCATCTG
+
eeeddeeeeeeeeecddd\dbdWVdddWcdc_c`bd_dcdeeeaec^c^^cdbddL_]^^ddddUdddVdBBBBBBBBBBBBBBBBBBBBee\eedecee

and for reverse:

@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/2
TAACTCCTCCTGGGAAAATAATCCTGTTGGAGTTGGGGGCTCTTCCCAGTTGTCTGGTTAGTTGGCCCAGGAAGGGGCAG
+
dae\ddddcd\ddddefdWffegffdefbd`bZ\c`O_]ZX]]L]b]acbcbZccd\Tdf_]]SbZ^__^fae^BBBBBB
ADD REPLY

Having spaces in fastq headers may be another issue. If you had fastq-dumped this data using the -F option (to recover the original Illumina headers), you would not have the extra SRR707811.1 bit in your headers.
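If re-dumping the data is not convenient, the extra token can also be removed after the fact. A rough sketch, assuming four-line records and headers of the form shown above (accession token, one space, then the original read name); drop_sra_prefix is a made-up name and the file names are placeholders:

```shell
# Drop everything between "@" and the first space on header lines,
# e.g. "@SRR707811.1 FCD0CDRABXX:4:1101:1290:2174/1"
# becomes "@FCD0CDRABXX:4:1101:1290:2174/1".
# Headers without a space are left unchanged.
drop_sra_prefix() {
    awk 'NR % 4 == 1 { sub(/^@[^ ]+ /, "@") } { print }' "$@"
}

# Hypothetical usage (file names are placeholders):
#   drop_sra_prefix SRR707811_1.fastq > renamed_1.fastq
```

This removes the space as a side effect, but the names still end in /1 and /2, so the trailing part discussed in the answer above has to be dealt with separately.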

ADD REPLY
