How to modify the description lines of a FASTQ file so that line 1 matches line 3?
7.0 years ago
majeedaasim ▴ 60

I am using paired-end SRA data; it looks like this:

@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGCACGCTTTCACTGTGAACGGTTCGGCATCGAGAAGGACGCAGTTCCTCTCCGGCTTGGACCAGTTTCTGGTGGCCACGGCTGCCCCCATCC
+SRR1188607.1 HWI-ST915_0064:2:1101:1420:2104 length=100
HDHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHFHFFHHHHHHHHEHEHHHHHHHGHHHED?EE=A@BACDDECCE@DB74?############

@HWI-ST915_0064:2:1101:1498:2108/1
AAAGATTGCAATGGAGGAGAAAGGGAAGACCCTGCCTGAAGAAATGCAATTGATAAATAAGTTGTTGTCTGAGGAAAAGGGTTCGGAGAGGATGAGAATG
+SRR1188607.2 HWI-ST915_0064:2:1101:1498:2108 length=100
GFHHHHHHHHHHHHHHHHHHHHHGHGHHHHHHHHHHGHHHHGHHHHHHDHGGHHFHHHHHGEGGFGGFGFHFHHDHHGHHGFGFGHGFHHHHHHHHHHHH

The other read file is similar, except with /2 at the end.

For further analysis I need the description on line 1 to match line 3 (after the +), but line 3 carries extra information such as SRR1188607.1, SRR1188607.2, etc. How can I get rid of it so that line 3 matches the description on line 1? Alternatively, how can I delete everything after the +?

Thanks

fastq header modification • 2.2k views

I wonder why you need to do this... most programs ignore the 3rd line...


For example, the biopython parser throws an error if the third line is present and not equal to the first line. AFAIK per convention the third line has to be equal to the first, or absent.

7.0 years ago
awk '(NR%4==3) {i=index($0," ");printf("+%s\n",substr($0,i+1));next;}{print}' input.fastq
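
The command above drops the SRR ID but keeps the trailing length=100 and does not add /1. A minimal sketch of an alternative is to copy the name from line 1 onto line 3, assuming strict four-line records with no blank lines between them. The file names sample_1.fastq and fixed_1.fastq are made up for illustration, and the sequences are shortened to 10 bases.

```shell
# Build a two-record sample resembling the question's data
# (hypothetical file names, sequences truncated for brevity).
cat > sample_1.fastq <<'EOF'
@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGC
+SRR1188607.1 HWI-ST915_0064:2:1101:1420:2104 length=100
HDHHHHHHHH
@HWI-ST915_0064:2:1101:1498:2108/1
AAAGATTGCA
+SRR1188607.2 HWI-ST915_0064:2:1101:1498:2108 length=100
GFHHHHHHHH
EOF

# Line 1 of each record: remember the name without its leading "@".
# Line 3 of each record: print "+" followed by the remembered name.
# Every other line is printed unchanged.
awk 'NR % 4 == 1 { name = substr($0, 2) }
     NR % 4 == 3 { print "+" name; next }
     { print }' sample_1.fastq > fixed_1.fastq

cat fixed_1.fastq
```

Because the name is taken from line 1, the same command should work unchanged on the /2 file.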
7.0 years ago

The pattern to remove is

SRR[some digits][a dot][more digits][a space]


Use sed combined with a regex (the dot is escaped and the read number may have more than one digit):

sed 's/SRR[0-9]*\.[0-9]* //g'

Input

[Desktop]$ cat test.fq 
@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGCACGCTTTCACTGTGAACGGTTCGGCATCGAGAAGGACGCAGTTCCTCTCCGGCTTGGACCAGTTTCTGGTGGCCACGGCTGCCCCCATCC
+SRR1188607.1 HWI-ST915_0064:2:1101:1420:2104 length=100
HDHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHFHFFHHHHHHHHEHEHHHHHHHGHHHED?EE=A@BACDDECCE@DB74?############

Output

[Desktop]$ sed 's/SRR[0-9]*\.[0-9]* //g' test.fq
@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGCACGCTTTCACTGTGAACGGTTCGGCATCGAGAAGGACGCAGTTCCTCTCCGGCTTGGACCAGTTTCTGGTGGCCACGGCTGCCCCCATCC
+HWI-ST915_0064:2:1101:1420:2104 length=100
HDHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHFHFFHHHHHHHHEHEHHHHHHHGHHHED?EE=A@BACDDECCE@DB74?############


This would still leave the line 1 header not matching line 3, so, as @Wouter said above, it will cause problems with some programs.


I agree; the "length=100" is indeed part of the problem, so the lines don't match.


Ah, removing that as well is not a big task. sed once again, my friend; just add sed 's/length=[0-9]*//' :)

[Desktop]$ sed 's/SRR[0-9]*\.[0-9]* //g' test.fq | sed 's/length=[0-9]*//'
@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGCACGCTTTCACTGTGAACGGTTCGGCATCGAGAAGGACGCAGTTCCTCTCCGGCTTGGACCAGTTTCTGGTGGCCACGGCTGCCCCCATCC
+HWI-ST915_0064:2:1101:1420:2104 
HDHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHFHFFHHHHHHHHEHEHHHHHHHGHHHED?EE=A@BACDDECCE@DB74?############

Is that /1 an issue? If yes, let me know.


Yes, it will be. For many tools, lines 1 and 3 need to match, or line 3 needs to be empty after the +.
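
One way to check that rule before running downstream tools is to count the records whose third line is neither a bare + nor + followed by the name from line 1. This is only a sketch: it assumes four-line records with no blank lines, and the file check_me.fastq, its read names, and the ID SRR0000000.3 are invented for the example.

```shell
# Three toy records: matching line 3, bare "+", and a non-matching line 3.
cat > check_me.fastq <<'EOF'
@read1/1
ACGT
+read1/1
HHHH
@read2/1
ACGT
+
HHHH
@read3/1
ACGT
+SRR0000000.3 read3 length=4
HHHH
EOF

# Remember the name from line 1, then flag line 3 when it is neither "+"
# nor "+" followed by that name. "n + 0" prints 0 if nothing was flagged.
mismatches=$(awk 'NR % 4 == 1 { name = substr($0, 2) }
                  NR % 4 == 3 && $0 != "+" && $0 != ("+" name) { n++ }
                  END { print n + 0 }' check_me.fastq)

echo "records with a non-matching third line: $mismatches"
```

A count of zero after editing suggests the file satisfies the convention described above.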

7.0 years ago
Tm ★ 1.1k

To remove everything after '+' on the third line, you can use a simple sed one-liner (anchored with ^ so that only lines starting with +SRR1188607 are touched):

sed "s/^+SRR1188607.*/+/" in_file.fastq > out_file.fastq
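
A position-based variant may be safer than matching on the accession, since a quality line can itself start with +. The sketch below assumes four-line records with no blank lines and reuses the in_file.fastq / out_file.fastq names from the command above; the record is shortened and its quality string is altered to start with + on purpose.

```shell
# One shortened record whose quality line happens to begin with "+".
cat > in_file.fastq <<'EOF'
@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGC
+SRR1188607.1 HWI-ST915_0064:2:1101:1420:2104 length=100
+DHHHHHHHH
EOF

# Replace every third line of a record with a bare "+";
# all other lines, including the quality line, pass through unchanged.
awk 'NR % 4 == 3 { print "+"; next } { print }' in_file.fastq > out_file.fastq

cat out_file.fastq
```

This does not depend on the accession number, so it should not need editing for other SRA runs.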

To properly format code, first select the text you want to format and then click the '101' button in the editor window.


Thanks toralmanvar, it works. But what if the SRR... ID is present at the beginning of the first line, as in

SRR1188607.1@HWI-ST915_0064:2:1101:1498:2108/1

I want to get rid of these SRR IDs so that the description line begins with @HWI-.


Dump the reads with the -F option, which keeps only the original read name in the header.

fastq-dump -F --split-files SRR1188607

Then add 1: and 2: to the read headers using reformat.sh from the BBMap suite.

reformat.sh in1=SRR1188607_1.fastq in2=SRR1188607_2.fastq out1=R1.fq out2=R2.fq addcolon=t

To remove SRR1188607.1 from the beginning, you can use:

sed "s/SRR1188607.1@HWI-ST915/@HWI-ST915/g" in_file.fastq >out_file.fastq

Here the SRR1188607.1@HWI-ST915 part is replaced by @HWI-ST915 in your first header line.


I tried it, but the description stayed the same; it had no effect.

Also, I should mention that the change needs to apply to all reads, not to a single read as in the code you posted.

Thanks
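
For reference, a pattern that is not tied to one read number could look like the sketch below. It rests on assumptions: records are four lines each with no blank lines, and the header is either in the form quoted above (SRR ID directly followed by @) or in the more common fastq-dump form (@SRR ID, a space, then the original name). The file names prefixed.fastq and stripped.fastq are made up, and mixing both header forms in one file is done only to show both cases.

```shell
# Two shortened records, one per assumed header form.
cat > prefixed.fastq <<'EOF'
SRR1188607.1@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGC
+
HDHHHHHHHH
@SRR1188607.12 HWI-ST915_0064:2:1101:1498:2108/1
AAAGATTGCA
+
GFHHHHHHHH
EOF

# On header lines only: strip an optional "@", the SRR accession, the dot,
# the read number, and the following "@" or space, leaving a single "@".
awk 'NR % 4 == 1 { sub(/^@?SRR[0-9]+\.[0-9]+[ @]/, "@") }
     { print }' prefixed.fastq > stripped.fastq

cat stripped.fastq
```

Restricting the substitution to line 1 of each record keeps quality strings from being altered by accident.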


Use the method I had posted above.

7.0 years ago
$ sed 's/^\(\+\).*\(HWI.*\)/\1\2/g' test.fq

This also leaves the line 1 header not matching line 3, so, as @Wouter said above, it will cause problems with some programs.


Output (for the other read file, replace /1 with /2):

$ sed 's/^\(\+\).*\(HWI.*\)\s.*/\1\2\/1/g' test.fq

or

$ sed 's/\bSRR[0-9]*\.[0-9]* \b//;s/\b len.*\b/\/1/' test.fq
@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGCACGCTTTCACTGTGAACGGTTCGGCATCGAGAAGGACGCAGTTCCTCTCCGGCTTGGACCAGTTTCTGGTGGCCACGGCTGCCCCCATCC
+HWI-ST915_0064:2:1101:1420:2104/1
HDHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHFHFFHHHHHHHHEHEHHHHHHHGHHHED?EE=A@BACDDECCE@DB74?############

@HWI-ST915_0064:2:1101:1498:2108/1
AAAGATTGCAATGGAGGAGAAAGGGAAGACCCTGCCTGAAGAAATGCAATTGATAAATAAGTTGTTGTCTGAGGAAAAGGGTTCGGAGAGGATGAGAATG
+HWI-ST915_0064:2:1101:1498:2108/1
GFHHHHHHHHHHHHHHHHHHHHHGHGHHHHHHHHHHGHHHHGHHHHHHDHGGHHFHHHHHGEGGFGGFGFHFHHDHHGHHGFGFGHGFHHHHHHHHHHHH

input:

$ cat test.fq 
@HWI-ST915_0064:2:1101:1420:2104/1
GTCTCTTCGCACGCTTTCACTGTGAACGGTTCGGCATCGAGAAGGACGCAGTTCCTCTCCGGCTTGGACCAGTTTCTGGTGGCCACGGCTGCCCCCATCC
+SRR1188607.1 HWI-ST915_0064:2:1101:1420:2104 length=100
HDHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHHHFHFFHHHHHHHHEHEHHHHHHHGHHHED?EE=A@BACDDECCE@DB74?############

@HWI-ST915_0064:2:1101:1498:2108/1
AAAGATTGCAATGGAGGAGAAAGGGAAGACCCTGCCTGAAGAAATGCAATTGATAAATAAGTTGTTGTCTGAGGAAAAGGGTTCGGAGAGGATGAGAATG
+SRR1188607.2 HWI-ST915_0064:2:1101:1498:2108 length=100
GFHHHHHHHHHHHHHHHHHHHHHGHGHHHHHHHHHHGHHHHGHHHHHHDHGGHHFHHHHHGEGGFGGFGFHFHHDHHGHHGFGFGHGFHHHHHHHHHHHH