How To Parse Fastq File?
6
1
Entering edit mode
11.9 years ago
SK ▴ 110

I am new in NGS. I need to format some fastq files. For example I want to replace characters /1 and /2 with -1 and - 2 in the files having data like @HWUSI-EAS1671_0001:5:1:1022:10290/1. I tried to do it in perl with pattern matching but the files are too big to parse. Could anyone please give me suggestion on share any code?

fastq • 8.7k views
ADD COMMENT
5
Entering edit mode

Well, files are never too big to parse with Perl if you don't keep the file in RAM.

ADD REPLY
0
Entering edit mode

note: as far as I can see, all the examples below assume that it takes four lines for one FASTQ record .

ADD REPLY
0
Entering edit mode

This is generally the case with fastq parsers. I'd be curious if you have an alternative. One of the flaws of fastq in my opinion is that each record is spread over multiple lines with no guaranteed record separator

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

I made a tool to convert the nonstandard multi-line fastq files into 4-line entries. Fastq files should be 4 lines per entry because they are too difficult to parse otherwise. https://sourceforge.net/p/cg-pipeline/code/425/tree/cg_pipeline/branches/lkatz/scripts/run_assembly_convertMultiFastqToStandard.pl

ADD REPLY
3
Entering edit mode

While I agree multi-line fastq is inconvenient, it is not nonstandard as no official documentations invalidate multi-line fastq. Seqtk is probably the most efficient way so far to convert between multi-line and 4-line fastq.

ADD REPLY
0
Entering edit mode

Are there cases where there are fewer lines for a FASTQ record? (I write quickie .fq parsers every now and again, so I'm genuinely curious.)

ADD REPLY
4
Entering edit mode
11.9 years ago
Irsan ★ 7.8k

Many times, you can use UNIX commands do file slicing and dicing tasks.

Consider the file old.fastq

@HWI-ST1019:196:D121WACXX:5:1101:1538:2300/1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
@HWI-ST1019:196:D121WACXX:5:1101:1538:2300/2
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################

you can use the stream editor (sed) to search and replace /1 for -1

sed 's:\/\([12]\):-\1:g' test.fastq

that will give you the output

@HWI-ST1019:196:D121WACXX:5:1101:1538:2300-1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
@HWI-ST1019:196:D121WACXX:5:1101:1538:2300-2
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@##############################################
ADD COMMENT
1
Entering edit mode

What happens here if you have '/1' in the quality strings ?

ADD REPLY
0
Entering edit mode

Hmm, that also gets replaced: then use this to change any sequence id that ens with /1 to -1 assuming you have 1 sequence id every 4 lines

awk '{if(NR%4==1){gsub(/\/1$/,"-1");print $0}else{print $0}}' test.fastq

and for /2 to -2

awk '{if(NR%4==1){gsub(/\/2$/,"-2");print $0}else{print $0}}' test.fastq
ADD REPLY
0
Entering edit mode

I personally rather use part of the header, such as "HWI" when trying to match header than relying on 1 sequence always spread over 4 lines.

ADD REPLY
0
Entering edit mode
ADD REPLY
4
Entering edit mode
11.9 years ago
JC 13k

As @lh3 pointed, you don't to read the full file in memory in Perl, just read line by line and process if required:

perl -plane 's/\/(\d)/-$1/ if (m/\@HWI/)' < file.fastq > new_file.fastq

ADD COMMENT
4
Entering edit mode
11.9 years ago

Some code golf for a Friday afternoon? Here is a perl one liner,

perl -pi.bak -e '/^@H/ && s#/([12])$#-\1#' filename

ADD COMMENT
0
Entering edit mode

beaten to the send button

ADD REPLY
3
Entering edit mode
11.9 years ago
toni ★ 2.2k

Hi, Whatever language you choose the files will remain big anyway, but I do not think it's worth to write a C code just for this. I do not know on which operating system you are working on, but supposing that you are under Linux (or equivalent) you can have a look at the sed command.

You must pay attention to the fact that the characters '/', '1' and '2' are in the range of those used for encoding quality. So when you do a substitution, you have to make sure that it is indeed a line of read name and not a quality string.

Take this example (file example.fasq):

@HWI-ST1019:196:/121WACXX:5:1101:1538:2300/1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######

There are 3 '/1' appearances : 2 in the read name, one in the quality string, you only want the one at the end of the read name to be modified.

Then execute (assuming all your read names begin with HW):

sed "/^@HW/ s/\/1$/-1/g" example.fastq

And it produces :

@HWI-ST1019:196:/121WACXX:5:1101:1538:2300-1
CCGCGACCTCTGTTCTGCAGCCCCTTCCCTTCCCCGCCTCCTGCTCTGCCGGGACTACGCACCGGCCTGATTGGTTACCCCCGGGGTGTCCTCGGTCACCA
+
1+++4)<@<A<+2A9A2++:3C8:)1?BDBDBBDC@::6@(.8..7)777:<?@@######################################/1######

Is it REALLY necessary that you read such a huge file just to substitute a '/' with a '-' ?

ADD COMMENT
2
Entering edit mode
11.9 years ago
Nikolay Vyahhi ★ 1.3k

In Python:

for line in open('file.fastq'):
  print line.replace('/1', '-1').replace('/2', '-2')
ADD COMMENT
1
Entering edit mode
11.9 years ago

I find that modules like seqIO in BioPython are convenient for this kind of thing. Especially when you start needing to do other things than just change one character in each header. Of course there are analogues in HTSeq, BioPerl etc.

ADD COMMENT

Login before adding your answer.

Traffic: 1633 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6