How Can I Split Paired-End FASTQ Files?
11.8 years ago
samsara ▴ 630

I have paired-end RNA-Seq FASTQ files:

SampleName_R1.fastq
SampleName_R2.fastq

I want to split both FASTQ files to get the following files:

SampleName_R1-part1.fastq
SampleName_R1-part2.fastq

SampleName_R2-part1.fastq
SampleName_R2-part2.fastq

How can I split SampleName_R1.fastq so that, if read_X ends up in SampleName_R1-part1.fastq, its mate ends up in SampleName_R2-part1.fastq? To make it clearer:

read_X -> SampleName_R1-part1.fastq
read_P -> SampleName_R1-part2.fastq

read_X mate -> SampleName_R2-part1.fastq
read_P mate -> SampleName_R2-part2.fastq

How can I achieve this? Are there any tools that split FASTQ files in this fashion?

fastq mapping next-gen • 14k views

Hi, Samsara,

I have the same question. Have you found an effective way to do this?

Best,

Xiao

11.8 years ago
fo3c ▴ 450

Split both files so that the number of lines in each file is a multiple of 4. For example, to split both files into chunks of 100,000 lines:

$ split -l 100000 SampleName_R1.fastq SampleName_R1_split_
$ split -l 100000 SampleName_R2.fastq SampleName_R2_split_

You can check that it worked by reading the first line of an R1 file and seeing that it matches the first line of the corresponding R2 file.
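The check described above could be scripted roughly as follows. This is only a sketch: it builds a tiny made-up pair of files in a scratch directory so that it runs on its own, and it assumes read names end in `/1` and `/2` or carry the mate information after a space. Adjust the name clean-up if your headers look different.

```shell
# Build a tiny demo pair (4 reads each) in a scratch directory; the read
# names and sequences are invented for illustration.
workdir=$(mktemp -d)
cd "$workdir"
for i in 1 2 3 4; do
    printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> SampleName_R1.fastq
    printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> SampleName_R2.fastq
done

# Split into chunks of 8 lines (2 records); any multiple of 4 works.
split -l 8 SampleName_R1.fastq SampleName_R1_split_
split -l 8 SampleName_R2.fastq SampleName_R2_split_

# Compare the first read name of each R1 chunk with its R2 counterpart,
# dropping anything after a space and a trailing /1 or /2.
status=OK
for r1 in SampleName_R1_split_*; do
    r2=$(printf '%s\n' "$r1" | sed 's/_R1_/_R2_/')
    id1=$(head -n 1 "$r1" | cut -d' ' -f1 | sed 's#/[12]$##')
    id2=$(head -n 1 "$r2" | cut -d' ' -f1 | sed 's#/[12]$##')
    [ "$id1" = "$id2" ] || status=MISMATCH
done
echo "$status"
```

Checking only the first record of each chunk is a quick sanity test, not proof that every record is paired.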


OK, I was writing my answer at the same time :)

11.8 years ago
toni ★ 2.2k

Normally, if read_X is the Nth record in SampleName_R1.fastq, then its mate is the Nth record in SampleName_R2.fastq.

So you can simply split both files by taking the same number of lines (a multiple of 4) from each FASTQ file (I am assuming you have 4-line-style FASTQ files).

For basic splitting in this fashion, have a look at the Linux command split.

For instance, to split your files in two: let N be the number of lines in each FASTQ file (it should be the same for both), so you have N/4 FASTQ records. Take K = (E[N/8] + 1)*4 lines for the first part, where E[x] is the integer part of x, and put the rest in the second part.

split -l K SampleName_R1.fastq SampleName_R1_part
split -l K SampleName_R2.fastq SampleName_R2_part

More generally, to split into more FASTQ files, give the -l option a multiple of 4 representing the maximum number of lines you want in each file.
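As a rough sketch of the two-way split, K can be computed with shell integer arithmetic, where `/` already truncates like E[...]. The demo files below are made up (5 reads each) so the snippet runs on its own; with real data you would skip the `printf` loop.

```shell
# Hypothetical demo pair with 5 records (20 lines) per file.
workdir=$(mktemp -d)
cd "$workdir"
for i in 1 2 3 4 5; do
    printf '@read%s/1\nACGT\n+\nIIII\n' "$i" >> SampleName_R1.fastq
    printf '@read%s/2\nTGCA\n+\nIIII\n' "$i" >> SampleName_R2.fastq
done

# N = number of lines; K = (E[N/8] + 1) * 4, always a multiple of 4.
N=$(( $(wc -l < SampleName_R1.fastq) ))
K=$(( (N / 8 + 1) * 4 ))

# Use the same K for both files so that mates land in matching parts.
split -l "$K" SampleName_R1.fastq SampleName_R1_part
split -l "$K" SampleName_R2.fastq SampleName_R2_part
wc -l SampleName_R1_part* SampleName_R2_part*
```

Note that when N is a multiple of 8, this formula puts one record more than an exact half into the first part. The split is still valid, just not perfectly even.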

T.


Yes, I have 4-line-style FASTQ files. I have 95622055 lines in SampleName_R1.fastq and 95269156 lines in SampleName_R2.fastq. Wouldn't this create a problem when splitting?


Yes, it will create a problem. You need the same number of lines in each FASTQ file, containing the matched FASTQ records. Where do your FASTQ files come from? Has any preprocessing already been done on them?
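A quick pre-flight check along these lines can catch the mismatch before splitting. This is a sketch only: `check_pair` is a hypothetical helper name, and the two demo files are invented so that the snippet is self-contained.

```shell
# Hypothetical demo inputs: R1 has 2 records, R2 has only 1.
workdir=$(mktemp -d)
printf '@a/1\nACGT\n+\nIIII\n@b/1\nACGT\n+\nIIII\n' > "$workdir/R1.fastq"
printf '@a/2\nTGCA\n+\nIIII\n' > "$workdir/R2.fastq"

# check_pair R1 R2: report whether both files have a line count that is
# a multiple of 4 and whether the two counts agree.
check_pair() {
    n1=$(( $(wc -l < "$1") ))
    n2=$(( $(wc -l < "$2") ))
    if [ $((n1 % 4)) -ne 0 ] || [ $((n2 % 4)) -ne 0 ]; then
        echo "line count is not a multiple of 4"
    elif [ "$n1" -ne "$n2" ]; then
        echo "line counts differ"
    else
        echo "ok"
    fi
}

result=$(check_pair "$workdir/R1.fastq" "$workdir/R2.fastq")
echo "$result"
```

Equal line counts are necessary but not sufficient: the records must also be in the same order in both files.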


They are from an Illumina HiSeq machine. I have not done any preprocessing. I am searching for fusion genes in these RNA-Seq data. Since a single compressed FASTQ file is about 28 GB, I need to split it because I had memory issues using large files with TopHat-Fusion. Do you have any idea how I can remove reads that do not have mates?


I do not know of an existing tool that removes orphan reads. I would do it myself by parsing both files at the same time and removing a record as soon as I find it has no mate. You may have a better chance of getting an answer on this point by opening a new thread (something might exist to do this).
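One possible way to drop orphan reads is sketched below with awk. It is not the simultaneous parse described above but a simpler two-pass lookup: collect the read names of one file, then keep only the records of the other file whose name was seen. The helper `keep_paired`, the file names, and the demo reads are all made up for illustration, and the sketch assumes 4-line records whose names end in `/1` or `/2` or carry the mate information after a space.

```shell
# Hypothetical demo: read2 is missing from R2, read3 is missing from R1.
workdir=$(mktemp -d)
cd "$workdir"
printf '@read1/1\nACGT\n+\nIIII\n@read2/1\nACGT\n+\nIIII\n@read4/1\nACGT\n+\nIIII\n' > R1.fastq
printf '@read1/2\nTGCA\n+\nIIII\n@read3/2\nTGCA\n+\nIIII\n@read4/2\nTGCA\n+\nIIII\n' > R2.fastq

# keep_paired OTHER THIS: print the records of THIS whose read name also
# occurs in OTHER. Names are compared up to the first space, with a
# trailing /1 or /2 removed.
keep_paired() {
    awk '
        function read_id(h) { sub(/ .*/, "", h); sub(/\/[12]$/, "", h); return h }
        # First file: remember the name on every header line (line 1 of 4).
        NR == FNR { if (FNR % 4 == 1) seen[read_id($0)] = 1; next }
        # Second file: decide at each header whether to keep the record.
        FNR % 4 == 1 { keep = (read_id($0) in seen) }
        keep
    ' "$1" "$2"
}

keep_paired R2.fastq R1.fastq > R1.paired.fastq
keep_paired R1.fastq R2.fastq > R2.paired.fastq

# Show the surviving read names in each output file.
awk 'FNR % 4 == 1' R1.paired.fastq R2.paired.fastq
```

Because both outputs keep the original record order, they stay in step with each other as long as the inputs were in the same relative order. The trade-off is memory: all read names of one file are held in an awk array, which may be substantial for files with tens of millions of reads.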


I generated only two FASTQ files after processing the raw data from the sequencer. This eliminates the headache of splitting a single FASTQ file, but thanks a lot for your help.


Hi, Toni,

What is "E" in your K=(E[N/8] + 1)*4 equation?

Regards, Xiao
