Question

Merge FastQ Files

0

6.9 years ago

shounak.chakraborty1990 ▴ 20

I have 7 FastQ files and I want to merge them into one in the following way:

>File1 line1

>File1 line2

>File1 line3

>File1 line4

>File2 line1

>File2 line2

>File2 line3

>File2 line4

>File3 line1

>File3 line2

>File3 line3

>File3 line4

>.

>.

>.

>File7 line1

>File7 line2

>File7 line3

>File7 line4

>File1 line5

>File1 line6
>.

>.

Although I came up with a solution using sed and awk, it takes an extremely long time to finish, since each of these files contains raw reads from an RNA-seq experiment and is roughly 3.6 GB.

Are there any better ways to merge fastQ files like this?

Thanks, Shounak

RNA-Seq fastQ • 4.5k views

ADD COMMENT • link updated 6.0 years ago by Biostar 20 • written 6.9 years ago by shounak.chakraborty1990 ▴ 20

1

Why is the reads' order so important? If the read ID is ordered per file, you can try:

cat *fastq | paste - - - - | sort -k1 | sed 's/\t/\n/g'

If the read ID contains the barcode, you may need to fiddle around.

ADD REPLY • link 6.9 years ago by michael.ante ★ 3.9k

0

It is important because I am going to run Kallisto on the merged file. Kallisto estimates the fragment length distribution, but it uses only a certain number of reads from the top of the file to do that. So if I use cat, the reads in the other files are not used to estimate the fragment length. I need equal representation from all the files at each position in my merged file.

ADD REPLY • link 6.9 years ago by shounak.chakraborty1990 ▴ 20

0

Then replace the sort command with shuf. You won't get your requested order but a random one (which is, AFAIK, more suited for Kallisto or Salmon). [edit] You can also have a look here
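For anyone who would rather do the shuffle in Python, here is a minimal sketch of the same idea: group every four lines into one record, shuffle the records, write them back out. The function names (`read_records`, `shuffle_fastq`) and the `seed` parameter are made up for illustration. It assumes strict 4-line FASTQ records and uncompressed input, and it holds every record in memory, so it is only practical when the combined files fit in RAM.

```python
import random
from itertools import islice


def read_records(handle):
    """Yield each 4-line FASTQ record as a single string (assumes strict 4-line records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        yield "".join(record)


def shuffle_fastq(paths, out, seed=None):
    """Read all records from all files, shuffle them as whole records, and write them to out."""
    records = []
    for path in paths:
        with open(path) as handle:
            records.extend(read_records(handle))
    # Shuffling whole records (not lines) keeps header, sequence, '+', and quality together.
    random.Random(seed).shuffle(records)
    out.writelines(records)
```

Calling `shuffle_fastq(["file1.fastq", "file2.fastq"], sys.stdout)` (after `import sys`) would print the shuffled reads; the file names here are placeholders.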

ADD REPLY • link 6.9 years ago by michael.ante ★ 3.9k

0

Unless there's a quite significant difference between the files, it won't matter whether the fragment length distribution is estimated from a single file or from all of them. I mean, presumably these are all from the same library, or else merging them at this level would be problematic for the statistics performed on the quantification.

ADD REPLY • link 6.9 years ago by Devon Ryan 104k

0

Check this older thread out

Be sure to check the replies within it. Also keep in mind you can merge the data after alignment as bam files with SAMtools.

ADD REPLY • link 6.9 years ago by lshepard ▴ 480

0

The previous thread talks mostly about using cat and glob patterns. There is also a mention of a particular tool, but it does not seem to merge in this specific way. Regarding alignment, yes, I could merge later, but I need to merge the FastQ files before alignment :)

ADD REPLY • link 6.9 years ago by shounak.chakraborty1990 ▴ 20

0

Unless I am missing something this looks like straight concatenation of the files. Why do you need sed/awk for this?

ADD REPLY • link 6.9 years ago by GenoMax 147k

0

Using cat on the files results in a file with all the reads from file1, followed by the reads from file2, and so on.

However, I want the first seven reads in my merged file to be the first reads from each of the seven files, the second seven reads to be the second reads from the seven files, and so on. The problem is that each entry in a FastQ file has four lines. That's why I had to use awk/sed to convert each entry into one line using a delimiter. Then I used the paste command to get the exact merge and substituted the delimiters with newline characters. But unfortunately this takes forever.

ADD REPLY • link 6.9 years ago by shounak.chakraborty1990 ▴ 20

0

Looking at the example you posted above this was not apparent. See

>File7 line3

>File7 line4

>File1 line1

>File1 line2

You should edit the post above and add this important text there. You could even remove the example altogether.

However, I want the first seven reads in my merged file to be the first reads from each of the seven files, the second seven reads to be the second reads from the seven files.

ADD REPLY • link 6.9 years ago by GenoMax 147k

0

Those are the last four lines of the example after which I put a couple of dots to indicate that the process continues. It is important to note that one read consists of four lines in a file. The first lines of the example explain the merging process quite clearly.

ADD REPLY • link 6.9 years ago by shounak.chakraborty1990 ▴ 20

2

It would be easier to describe it this way, in reads instead of lines. A FASTQ record is 4 lines by convention.

>File7 read3

>File7 read4

>File1 read5

>File1 read6

>File1 read7

>File1 read8

>File2 read5

>File2 read6..

ADD REPLY • link 6.9 years ago by GenoMax 147k

score 3 · Accepted Answer · 2017-12-22

3

6.9 years ago

WouterDeCoster 47k

A solution in Python 3. Save as interleave_fqs.py and execute as

python interleave_fqs.py file1.fastq file2.fastq .... fileN.fastq > mynewfile.fastq

It takes an arbitrary number of fastq files, which do not need to have the same length. Requires biopython. It picks one read from each file in turn and drops a file from the rotation once it is emptied.

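import sys

from Bio import SeqIO

# One lazy FASTQ parser per input file given on the command line.
fqs = [SeqIO.parse(f, "fastq") for f in sys.argv[1:]]

# Round-robin: each pass emits one read from every file that still has reads.
while fqs:
    # Iterate over a copy so that removing an exhausted parser
    # does not skip the next file in the same pass.
    for fq in list(fqs):
        try:
            print(next(fq).format("fastq"), end="")
        except StopIteration:
            fqs.remove(fq)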
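If installing biopython is not an option, the same round-robin can be sketched with the standard library alone by treating every four lines as one read. This is only a sketch under a few assumptions: strict 4-line records, uncompressed input, and files that end with a newline. The names `read_record` and `interleave_fastq` are made up for illustration, and nothing is validated; a truncated trailing record is written out as-is. Skipping the record parsing may well be faster on files this size, but I have not benchmarked it.

```python
import sys
from itertools import islice


def read_record(handle):
    """Return the next 4 lines as a list, or an empty list at end of file."""
    return list(islice(handle, 4))


def interleave_fastq(paths, out):
    """Write one record from each file in turn until every file is exhausted."""
    handles = [open(p) for p in paths]
    try:
        active = list(handles)
        while active:
            still_open = []
            for handle in active:
                record = read_record(handle)
                if record:
                    out.writelines(record)
                    # Keep this file in the rotation for the next pass.
                    still_open.append(handle)
            active = still_open
    finally:
        for handle in handles:
            handle.close()


if __name__ == "__main__":
    interleave_fastq(sys.argv[1:], sys.stdout)
```

It is called the same way as the script above, with the input files as arguments and the output redirected to a new file.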