Question

Combining The Paired Reads From Illumina Run

9

Entering edit mode

13.7 years ago

Surya Saha ▴ 260

Hi,

I have two fastq files with the forward(/1) and reverse(/2) paired reads. The reads are not in same order in either file, some pairs are absent/missing and the files are 8 GB each with abt 30 mill reads each.

I am trying to pull out all the paired reads for which both fwd and rev exist and running out of memory using a Bioperl script. Are there any C or C++ based efficient tools out there for doing this? Any algorithm ideas where I don't need to read in both read files into memory and can just parse through them.

Thanks!

-S.

paired bioperl scripting • 23k views

ADD COMMENT • link updated 11.5 years ago by Eric Normandeau 11k • written 13.7 years ago by Surya Saha ▴ 260

Ram · Answer 1 · 2013-06-05

16

Entering edit mode

11.5 years ago

Eric Normandeau 11k

(2013-06-09 EDIT: After a comment from a user, I have fixed a bug in the script. The version on GitHub is up-to-date)

This script that I wrote combines two fastq files that have been trimmed and contain orphans. It should be using MUCH less memory than the AWK script for big files:

https://github.com/enormandeau/Scripts/blob/master/fastqCombinePairedEnd.py

It reads one sequence from file1, searches in hash2 is the corresponding sequence has been read yet. If yes, it writes both sequences to files, if not, it adds the sequence to hash1. Then, a sequence is read from file2, with the same pattern until the end of both files.

Here would be an example call to it:

python fastqCombinePairedEnd.py  fastq1  fastq2

NOTE: If it can be assumed that sequences in both files are ordered, the script could be made much more memory efficient at the expense of a bit more computation. For example, if we read sequence X in fastq1 and that it is also in fastq2, it could be assumed that all sequences that have been added to hash1 before can be flushed. This would lead to a very low memory footprint.

ADD COMMENT • link 11.5 years ago by Eric Normandeau 11k

0

Entering edit mode

Thanks for great script! Very fast implementation. However, I would like to add that for me it works only in python 2.7.3 environment.

ADD REPLY • link 11.2 years ago by Biomonika (Noolean) 3.2k

0

Entering edit mode

My pleasure :) Do you mean it does not work with Python 3 (which it is not supposed to) or that it does not work in older environments (like Python 2.6)?

ADD REPLY • link 11.2 years ago by Eric Normandeau 11k

0

Entering edit mode

Works a treat,

Thanks Eric!

ADD REPLY • link 9.7 years ago by Jonathan Crowther ▴ 210

1

Entering edit mode

My pleasure to see that this script is still being used by new people.

ADD REPLY • link 9.5 years ago by Eric Normandeau 11k

0

Entering edit mode

Thanks for the script

ADD REPLY • link 9.2 years ago by dmenning • 0

0

Entering edit mode

This script is executing fine on my laptop...gives perfect results when I execute it on my laptop. But when I try to execute it on the server, it generates blank files. I don't know why. The operating system of my server is Debian. Do have any Idea like how can I fix this??

ADD REPLY • link 8.1 years ago by s.singh ▴ 70

1

Entering edit mode

The script was written for Python 2.7. There is a possibility it will not work if you are using Python 3.x the server. You can test this with which python.

ADD REPLY • link 8.1 years ago by Eric Normandeau 11k

0

Entering edit mode

It doesn't work for me.

$ python --version
Python 2.7.13

$ which python
/Library/Frameworks/Python.framework/Versions/2.7/bin/python

I run python fastqCombinePairedEnd.py fastq1 fastq2. My fastq1 is 4.6 MB and fastq2 is 4.5MB. The command run quickly. Two output files (_pairs_R1.fastq and _pairs_R2.fastq) do not have sequences (0 bytes). The xx_singles.fastq is 9.2 MB.

For more information:

The header of 1st sequence in fastq1:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=80.0; status=partial; direction=reverse; tail_quality=39.0; reverse_match=ccacctatcacataatcatg; seq_length_ori=151; seq_length=121; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.5267175573; head_quality=33.3; avg_quality=38.2119205298; grab=BI_0_1; reverse_primer=ccacctatcacayaatcatg;   1:N:0:1

The header of 1st sequence in fastq2:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=76.0; status=partial; direction=reverse; tail_quality=38.5; reverse_match=aagggcaccacaagaacgc; seq_length_ori=151; seq_length=122; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.3893129771; head_quality=34.6; avg_quality=38.1456953642; grab=BI_0_1; reverse_primer=aagggcaccacaagaacgc;   2:N:0:1

ADD REPLY • link updated 5.2 years ago by Ram 44k • written 7.2 years ago by hexphe ▴ 10

0

Entering edit mode

As I replied to you on GitHub: If you launch the command with " " at the end, it will tell the script to use a space as a separator. Your name format is a bit strange so possibly you'll need to test some other options. Please report if this solves your problem. If not, please contact me by email with sample files (~100 sequences per file) so we can find a solution.

ADD REPLY • link 7.2 years ago by Eric Normandeau 11k

1

Entering edit mode

The new script you sent to me by email works when I use " " as a separator. Thank you very much for your help!

ADD REPLY • link 7.2 years ago by hexphe ▴ 10

score 7 · Answer 2 · 2011-03-29

7

Entering edit mode

13.7 years ago

lh3 33k

I have not tried, but something following this line:

awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read1.fq | sort -S 8G -T. > read1.txt
awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read2.fq | sort -S 8G -T. > read2.txt
join read1.txt read2.txt | awk '{print $1"\n"$2"\n+\n"$3 > "r1.fq";print $1"\n"$4"\n+\n"$5 > "r2.fq"}'

Basically, you convert fastq to TAB delimited format, sort them by read names, join sorted files together and then convert back to fastq.

EDIT: strip "/[12]" in read names.

ADD COMMENT • link 13.7 years ago by lh3 33k

1

Entering edit mode

"join" discards orphans. If you want to keep them, you can output orphan lines with "join -a". Writing your own script to join is also easy given two sorted files.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

how does this deal with orphans?

ADD REPLY • link 13.7 years ago by Jeremy Leipzig 22k

0

Entering edit mode

I have a quite similar problem. Currently, I sort the file and reads one line by one line to do pairing, but that is time and memory consuming.

ADD REPLY • link 13.5 years ago by Gangcai ▴ 230