(2013-06-09 EDIT: After a comment from a user, I have fixed a bug in the script. The version on GitHub is up-to-date)
This script that I wrote combines two fastq files that have been trimmed and contain orphans. It should use MUCH less memory than the AWK script for big files:
https://github.com/enormandeau/Scripts/blob/master/fastqCombinePairedEnd.py
It reads one sequence from file1 and checks in hash2 whether the corresponding sequence has been read yet. If yes, it writes both sequences to the output files; if not, it adds the sequence to hash1. Then a sequence is read from file2 and handled the same way, and this alternates until the end of both files.
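The alternating lookup described above can be sketched roughly as follows. This is not the code from the GitHub script, only a minimal illustration written for Python 3 (the script itself targets Python 2.7). The helper names `read_fastq`, `pair_key` and `combine` are hypothetical, and the rule that the text before the first "/" identifies a pair is an assumption.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, record) pairs; record is the four raw lines of one read."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return
        yield lines[0].rstrip("\n"), "".join(lines)


def pair_key(header, sep="/"):
    # Assumption: the text before the first separator identifies the pair.
    return header.split(sep)[0]


def combine(reads1, reads2, sep="/"):
    """Return (pairs1, pairs2, singles) from two iterables of (header, record)."""
    hash1, hash2 = {}, {}      # reads still waiting for a mate, per file
    pairs1, pairs2 = [], []
    # Alternate: one read from file 1, then one read from file 2.
    for r1, r2 in zip_longest(reads1, reads2):
        if r1 is not None:
            key = pair_key(r1[0], sep)
            if key in hash2:                 # mate already seen in file 2
                pairs1.append(r1[1])
                pairs2.append(hash2.pop(key))
            else:
                hash1[key] = r1[1]
        if r2 is not None:
            key = pair_key(r2[0], sep)
            if key in hash1:                 # mate already seen in file 1
                pairs1.append(hash1.pop(key))
                pairs2.append(r2[1])
            else:
                hash2[key] = r2[1]
    # Whatever is left in either hash never found a mate.
    singles = list(hash1.values()) + list(hash2.values())
    return pairs1, pairs2, singles


# Tiny in-memory demo: "a" and "c" are orphans, "b" and "d" are paired.
fq1 = io.StringIO("@a/1\nAC\n+\nII\n@b/1\nGT\n+\nII\n@d/1\nTT\n+\nII\n")
fq2 = io.StringIO("@b/2\nCA\n+\nII\n@c/2\nGG\n+\nII\n@d/2\nAA\n+\nII\n")
pairs1, pairs2, singles = combine(read_fastq(fq1), read_fastq(fq2))
```

The sketch collects results in lists to stay short; a real script would write each pair to the output files as soon as it is found, so that only the unmatched reads stay in memory.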
Here would be an example call to it:
python fastqCombinePairedEnd.py fastq1 fastq2
NOTE: If it can be assumed that the sequences in both files are in the same order, the script could be made much more memory efficient at the expense of a bit more computation. For example, if we read sequence X in fastq1 and its mate has already been read from fastq2, it can be assumed that all sequences that were added to hash1 before X can be flushed. This would lead to a very low memory footprint.
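A rough sketch of that flushing idea, under the assumption that both files list their reads in the same relative order. It is not part of the GitHub script. The names `flush_through` and `combine_ordered` are hypothetical, the inputs are simplified to (key, record) tuples, and the code relies on dictionaries keeping insertion order (Python 3.7+).

```python
from itertools import zip_longest


def flush_through(pending, key, singles):
    """Move every entry inserted before `key` to singles, then return key's record."""
    for k in list(pending):
        record = pending.pop(k)
        if k == key:
            return record
        singles.append(record)   # older than the match, so it must be an orphan


def combine_ordered(reads1, reads2):
    """reads1 and reads2 are iterables of (key, record) in the same relative order."""
    hash1, hash2 = {}, {}
    pairs, singles = [], []
    for r1, r2 in zip_longest(reads1, reads2):
        if r1 is not None:
            key, rec = r1
            if key in hash2:
                mate = flush_through(hash2, key, singles)
                # Everything still waiting in hash1 precedes this match: flush it.
                singles.extend(hash1.values())
                hash1.clear()
                pairs.append((rec, mate))
            else:
                hash1[key] = rec
        if r2 is not None:
            key, rec = r2
            if key in hash1:
                mate = flush_through(hash1, key, singles)
                singles.extend(hash2.values())
                hash2.clear()
                pairs.append((mate, rec))
            else:
                hash2[key] = rec
    singles.extend(hash1.values())
    singles.extend(hash2.values())
    return pairs, singles


pairs, singles = combine_ordered(
    [("p", "p1"), ("q", "q1")],
    [("a", "a2"), ("q", "q2")],
)
```

With this approach each hash only ever holds the reads seen since the last match, so how small the footprint gets depends on how far the two files drift apart between matches.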
Thanks for the great script! Very fast implementation. However, I would like to add that for me it works only in a Python 2.7.3 environment.
My pleasure :) Do you mean it does not work with Python 3 (which it is not supposed to) or that it does not work in older environments (like Python 2.6)?
Works a treat,
Thanks Eric!
My pleasure to see that this script is still being used by new people.
Thanks for the script
This script runs fine on my laptop and gives perfect results there. But when I try to run it on the server, it generates blank files, and I don't know why. The operating system of my server is Debian. Do you have any idea how I can fix this?
The script was written for Python 2.7. There is a possibility it will not work if you are using Python 3.x on the server. You can test this with:
which python
It doesn't work for me.
I ran:
python fastqCombinePairedEnd.py fastq1 fastq2
My fastq1 is 4.6 MB and fastq2 is 4.5 MB. The command ran quickly. The two output files (_pairs_R1.fastq and _pairs_R2.fastq) do not contain any sequences (0 bytes). The xx_singles.fastq file is 9.2 MB. For more information:
The header of 1st sequence in fastq1:
The header of 1st sequence in fastq2:
As I replied to you on GitHub: If you launch the command with " " at the end, it will tell the script to use a space as a separator. Your name format is a bit strange so possibly you'll need to test some other options. Please report if this solves your problem. If not, please contact me by email with sample files (~100 sequences per file) so we can find a solution.
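To illustrate why the separator can matter, here is a small sketch. The two headers are made up (loosely modeled on space-separated Illumina-style names) and are not the ones from the files discussed here. The `pair_key` helper is hypothetical and only assumes that the text before the first separator is what gets compared between the two files.

```python
def pair_key(header, sep):
    # Assumption: reads are matched on the text before the first separator.
    return header.split(sep)[0]


# Hypothetical headers for the two mates of one pair.
header1 = "@SEQ_ID:1:1101:1000:2000 1:N:0:1"
header2 = "@SEQ_ID:1:1101:1000:2000 2:N:0:1"

# With "/" there is nothing to split on, so the whole headers are compared
# and they differ ("1:N" vs "2:N"): every read would end up in the singles file.
slash_match = pair_key(header1, "/") == pair_key(header2, "/")

# With " " only the part before the space is compared, and the mates match.
space_match = pair_key(header1, " ") == pair_key(header2, " ")
```

Under these assumptions, a separator that does not occur in the read names would produce exactly the symptom reported: empty pairs files and a singles file about the size of both inputs combined.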
The new script you sent to me by email works when I use " " as a separator. Thank you very much for your help!