Hi Yannick,
I also faced your problem when dealing with paired-end data. Following Keith's idea, I wrote a simple Python script to achieve this goal. (It keeps only the first occurrence of read pairs whose sequences are duplicated in both directions.) The only external dependency is Biopython.
Hope it helps others too.
#!/usr/bin/env python
# yupf05@gmail.com
import sys
import argparse
from Bio import SeqIO

def parse_args():
    p = argparse.ArgumentParser(
        description='Remove duplicated read pairs that have identical '
                    'sequences for both the forward and reverse reads; '
                    'the pair that appears first is kept.',
        epilog='Library dependency: Biopython')
    p.add_argument('input1', type=str, metavar='reads1',
                   help='forward input fastq file')
    p.add_argument('input2', type=str, metavar='reads2',
                   help='reverse input fastq file')
    if len(sys.argv) == 1:
        # No arguments given: show the help text and quit
        p.print_help(sys.stderr)
        sys.exit(0)
    return p.parse_args()

def main():
    unique_seqs = set()
    args = parse_args()
    with open("Rm_dupPE_" + args.input1, "w") as outfile1, \
         open("Rm_dupPE_" + args.input2, "w") as outfile2:
        fastq_iter1 = SeqIO.parse(args.input1, "fastq")
        fastq_iter2 = SeqIO.parse(args.input2, "fastq")
        for rec1, rec2 in zip(fastq_iter1, fastq_iter2):
            # Key on the concatenated forward+reverse sequence; write the
            # pair only the first time that combined sequence is seen
            key = str(rec1.seq) + str(rec2.seq)
            if key not in unique_seqs:
                SeqIO.write(rec1, outfile1, "fastq")
                SeqIO.write(rec2, outfile2, "fastq")
                unique_seqs.add(key)

if __name__ == '__main__':
    main()
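To run it, save the script as, say, Rm_dupPE.py (the filename is my choice, not part of the original post) and call it from the directory holding the inputs, since the output prefix is prepended to the whole argument:

python Rm_dupPE.py sample_1.fastq sample_2.fastq

This writes Rm_dupPE_sample_1.fastq and Rm_dupPE_sample_2.fastq. One caveat: the set stores the full concatenated sequence of every distinct pair, so memory grows with library complexity. If that becomes a problem, hashing the key before storing it, e.g. unique_seqs.add(hashlib.md5(key.encode()).digest()), would cut memory use considerably at the cost of a negligible collision risk.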
So Keith's suggestion works, but fastx_collapser actually needs to be replaced by brentp's code (see "Is There A Fastq Alternative To Fastx_Collapser (Outputs Fasta)?"), and Galaxy's fastq joiner/splitter are slow as hell, being written in pure Python.
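For what it's worth, the joining step itself is trivial: concatenate each mate pair into a single record, deduplicate with a single-end collapser, then split on the known read length. A minimal joiner sketch (my own illustration, not Galaxy's or brentp's actual code, assuming plain 4-line FASTQ and equal-length reads) could look like:

#!/usr/bin/env python
# Minimal paired-end FASTQ joiner: concatenates each mate pair into one
# record so a single-end collapser can deduplicate it. Illustration only,
# assuming plain 4-line FASTQ and equal-length reads.
import sys

def records(path):
    # Yield (header, sequence, quality) from a 4-line FASTQ file
    with open(path) as fh:
        for header in fh:
            seq = next(fh).rstrip()
            next(fh)  # skip the '+' separator line
            qual = next(fh).rstrip()
            yield header.rstrip(), seq, qual

def main(fq1, fq2):
    for (h1, s1, q1), (h2, s2, q2) in zip(records(fq1), records(fq2)):
        # Emit one joined record; with fixed-length reads the pair can be
        # split again at len(s1) after collapsing
        print(h1)
        print(s1 + s2)
        print('+')
        print(q1 + q2)

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])

Plain file iteration like this avoids the per-record object overhead, which I suspect is most of what makes the Galaxy tool slow.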