Entering edit mode
9.5 years ago
sumithrasank75
▴
140
I have 2 fastq files F1.fastq and F2.fastq. F2.fastq is a smaller file which is a subset of reads from F1.fastq. I want reads in F1.fastq which ARE NOT in F2.fastq. The following python code does not seem to work. Can you suggest edits?
needed_reads = []
reads_array = []
chosen_array = []
for x in Bio.SeqIO.parse("F1.fastq","fastq"):
reads_array.append(x)
for y in Bio.SeqIO.parse("F2.fastq","fastq"):
chosen_array.append(y)
for y in chosen_array:
for x in reads_array:
if str(x.seq) != str(y.seq) : needed_reads.append(x)
output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()
Instead of reinventing the wheel, why not use existing programs? cmpfastq is one such program.
Try it with tiny input files (like 2 records each, with only one in common) and you should be able to track down the bug.
as for the reason it does not work: your program checks each read in file 1 against each read in file 2 and adds it to output every it these two do not match, basically adding the same read many many times to the output
Yes, this is what is happening, but I am unable to get a way to fix this.
Frankly you should look at the results others gave you since this code has many other problems as well regarding its scalability - may not work all that well on large files.
As for your code you need to store the data first in a set and match against it something like this (not tested just typing it in):
Instead of reinventing the wheel, why not use something like:
Hello sumithrasank75!
It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/30942734/getting-records-which-are-different-from-two-fastq-files
This is typically not recommended as it runs the risk of annoying people in both communities.