comparing two fastq files
1
4
Entering edit mode
9.4 years ago

I have 2 fastq files F1.fastq and F2.fastq. F2.fastq is a smaller file which is a subset of reads from F1.fastq. I want reads in F1.fastq which ARE NOT in F2.fastq. The following python code does not seem to work. Can you suggest edits?

needed_reads = []
reads_array = []
chosen_array = []

for x in Bio.SeqIO.parse("F1.fastq","fastq"):
        reads_array.append(x)

for y in Bio.SeqIO.parse("F2.fastq","fastq"):
        chosen_array.append(y)

for y in chosen_array:
        for x in reads_array:

                if str(x.seq) != str(y.seq) : needed_reads.append(x)

output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()
next-gen sequencing • 9.5k views
ADD COMMENT
1
Entering edit mode

Instead of reinventing the wheel, why not use existing programs? cmpfastq is one such program.

ADD REPLY
1
Entering edit mode

Try it with tiny input files (like 2 records each, with only one in common) and you should be able to track down the bug.

ADD REPLY
1
Entering edit mode

as for the reason it does not work: your program checks each read in file 1 against each read in file 2 and adds it to output every it these two do not match, basically adding the same read many many times to the output

ADD REPLY
0
Entering edit mode

Yes, this is what is happening, but I am unable to get a way to fix this.

ADD REPLY
0
Entering edit mode

Frankly you should look at the results others gave you since this code has many other problems as well regarding its scalability - may not work all that well on large files.

As for your code you need to store the data first in a set and match against it something like this (not tested just typing it in):

results = []
seen = set()
for s in Bio.SeqIO.parse("F1.fastq","fastq"):
    seen.add(str(s.seq))

for s in Bio.SeqIO.parse("F2.fastq","fastq"):
    if str(s.seq) not in seen:
        results.append(x)

output_handle = open("DIFF.fastq","w")
SeqIO.write(results,output_handle,"fastq")
output_handle.close()
ADD REPLY
0
Entering edit mode

Instead of reinventing the wheel, why not use something like:

join -t '\t' -v1 -1 2 -2 2 \
<(gunzip -c f1.gz | paste - - - - | sort -t '\t' -k2,2) \
<(gunzip -c f2.gz | paste - - - - | sort -t '\t' -k2,2)
ADD REPLY
0
Entering edit mode

Hello sumithrasank75!

It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/30942734/getting-records-which-are-different-from-two-fastq-files

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY
3
Entering edit mode
9.4 years ago

With BBMap:

filterbyname.sh in=F1.fastq names=F2.fastq out=filtered.fastq include=f

You can also use dedupe for this with the "uniqueonly" flag if the names are different and you want the unique reads by sequence.

ADD COMMENT
0
Entering edit mode

I need the solution to be within the python environment

ADD REPLY

Login before adding your answer.

Traffic: 2331 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6