Question

comparing two fastq files

4

Entering edit mode

9.4 years ago

sumithrasank75 ▴ 140

I have 2 fastq files F1.fastq and F2.fastq. F2.fastq is a smaller file which is a subset of reads from F1.fastq. I want reads in F1.fastq which ARE NOT in F2.fastq. The following python code does not seem to work. Can you suggest edits?

needed_reads = []
reads_array = []
chosen_array = []

for x in Bio.SeqIO.parse("F1.fastq","fastq"):
        reads_array.append(x)

for y in Bio.SeqIO.parse("F2.fastq","fastq"):
        chosen_array.append(y)

for y in chosen_array:
        for x in reads_array:

                if str(x.seq) != str(y.seq) : needed_reads.append(x)

output_handle = open("DIFF.fastq","w")
SeqIO.write(needed_reads,output_handle,"fastq")
output_handle.close()

next-gen sequencing • 9.5k views

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.4 years ago by sumithrasank75 ▴ 140

1

Entering edit mode

Instead of reinventing the wheel, why not use existing programs? cmpfastq is one such program.

ADD REPLY • link 9.4 years ago by komal.rathi ★ 4.1k

1

Entering edit mode

Try it with tiny input files (like 2 records each, with only one in common) and you should be able to track down the bug.

ADD REPLY • link 9.4 years ago by matted 7.8k

1

Entering edit mode

as for the reason it does not work: your program checks each read in file 1 against each read in file 2 and adds it to output every it these two do not match, basically adding the same read many many times to the output

ADD REPLY • link 9.4 years ago by Istvan Albert 102k

0

Entering edit mode

Yes, this is what is happening, but I am unable to get a way to fix this.

ADD REPLY • link 9.4 years ago by sumithrasank75 ▴ 140

0

Entering edit mode

Frankly you should look at the results others gave you since this code has many other problems as well regarding its scalability - may not work all that well on large files.

As for your code you need to store the data first in a set and match against it something like this (not tested just typing it in):

results = []
seen = set()
for s in Bio.SeqIO.parse("F1.fastq","fastq"):
    seen.add(str(s.seq))

for s in Bio.SeqIO.parse("F2.fastq","fastq"):
    if str(s.seq) not in seen:
        results.append(x)

output_handle = open("DIFF.fastq","w")
SeqIO.write(results,output_handle,"fastq")
output_handle.close()

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.4 years ago by Istvan Albert 102k

0

Entering edit mode

Instead of reinventing the wheel, why not use something like:

join -t '\t' -v1 -1 2 -2 2 \
<(gunzip -c f1.gz | paste - - - - | sort -t '\t' -k2,2) \
<(gunzip -c f2.gz | paste - - - - | sort -t '\t' -k2,2)

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.4 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Hello sumithrasank75!

It appears that your post has been cross-posted to another site: http://stackoverflow.com/questions/30942734/getting-records-which-are-different-from-two-fastq-files

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.4 years ago by Pierre Lindenbaum 164k

Ram · Answer 1 · 2015-06-19

3

Entering edit mode

9.4 years ago

Brian Bushnell 20k

With BBMap:

filterbyname.sh in=F1.fastq names=F2.fastq out=filtered.fastq include=f

You can also use dedupe for this with the "uniqueonly" flag if the names are different and you want the unique reads by sequence.

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.4 years ago by Brian Bushnell 20k

0

Entering edit mode

I need the solution to be within the python environment

ADD REPLY • link 9.4 years ago by sumithrasank75 ▴ 140