Question

choosing reads from a fastq file based on another fastq file

0

Entering edit mode

8.5 years ago

sumithrasank75 ▴ 140

I have two fastq files. I would like to choose reads from the second file, if the id of the read matches the id of the read in the first file. My script looks like this :

    read1_in_handle = open("new_read1.fastq","rU")
    read2_handle = open("read2.fastq","rU")
    read2_out_handle = open("new_read2.fastq","w")

    read1_dict = {}
    read1_record_iter = SeqIO.parse(read1_in_handle,"fastq")

    for read_record in SeqIO.parse(read2_handle, "fastq"):
            read1_record = next(read1_record_iter)
            if read_record.id in read1_dict:
                    print >> read2_out_handle,read_record.format("fastq")
                    read1_dict[read1_record.id] = read1_dict[read1_record.id] + 1
            else:
                    read1_dict[read1_record.id] = 1

I am not able to resolve the errors this gives. Pl suggest edits

next-gen • 2.3k views

ADD COMMENT • link updated 8.5 years ago by Pierre Lindenbaum 164k • written 8.5 years ago by sumithrasank75 ▴ 140

0

Entering edit mode

what is the error you're getting?

ADD REPLY • link 8.5 years ago by st.ph.n ★ 2.7k

score 1 · Answer 1 · 2016-06-20

You can do this with BBMap's "FilterByName" tool:

filterbyname.sh in=x.fastq names=y.fastq out=z.fastq include=t

But, whether you SHOULD do that or not depends on your goal. Are you, perhaps, trying to fix a situation where you had paired reads in two files, and one of the files is missing some of the reads? If so, they were processed non-optimally; the better solution is to start over with the original raw reads and process them in a way that keeps reads together and maintains the correct order. Can you explain the situation in more detail?

score 1 · Answer 2 · 2016-06-20

1

Entering edit mode

8.5 years ago

venu 7.1k

If you can go for alternative solution, use the following oneliner

Check starting letters of the each read id in file2 and use first 4 or 5 letters to extract ids.

grep '^@HWI' file2.fastq | xargs -I {} grep {} -A 3 -m 1 file1.fastq > new.fastq

ADD COMMENT • link 8.5 years ago by venu 7.1k

score 1 · Answer 3 · 2016-06-20

1

Entering edit mode

8.5 years ago

Pierre Lindenbaum 164k

using the join operator:

 join -t '       ' -1 1 -2 1   <(gunzip ids.fastq.gz | paste - - - - | cut -f 1 | LC_ALL=C sort | uniq) <(gunzip -c  reads.fastq.gz | paste - - - - | LC_ALL=C sort -t '       ' -k1,1) | tr "\t" "\n"

ADD COMMENT • link 8.5 years ago by Pierre Lindenbaum 164k

score 0 · Answer 4 · 2016-06-20

0

Entering edit mode

8.5 years ago

st.ph.n ★ 2.7k

Unless you're filtering the reads you're keeping from file 2 by quality, or using another program that requires fastq input, it would be easier to parse a fasta. However, this is how I would change your above code. The output is in fasta format.

def make_dict(seqs):
     d = {}
     for n in range(len(seqs)):
             d[seqs[n].id] = str(seqs[n].seq)
     return d

read1_in_handle = open("new_read1.fastq","rU")
    read2_handle = open("read2.fastq","rU")
    read2_out_handle = open("new_read2.fasta","w")

    read1_records = SeqIO.parse(read1_in_handle,"fastq")
 read2_records = SeqIO.parse(read2_handle, "fastq")

read1_dict = make_dict(read1_records)
read2_dict = make_dict(read2_records)

for i in read2_dict:
     if i in read1_dict:
          print >> read2_out_handle, '>' + i, '\n', read2_dict[i]

read2_out_handle.close()

ADD COMMENT • link 8.5 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

I need the output in fastq as I will be later filtering by quality

ADD REPLY • link 8.5 years ago by sumithrasank75 ▴ 140

2

Entering edit mode

Unless this is an exercise you are required to do in python you should just use @Brian's/@Venu's suggestions.

ADD REPLY • link 8.5 years ago by GenoMax 148k