choosing reads from a fastq file based on another fastq file
4
0
Entering edit mode
8.6 years ago

I have two fastq files. I would like to choose reads from the second file, if the id of the read matches the id of the read in the first file. My script looks like this :

    read1_in_handle = open("new_read1.fastq","rU")
    read2_handle = open("read2.fastq","rU")
    read2_out_handle = open("new_read2.fastq","w")

    read1_dict = {}
    read1_record_iter = SeqIO.parse(read1_in_handle,"fastq")

    for read_record in SeqIO.parse(read2_handle, "fastq"):
            read1_record = next(read1_record_iter)
            if read_record.id in read1_dict:
                    print >> read2_out_handle,read_record.format("fastq")
                    read1_dict[read1_record.id] = read1_dict[read1_record.id] + 1
            else:
                    read1_dict[read1_record.id] = 1

I am not able to resolve the errors this gives. Pl suggest edits

next-gen • 2.3k views
ADD COMMENT
0
Entering edit mode

what is the error you're getting?

ADD REPLY
1
Entering edit mode
8.6 years ago

You can do this with BBMap's "FilterByName" tool:

filterbyname.sh in=x.fastq names=y.fastq out=z.fastq include=t

But, whether you SHOULD do that or not depends on your goal. Are you, perhaps, trying to fix a situation where you had paired reads in two files, and one of the files is missing some of the reads? If so, they were processed non-optimally; the better solution is to start over with the original raw reads and process them in a way that keeps reads together and maintains the correct order. Can you explain the situation in more detail?

ADD COMMENT
1
Entering edit mode
8.6 years ago
venu 7.1k

If you can go for alternative solution, use the following oneliner

Check starting letters of the each read id in file2 and use first 4 or 5 letters to extract ids.

grep '^@HWI' file2.fastq | xargs -I {} grep {} -A 3 -m 1 file1.fastq > new.fastq
ADD COMMENT
1
Entering edit mode
8.6 years ago

using the join operator:

 join -t '       ' -1 1 -2 1   <(gunzip ids.fastq.gz | paste - - - - | cut -f 1 | LC_ALL=C sort | uniq) <(gunzip -c  reads.fastq.gz | paste - - - - | LC_ALL=C sort -t '       ' -k1,1) | tr "\t" "\n"
ADD COMMENT
0
Entering edit mode
8.6 years ago
st.ph.n ★ 2.7k

Unless you're filtering the reads you're keeping from file 2 by quality, or using another program that requires fastq input, it would be easier to parse a fasta. However, this is how I would change your above code. The output is in fasta format.

def make_dict(seqs):
     d = {}
     for n in range(len(seqs)):
             d[seqs[n].id] = str(seqs[n].seq)
     return d

read1_in_handle = open("new_read1.fastq","rU")
    read2_handle = open("read2.fastq","rU")
    read2_out_handle = open("new_read2.fasta","w")

    read1_records = SeqIO.parse(read1_in_handle,"fastq")
 read2_records = SeqIO.parse(read2_handle, "fastq")

read1_dict = make_dict(read1_records)
read2_dict = make_dict(read2_records)

for i in read2_dict:
     if i in read1_dict:
          print >> read2_out_handle, '>' + i, '\n', read2_dict[i]

read2_out_handle.close()
ADD COMMENT
0
Entering edit mode

I need the output in fastq as I will be later filtering by quality

ADD REPLY
2
Entering edit mode

Unless this is an exercise you are required to do in python you should just use @Brian's/@Venu's suggestions.

ADD REPLY

Login before adding your answer.

Traffic: 2424 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6