Identify Identical Sequences In Fastq File
2 · 12.3 years ago

I have a FASTQ file with a lot of reads. I expect sets of identical sequences; in fact, I will be counting the occurrences of each unique sequence.

I am using Python and Biopython, and am trying to optimize this for a large file. I was wondering if there are any suggestions on how to do this?

What I have so far uses Biopython's fast FastqGeneralIterator and MD5 hashes:

    from Bio.SeqIO.QualityIO import FastqGeneralIterator
    import hashlib  # replaces the deprecated md5 module

    list_digest = []
    for title, seq, quals in FastqGeneralIterator(file_read_handle):
        # hash each sequence so the lookup key is small and fixed-size
        seq_digest = hashlib.md5(seq.encode()).digest()
        if seq_digest in list_digest:
            ...
        else:
            list_digest.append(seq_digest)
            ...

Is there any other technique for searching for exact sequence matches which might be more efficient?

Thanks very much.

fastq biopython

You may want to check for the reverse complement as well.
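
One way to handle that (a sketch, not from the thread: it uses the usual canonical-key trick and Biopython's reverse_complement) is to key the counts on the smaller of each sequence and its reverse complement:

    from Bio.Seq import reverse_complement

    seq = "ACGTTTCA"  # placeholder read sequence
    # both orientations of a read map to the same canonical key
    canonical = min(seq, reverse_complement(seq))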

4 · 12.3 years ago

Right off the bat, make sure that you use a set and not a list as your storage data structure. List lookups are linear, so your program won't finish even for small files; set membership tests are constant time on average.
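
For example, a minimal sketch of the counting loop using a collections.Counter (which is a dict underneath, so lookups and increments are O(1) on average; the file name is a placeholder):

    import hashlib
    from collections import Counter

    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    counts = Counter()  # MD5 digest of the sequence -> number of occurrences
    with open("reads.fastq") as handle:  # placeholder input file
        for title, seq, quals in FastqGeneralIterator(handle):
            counts[hashlib.md5(seq.encode()).digest()] += 1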

A simple command-line tool for this is fastx_collapser.

You can read a few good suggestions on how to do it in "How to remove the same sequences in the FASTA files".

Most of the techniques above will work on FASTQ files as well.


Thanks very much! These are good suggestions. As for FASTX, I have found that the 32-bit download only comes as version 0.0.12, which does not support Sanger-style quality scores. Weird. But I will try the other things. Thanks very much.


You just need to pass the -Q33 flag to the fastx tools to turn on Sanger mode.
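
For reference, a typical invocation might look like this (the -i/-o flags are standard FASTX-Toolkit options; the file names are placeholders):

    fastx_collapser -Q33 -i reads.fastq -o collapsed.fasta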


Thank you so much!

0 · 12.3 years ago · Ryan Thompson ★ 3.6k

If you have a reference to map to, it is better to identify identical reads by mapping location, not by sequence. If all you do is collapse identical sequences, you will miss duplicates that differ only by sequencing errors. If you map allowing an appropriate number of mismatches and then collapse reads mapped to the same location, you will be more tolerant of sequencing errors.
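
A rough sketch of the position-based collapse using pysam (an assumption on my part: the thread doesn't mention pysam, and a real pipeline would use a dedicated tool such as Picard MarkDuplicates; the BAM is assumed coordinate-sorted):

    import pysam

    seen = set()  # (reference, start position, strand) keys already kept
    with pysam.AlignmentFile("mapped.bam", "rb") as bam:  # placeholder file
        for read in bam:
            if read.is_unmapped:
                continue
            key = (read.reference_id, read.reference_start, read.is_reverse)
            if key not in seen:
                seen.add(key)  # first read at this position: keep it
            # any later read with the same key is treated as a duplicate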


Ideally the way this should work is that one first detects the exact duplicates, then aligns only the non-duplicated reads (thereby speeding up the process, assuming de-duplication is much faster than the other processing), and then performs a second level of de-duplication on the mapped positions. Depending on the duplication rate, the added complexity may not be worth it.

Another issue to deal with is which read to keep: the one with the highest quality, the lowest quality, a random one, etc.
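
Keeping the highest-quality copy could look something like this sketch (the mean-quality helper and the Phred+33 offset are my assumptions, not from the thread):

    from Bio.SeqIO.QualityIO import FastqGeneralIterator

    def mean_qual(quals, offset=33):
        # assumes Sanger/Phred+33 encoded quality strings
        return sum(ord(c) - offset for c in quals) / len(quals)

    best = {}  # sequence -> (mean quality, title, quals) of the best copy
    with open("reads.fastq") as handle:  # placeholder input file
        for title, seq, quals in FastqGeneralIterator(handle):
            q = mean_qual(quals)
            if seq not in best or q > best[seq][0]:
                best[seq] = (q, title, quals)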

Edit: it just occurred to me that, for example, bwa assigns reads that map to multiple locations to a single location selected at random, so one may not be able to de-duplicate by looking at the alignment alone.

