With more and more genomic projects, we get tons of sequences from next generation sequencing in the lab, mostly from solid 454. I am looking for a way to automatically remove adaptors from these sequences.
The problem is rendered more difficult for a few reasons:
- The adaptor sequence can sometimes be only partially present.
- It can be present multiple times in a row (with certain preparation methodologies).
- There are (obviously) many sequences, up to a few millions.
- Both the .fasta and .qual files need to be modified.
As of now, I have not found a better approach than writing a custom program in Python. The approach I have implemented works, but I would still like to know what you use for that purpose. The main problem I find with this approach is that it searches for the adaptors with a degenerate pattern match, rather than doing a BLAST search per se.
Can you suggest a program that you have experience with and that would solve this problem?
Many thanks!
Hi @Jarretinha. The part I am less confident in is exactly the regex part. I feel that what is really needed is a form of BLAST, not a degenerate regex search. I may be mistaken. Maybe I do not see how to make the best use of regexes... Using regexes, how would you tackle the problem of searching for short sequences (15-30 bp) that may be incomplete and contain insertions or deletions? Thanks
Regexes are only useful when you know what you are looking for. For a given edit distance, I know it is possible to generate the set of sequence variants and map it to a regex. It's kind of a hash table of regexes. This way you can reduce the degeneracy. The table will be much smaller than the sequence set and can be used against a large chunk of sequences (instead of one read at a time). I've never compared this approach to blast/SW. Anyway, blast2 will certainly be way faster.
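To make the idea a bit more concrete, here is a minimal Python sketch of what I mean. It is only an illustration under several assumptions: the adaptor sits at the 3' end of the read, at most one edit (substitution, insertion or deletion) is tolerated, a truncated adaptor is only accepted at the very end of the read, and the quality values are already parsed into a list with one score per base. The adaptor `GATTACAGGC` and the names `one_edit_variants`, `adaptor_regex`, `trim_read` and `min_len` are made up for the example, and I collapse the "table" into a single compiled alternation for simplicity.

```python
import re

ALPHABET = "ACGT"

def one_edit_variants(adaptor):
    """Return the adaptor plus every sequence one edit away from it."""
    variants = {adaptor}
    for i in range(len(adaptor)):
        variants.add(adaptor[:i] + adaptor[i + 1:])                 # deletion
        for base in ALPHABET:
            variants.add(adaptor[:i] + base + adaptor[i + 1:])      # substitution
            if i > 0:
                # Internal insertion only: an extra base before or after the
                # adaptor is just flanking sequence, not an adaptor error.
                variants.add(adaptor[:i] + base + adaptor[i:])
    return variants

def adaptor_regex(adaptor, min_len=5):
    """Compile one regex: near-complete adaptor anywhere, or a truncated
    adaptor prefix (at least min_len bases) at the very end of the read."""
    full = "|".join(sorted(one_edit_variants(adaptor), key=lambda s: (-len(s), s)))
    prefixes = [adaptor[:n] for n in range(len(adaptor) - 1, min_len - 1, -1)]
    pattern = f"(?:{full})"
    if prefixes:
        pattern += "|(?:" + "|".join(prefixes) + ")$"
    return re.compile(pattern)

def trim_read(seq, quals, pattern):
    """Cut the read and its quality scores at the first adaptor match.
    Everything downstream is dropped, so repeated adaptors go away too."""
    match = pattern.search(seq)
    if match is None:
        return seq, quals
    cut = match.start()
    return seq[:cut], quals[:cut]

# Hypothetical adaptor and read, for illustration only.
ADAPTOR = "GATTACAGGC"
PATTERN = adaptor_regex(ADAPTOR)
read = "TTCCTTCCTT" + ADAPTOR + ADAPTOR
trimmed_seq, trimmed_quals = trim_read(read, [30] * len(read), PATTERN)
```

The regex is compiled once and reused for every read, which is where the saving comes from. Note that the number of variants grows quickly with the edit distance and the adaptor length, and that the leftmost match can start one base early when the read happens to end with the adaptor's first base, so I would check the results against a proper alignment on a sample before trusting it.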