Removing Adaptors From .Fasta And .Qual Files
14.6 years ago

With more and more genomic projects, we get tons of sequences from next generation sequencing in the lab, mostly from solid 454. I am looking for a way to automatically remove adaptors from these sequences.

The problem is rendered more difficult for a few reasons:

  • The adaptor sequence can sometimes be only partially present.
  • It can be present multiple times on end (with certain preparation methodologies).
  • There are (obviously) many sequences, up to a few million.
  • Both the .fasta and .qual files need to be modified.
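
For the last point, a minimal Python sketch of keeping the two files in step. The function names and the half-open coordinates are illustrative assumptions, not taken from any existing tool; the only idea is that whatever slice is applied to the bases is applied to the quality scores too.

```python
def parse_qual_line(line):
    """Turn one line of a .qual record (space-separated integers) into a list."""
    return [int(token) for token in line.split()]


def keep_window(seq, quals, start, end):
    """Keep bases start..end-1 and the quality scores at the same positions.

    Hypothetical helper: raises if the two records are out of step,
    which is the failure mode to guard against when editing both files.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    return seq[start:end], quals[start:end]


# Example: drop a 1-base prefix and a 2-base suffix from a 6-base read.
trimmed_seq, trimmed_quals = keep_window(
    "ACGTAC", parse_qual_line("10 20 30 40 50 60"), 1, 4
)
```

Writing both outputs from the same (start, end) pair avoids the two files drifting apart.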

As of now, I have not found a better approach than writing a custom program in Python. The approach I have implemented works, but I would still like to know what you use for that purpose. The main problem I find with this approach is that it searches for a sequence using a degenerative process on the adaptors, rather than doing a BLAST per se.

Can you suggest a program that you have experience with and that would solve this problem?

Many thanks!

fasta adaptor sequence • 8.0k views
14.6 years ago

Outside the proprietary realm, I know of very few adaptor trimmers/removers.

Of course, there is Brad Chapman's blog entry (highly recommended).

I've played a little bit with SeqTrim which is a complete pipeline for seq preprocessing. I'm not sure about the file modification part.

And you can use ShortRead from Bioconductor.

But I didn't get the gist of your question. If you know how to read a fastq file in a language with a powerful regex API, what more do you need? I once made a low-quality region seeker in BioPerl. The purpose was to find contiguous regions of low quality score and to mask the associated sequence if it was in the middle of ORESTES reads, or remove it if it was at the ends. Your problem is similar.
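
To make the idea concrete, here is a rough Python sketch of such a low-quality region seeker. It is not the BioPerl code mentioned above; the threshold of 20, the minimum run length of 3 and the function names are all assumptions chosen for illustration.

```python
def low_quality_runs(quals, threshold=20, min_len=3):
    """Return half-open (start, end) spans where every score is below threshold."""
    runs = []
    start = None
    for i, q in enumerate(quals):
        if q < threshold:
            if start is None:
                start = i  # a low-quality run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the read.
    if start is not None and len(quals) - start >= min_len:
        runs.append((start, len(quals)))
    return runs


def mask_or_trim(seq, quals, threshold=20, min_len=3):
    """Trim low-quality runs touching either end; mask internal ones with N."""
    bases = list(seq)
    left, right = 0, len(seq)
    for start, end in low_quality_runs(quals, threshold, min_len):
        if start == 0:
            left = end                      # run at the 5' end: trim it
        elif end == len(seq):
            right = start                   # run at the 3' end: trim it
        else:
            bases[start:end] = "N" * (end - start)  # internal run: mask it
    return "".join(bases[left:right]), quals[left:right]
```

Because the quality list is sliced with the same bounds as the sequence, the .fasta and .qual outputs stay the same length.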


Hi @Jarretinha The part I am less confident in is exactly the regex part. I feel what is really needed is a form of BLAST, not a degenerate regex search. I may be mistaken. Maybe I do not see how to make the best use of regexes... How would you tackle the problem, using regexes, of searching for short sequences (15-30 bp) that may be incomplete and contain insertions or deletions? Thanks
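
One way to picture the kind of search being asked for, short of a full BLAST, is textbook approximate string matching by dynamic programming: the adaptor may start anywhere in the read, and each substitution, insertion or deletion costs one error. The sketch below is only an illustration under that assumption; `find_approx` and `max_errors` are made-up names, and it does not handle adaptors truncated by the end of the read.

```python
def find_approx(adaptor, read, max_errors):
    """Return (end, errors) for the best approximate occurrence, or None.

    `end` is the 1-based read position where the match ends; among equally
    good matches the leftmost end is reported.
    """
    m = len(adaptor)
    # col[i] = fewest edits to match adaptor[:i] against some substring of
    # the read ending at the current position; col[0] stays 0 so that a
    # match is free to start anywhere in the read.
    col = list(range(m + 1))
    best = None
    for j, base in enumerate(read, start=1):
        new = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if adaptor[i - 1] == base else 1
            new[i] = min(
                col[i - 1] + cost,  # match or substitution
                col[i] + 1,         # extra base in the read
                new[i - 1] + 1,     # adaptor base missing from the read
            )
        col = new
        if col[m] <= max_errors and (best is None or col[m] < best[1]):
            best = (j, col[m])
    return best
```

The running time is proportional to the adaptor length times the read length, which stays manageable for 15-30 bp adaptors.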


Regexes are only useful when you know what you are looking for. For a given edit distance, I know that it is possible to generate the subset of matching sequences and map it to a regex. It's kind of a hash table of regexes. This way you can reduce the degeneracy. The table will be much smaller than the sequence set and can be used against a large chunk of sequences (instead of one read at a time). I've never compared this approach to blast/SW. Anyway, blast2 will certainly be way faster.

14.6 years ago
Nico ▴ 190

We use the fastx_toolkit, developed in the Hannon lab at CSHL (mostly for Illumina reads, I believe): http://hannonlab.cshl.edu/fastx_toolkit/

and the command fastx_trimmer, which works on fasta or fastq. You might have to do some tricks to get it to work on fasta+qual files, but that seems like a good start.

For the issue of multiple occurrences (first time I hear of that), could you run the program multiple times (until you don't find it anymore)?
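
The "run it until nothing is found" idea, reduced to a small Python sketch. It assumes exact copies of the adaptor stacked at the 3' end, which is a simplification, and `strip_repeated_adaptor` is a hypothetical name rather than part of the fastx_toolkit.

```python
def strip_repeated_adaptor(seq, quals, adaptor):
    """Remove exact copies of the adaptor from the 3' end until none remain."""
    while adaptor and seq.endswith(adaptor):
        seq = seq[:-len(adaptor)]
        quals = quals[:len(seq)]  # keep the quality scores in step
    return seq, quals


clean_seq, clean_quals = strip_repeated_adaptor(
    "AACCTTGGTTGG", list(range(12)), "TTGG"
)
```

The same loop would work around any single-pass trimmer, as long as each pass removes at least one base when it finds a hit.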

13.5 years ago
Weronika ▴ 300

I'd suggest cutadapt. You could also check out how the Python HTSeq package deals with the problem.

Also see this question: How To Best Deal With Adapter Contamination (Illumina)?

14.6 years ago

When searching for short adaptor sequences over a large number of records, a good choice may be to generate all acceptable configurations of these adaptors, then do a lookup for them in each iteration.

Regular expressions, fuzzy matching, or trying to find the differences may be too time-consuming.
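
A small Python sketch of the enumerate-then-look-up idea, assuming that "acceptable" means at most one substitution, insertion or deletion. That limit, the DNA alphabet and both function names are assumptions made for the example.

```python
def variants_within_one_edit(adaptor, alphabet="ACGT"):
    """Return the adaptor plus every sequence exactly one edit away from it."""
    variants = {adaptor}
    for i in range(len(adaptor) + 1):
        for base in alphabet:
            variants.add(adaptor[:i] + base + adaptor[i:])          # insertion
        if i < len(adaptor):
            variants.add(adaptor[:i] + adaptor[i + 1:])             # deletion
            for base in alphabet:
                variants.add(adaptor[:i] + base + adaptor[i + 1:])  # substitution
    return variants


def find_adaptor_start(read, variants):
    """Return the index of the leftmost window found in the set, or -1."""
    lengths = sorted({len(v) for v in variants}, reverse=True)
    for start in range(len(read)):
        for n in lengths:
            if read[start:start + n] in variants:  # constant-time set lookup
                return start
    return -1


variants = variants_within_one_edit("ACGT")
```

The set is built once per adaptor and reused for every read, so the per-read cost is only window slicing and hashing. The set grows quickly with adaptor length and the number of allowed edits, so this only pays off for small edit distances.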


BLAT uses this approach. But it's only feasible for very short reads and a small edit distance. I always get stuck on the "generate the set of all possibilities" step.


@Istvan Albert This comment reassures me, since it is exactly the approach I have implemented. Is blasting a bad idea, then? I started with the premise that it would be the best approach. Cheers

14.6 years ago
Paulo Nuin ★ 3.7k

I won't give you code, only some ideas.

  • you can use a Python FASTA parser that lets you analyse the file sequence by sequence, something with a yield or anything else that checks one sequence at a time
  • whatever the approach (or language), something parallel or threaded is the key for you. Each core/thread might be able to check one sequence at a time.
  • for each sequence, use a regex to check for the adaptor sequence at the beginning or end of your sequence

That will give you some initial idea on how to do it, if you want to do it yourself.
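
As a rough illustration of the first and third ideas (not a full implementation, and leaving out the parallel part), a generator-based reader plus an anchored regex check might look like the sketch below. `read_fasta` and `adaptor_at_ends` are names made up for this example, and the check is for exact matches only.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # last record in the file


def adaptor_at_ends(seq, adaptor):
    """Return (at_start, at_end) using regexes anchored to the read ends."""
    pattern = re.escape(adaptor)
    at_start = re.match(pattern, seq) is not None
    at_end = re.search(pattern + r"$", seq) is not None
    return at_start, at_end


records = list(read_fasta([">r1", "ACGT", "TTTT", ">r2", "GGGG"]))
```

Passing an open file handle instead of a list works the same way, since the generator only iterates over lines and never loads the whole file.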


Hi @nuin I will clarify the question, but in essence, this is what I have ALREADY done. My question is rather: What do YOU use, maybe more in terms of a program already existing out there. Multi-threading could be fun to implement, but the process is already not too slow (less than 10 minutes) for around 1.5 million sequences.


Hi @nuin I have clarified the question, but in essence, this is what I have already done. My question is rather: What do YOU use, maybe more in terms of a program already existing out there. Multi-threading could be fun to implement, but the process is already not too slow (less than 10 minutes) for around 1.5 million sequences.


I don't deal with NGS data in my projects, so I cannot really advise anything in that regard. Seeing your comment below, a regex can remove (or at least find) your adaptors; just be careful to check their position in your read.
