Question

EST Cleanup: SeqClean Alternatives?

2

14.1 years ago

Darked89 4.7k

I am mapping a large number of ESTs from other species to a draft of a novel genome. These EST sequences are contaminated with vectors, adaptors and ribosomal RNA, and are plagued by stretches of low-complexity sequence and other artifacts. I have already run SeqClean, but I am bewildered by its ability to leave intact, for example, the first 100 nucleotides of a read when they match a number of known vectors at 98-99% identity. The same goes for things that are easy to spot (ribosomal sequences and low-complexity regions). I assumed that such artifacts would not map easily, but somehow GMAP manages to map them anyway in --tolerant mode.

I have not benchmarked it yet, but if memory serves me right, in the fairly distant past I was getting more reliable output using pregap4 from the Staden package. There is also a new tool called SeqTrim. Has anybody used it already? Can you recommend anything else?

EDIT: With the default sequence library, SeqTrim is stricter than SeqClean (5539 vs 6522 non-zero-length sequences out of 6693). Using the same EST set and the same GMAP settings, I get 10464 vs 12992 cDNA_matches (after an extra step of removing ribosomal RNA sequences). SeqTrim does sometimes cut a reasonable-looking EST (e.g. GT153378.1) down to zero length. On the other hand, it kicks out rRNA quite well (just one EST missed vs 52 missed by SeqClean).
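For anyone who wants to reproduce this kind of count on their own data, here is a minimal sketch. It assumes the trimmer writes a FASTA file in which reads trimmed to nothing are still present as empty records; that is an assumption about the output layout, not something confirmed for either tool. The function names are made up for illustration.

```python
def fasta_lengths(lines):
    """Yield (record_id, sequence_length) for each FASTA record.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            header = line[1:].split()
            name, length = (header[0] if header else ""), 0
        elif name is not None:
            # Sequence may be wrapped over several lines; empty lines add 0.
            length += len(line)
    if name is not None:
        yield name, length


def survival_summary(lines):
    """Return (records with non-zero length, total records)."""
    lengths = [n for _, n in fasta_lengths(lines)]
    kept = sum(1 for n in lengths if n > 0)
    return kept, len(lengths)
```

Usage would be along the lines of `with open("trimmed.fasta") as fh: kept, total = survival_summary(fh)`, where `trimmed.fasta` is a placeholder file name.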

sequence est • 4.7k views

updated 14.0 years ago by Stephanie • written 14.1 years ago by Darked89

0

I am trying to use SeqTrim, but after a while I always get an out-of-memory message, after which the programme shuts down:

Out of memory!
Callback called exit at /software/shared/apps/x86_64/perl/5.8.9/lib/site_perl/5.8.9/Bio/SeqIO.pm line 676, [?] line 40615748.

Does anyone know of a decent tool that uses less memory?

My data, however, are Illumina genomic sequence reads. Whenever I run SeqTrim on a smaller file, it does work.

Anyone?
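Since smaller files do go through, one workaround is to split the input into chunks and run the trimmer on each chunk separately. Below is a minimal sketch, assuming plain four-line FASTQ records (unwrapped sequence and quality lines) and assuming that trimming each read does not depend on the other reads in the file. The function names and the chunk naming scheme are illustrative only, and paired-end files would need to be split in step with each other.

```python
import itertools


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 lines (assumes unwrapped FASTQ)."""
    handle = iter(handle)
    while True:
        record = list(itertools.islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record at end of input")
        yield record


def split_fastq(path, records_per_chunk, prefix):
    """Stream `path` into numbered chunk files; return the chunk paths.

    Only one record is held in memory at a time.
    """
    chunk_paths = []
    out = None
    try:
        with open(path) as handle:
            for index, record in enumerate(fastq_records(handle)):
                if index % records_per_chunk == 0:
                    # Start a new chunk file every `records_per_chunk` records.
                    if out is not None:
                        out.close()
                    chunk_path = f"{prefix}.{len(chunk_paths):04d}.fastq"
                    out = open(chunk_path, "w")
                    chunk_paths.append(chunk_path)
                out.writelines(record)
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Each chunk can then be cleaned on its own and the outputs concatenated afterwards.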

updated 5.3 years ago by Ram 44k • written 13.5 years ago by Stephanie • 0

score 3 · Answer 1 · 2010-12-02

3

14.1 years ago

Haibao Tang 3.0k

I have also used Lucy for removing vectors and low-quality nucleotides from Sanger reads.


score 2 · Answer 2 · 2010-12-02

For Sanger sequences, pregap4 is actually a good and very configurable tool, so why not stick with it? You can even write small plugins to add new or custom programs and filters to it. SeqTrim does not look bad, but I have never used it.

Apart from that, people I know use very different pipelines, in which pregap4, Lucy, cross_match, BLAST, SSAHA2 and SMALT are the tools I encounter most often for Sanger data. For 454, it's mostly the Roche pipeline (perhaps supported by SSAHA2/SMALT), while for Illumina I've seen SSAHA2, SMALT and the FASTX toolkit.

Ram · Answer 3 · 2010-12-23

2

14.0 years ago

James Hane ▴ 20

SeqClean can be pretty good if you modify the psx file where it calls the BLAST executable so that it uses the same parameters as NCBI VecScreen (-q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12). After this it should reproduce the same results as VecScreen, which is pretty sensitive.
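As a rough illustration, this is how a standalone legacy blastall call with those parameters might be assembled from Python. It is only a sketch: the file names and the UniVec database name are placeholders, the -p, -i, -d, -o and -m 8 options are the usual legacy blastall ones rather than anything taken from SeqClean, and it says nothing about how the psx script itself is laid out.

```python
import shlex

# VecScreen-style scoring and filtering parameters quoted in the answer above.
# Note that "m D" is passed as a single argument, without shell quotes.
VECSCREEN_PARAMS = [
    "-q", "-5", "-G", "3", "-E", "3",
    "-F", "m D", "-e", "700", "-Y", "1.75e12",
]


def vecscreen_blast_command(query_fasta, vector_db, out_path):
    """Build an argument list for a legacy `blastall` vector screen.

    All three arguments are placeholders chosen by the caller; `vector_db`
    must be a nucleotide database already formatted for legacy BLAST.
    """
    return [
        "blastall", "-p", "blastn",
        "-i", query_fasta,
        "-d", vector_db,
        "-o", out_path,
        "-m", "8",  # tabular output, easier to parse afterwards
    ] + VECSCREEN_PARAMS


if __name__ == "__main__":
    cmd = vecscreen_blast_command("ests.fasta", "UniVec", "ests_vs_univec.tsv")
    # Print the shell-quoted command instead of running it.
    print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.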

updated 5.3 years ago by Ram 44k • written 14.0 years ago by James Hane ▴ 20

Ram · Answer 4 · 2011-01-31

0

13.9 years ago

Vashar ▴ 20

SeqTrim results are the same as SeqClean results without the -v and -s options.

updated 5.3 years ago by Ram 44k • written 13.9 years ago by Vashar ▴ 20

score 0 · Answer 5 · 2011-05-04

I hope you got it to work! I have tried a bunch of vector screening tools and have not found one I am happy with. Whatever you end up using, make sure you verify the results, for instance by BLASTing against vector, linker and adaptor sequences. Short synthetic sequences in particular tend to show up unexpectedly, and because they are so short, their hits get very high E-values and are easy to miss with a strict cutoff.
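Because E-values are not much help for such short sequences, a direct scan can be a useful extra sanity check. Here is a minimal sketch that looks for a short adaptor in each cleaned read, on both strands, allowing a small number of mismatches. It is a plain sliding-window comparison, not BLAST, it does not handle indels, and the adaptor sequence in the usage note is a made-up placeholder.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatch_hits(read, adaptor, max_mismatches=1):
    """Return start positions where `adaptor` matches `read`
    with at most `max_mismatches` substitutions (no indels)."""
    read, adaptor = read.upper(), adaptor.upper()
    hits = []
    for start in range(len(read) - len(adaptor) + 1):
        window = read[start:start + len(adaptor)]
        mismatches = sum(1 for a, b in zip(window, adaptor) if a != b)
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


def scan_reads(reads, adaptors, max_mismatches=1):
    """Scan {read_id: sequence} against {adaptor_name: sequence}.

    Returns a list of (read_id, adaptor_name, strand, position) tuples.
    """
    found = []
    for read_id, seq in reads.items():
        for name, adaptor in adaptors.items():
            for strand, probe in (("+", adaptor), ("-", revcomp(adaptor))):
                for pos in mismatch_hits(seq, probe, max_mismatches):
                    found.append((read_id, name, strand, pos))
    return found
```

For example, `scan_reads({"est1": "TTTTACGTTGCAAGCTGGGG"}, {"adaptorX": "ACGTTGCAAGCT"})` reports the placeholder adaptor on the forward strand of `est1`. Any read that still shows a hit after cleaning is worth a closer look.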