Question

Remove sequences below median or 75%-tile length in fasta files

0

Entering edit mode

9.2 years ago

lcscs12345 ▴ 30

I've hundreds of fasta files, each contain hundreds of sequences. Median sequence lengths are different between the files. I would like to remove sequences that are below median or 75%-tile length from each file. So far the scripts or tools I've came across such as USEARCH can only trim sequences based on user defined length. I'm looking for any useful ways to do the task including sed and awk. Any thought?

sequence • 3.0k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by lcscs12345 ▴ 30

0

Entering edit mode

There are lots of relevant posts that may help you:

How To Filter Multi Fasta By Length??

A: Fasta Length

ADD REPLY • link 9.2 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Thanks! But user defined length is not what I'm looking for because It is not sensible to process hundreds of file one by one.

ADD REPLY • link 9.2 years ago by lcscs12345 ▴ 30

1

Entering edit mode

You could script it. For example, if you have esl-seqstat in $PATH:

length=$(esl-seqstat file.fasta | grep -w "Average length" | tr -s " " | cut -f3 -d " "); echo $length

So in context of your 100s of files:

for f in *.fasta
do
  length=$(esl-seqstat $f | grep -w "Average length" | tr -s " " | cut -f3 -d " "); usearch -fileParameter $f -lengthParameter $length -otherParameters
done

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by 5heikki 11k

0

Entering edit mode

Simple and elegant! This is exactly what I'm looking for, thank you.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by lcscs12345 ▴ 30

Ram · Answer 1 · 2015-09-30

1

Entering edit mode

9.2 years ago

Brian Bushnell 20k

Using BBTools, you can remove sequences via length like this:

reformat.sh in=file.fa out=filtered.fa minlen=1000

You can get the distribution of lengths using the same program:

reformat.sh in=file.fa lhist=lhist.txt

...which will give you the number of sequences of any given length; you'd then need to process the resulting file to determine the X-percentile (it's 2-column tab-delimited). You can also get the L50 and N50 from stats.sh, which might be easier to parse. And readlength.sh also may have some useful features; it's slightly different. I think lhist.txt contains the mean, median, and mode in the header, which would be easy to parse as well.

I do not think you will find a tool that does exactly what you want, though, because it's a pretty odd request. Why do you want to do that?

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Thanks! I obtained the fasta files using blastn with low stringency settings. However, many partial sequence were returned even at E-value < 0.0001. Because I'm looking for conservation of the sequences, I don't want to compromise using a lower E-value.

ADD REPLY • link 9.2 years ago by lcscs12345 ▴ 30

0

Entering edit mode

It makes much more sense now. I still don't have a better answer, though.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Brian Bushnell 20k