Remove sequences below median or 75%-tile length in fasta files
1
0
Entering edit mode
9.2 years ago
lcscs12345 ▴ 30

I've hundreds of fasta files, each contain hundreds of sequences. Median sequence lengths are different between the files. I would like to remove sequences that are below median or 75%-tile length from each file. So far the scripts or tools I've came across such as USEARCH can only trim sequences based on user defined length. I'm looking for any useful ways to do the task including sed and awk. Any thought?

sequence • 3.0k views
ADD COMMENT
0
Entering edit mode

There are lots of relevant posts that may help you:

How To Filter Multi Fasta By Length??

A: Fasta Length

ADD REPLY
0
Entering edit mode

Thanks! But user defined length is not what I'm looking for because It is not sensible to process hundreds of file one by one.

ADD REPLY
1
Entering edit mode

You could script it. For example, if you have esl-seqstat in $PATH:

length=$(esl-seqstat file.fasta | grep -w "Average length" | tr -s " " | cut -f3 -d " "); echo $length

So in context of your 100s of files:

for f in *.fasta
do
  length=$(esl-seqstat $f | grep -w "Average length" | tr -s " " | cut -f3 -d " "); usearch -fileParameter $f -lengthParameter $length -otherParameters
done
ADD REPLY
0
Entering edit mode

Simple and elegant! This is exactly what I'm looking for, thank you.

ADD REPLY
1
Entering edit mode
9.2 years ago

Using BBTools, you can remove sequences via length like this:

reformat.sh in=file.fa out=filtered.fa minlen=1000

You can get the distribution of lengths using the same program:

reformat.sh in=file.fa lhist=lhist.txt

...which will give you the number of sequences of any given length; you'd then need to process the resulting file to determine the X-percentile (it's 2-column tab-delimited). You can also get the L50 and N50 from stats.sh, which might be easier to parse. And readlength.sh also may have some useful features; it's slightly different. I think lhist.txt contains the mean, median, and mode in the header, which would be easy to parse as well.

I do not think you will find a tool that does exactly what you want, though, because it's a pretty odd request. Why do you want to do that?

ADD COMMENT
0
Entering edit mode

Thanks! I obtained the fasta files using blastn with low stringency settings. However, many partial sequence were returned even at E-value < 0.0001. Because I'm looking for conservation of the sequences, I don't want to compromise using a lower E-value.

ADD REPLY
0
Entering edit mode

It makes much more sense now. I still don't have a better answer, though.

ADD REPLY

Login before adding your answer.

Traffic: 2884 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6