Question

How to get a FASTA file where you filter by length to exclude sortest and longest 10%

0

Entering edit mode

6.8 years ago

namarino ▴ 10

Hello,

I have a pretty big data FASTA file with protein sequences. I would like to filter it by length so that it excludes sequences that are at the ends (ie: Cut shortest 10% and the longest 10%). I've looked through the forums, but only see ones where people filter by specific length

So far, I've tried:

awk '{/>/&&++a||b+=length()}END{print b/a}' uniprot_input.fasta

From Mean Length Of Fasta Sequences

// The average length is 216.817

However, I'm really new at using the command line, awk, and handling files. Thanks in advance for your help!

alignment sequence fasta linux awk • 4.1k views

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 6.8 years ago by namarino ▴ 10

0

Entering edit mode

How To Filter Multi Fasta By Length??

Separate by size sequences in a fasta file

ADD REPLY • link 6.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

Why don't you try to write it yourself for practice. First, go through the file to get a summary of the lengths. Then calculate what the shortest 10% and longest 90% are and feed this in a new awk command with appropriate if statements, only selecting sequences above the 10% and below the 90%. I am sure you can get this done, and it will help you solve these things in the future.

ADD REPLY • link 6.8 years ago by ATpoint 90k

0

Entering edit mode

you can use seqkit to filter sequences by min and maximum length... namarino

ADD REPLY • link 6.8 years ago by cpad0112 21k

0

Entering edit mode

True, but how do you calculate 10th and 90th percentile of lengths?

ADD REPLY • link 6.8 years ago by Ram 45k

1

Entering edit mode

It's simple: sorting by length, calculating q10 and q90, and retrieving seqs in this range.

# Step 1
#  Getting total number
n=$(seqkit stats test.fa -T | sed 1d | cut -f 4)

# Step 2, option A
# "seqkit sort -l" needs unique headers.
# "seqkit range" retrieve records in a range.
seqkit rename test.fa | seqkit sort -l \
    | seqkit range -r $(($n/10+1)):$(($n*9/10)) > result.fa

# Step 2, option B. Use this ↓↓↓
# another version keeping original FASTA header
seqkit fx2tab -l test.fa | sort -k4,4n | seqkit tab2fx \
    | seqkit range -r $(($n/10+1)):$(($n*9/10)) > result.fa

Tests:

Generating test dataset: 100000 seqs in length of 1 to 100bp

$ cat gen-rand-seqs.py 
#!/usr/bin/env python
import numpy as np

for n in ( x * 100 for x in np.random.normal(0.5, 0.1, 100000) ):
    print(">{}\n{}".format(int(n), "A"*int(n)))

#normal distribution
python gen-rand-seqs.py > test.fa

# uniform distribution
perl -e 'for (1..100000){$n=rand(100)+1; printf(">%d\n%s\n", $n, "A" x $n);}' > test.fa

# plot length distribution
seqkit fx2tab -l test.fa | cut -f 4 | csvtk plot hist -H --x-min 0 --x-max 100 --bins 50  > 1.png

# summary
seqkit stats test.fa
file     format  type  num_seqs    sum_len  min_len  avg_len  max_len
test.fa  FASTA   DNA    100,000  4,953,319        8     49.5       99

Calculating q10 and q90, then choosing target range:

n=$(grep -c '^>' test.fa)

echo $(($n/10+1)):$(($n*9/10))
10001:90000

Sorting and retrieving

seqkit fx2tab -l test.fa | sort -k4,4n | seqkit tab2fx \
    | seqkit range -r $(($n/10+1)):$(($n*9/10)) > result.fa

# length distribution
seqkit fx2tab -l result.fa | cut -f 4 | csvtk plot hist -H --x-min 0 --x-max 100 --bins 40 > 2.png

Length distribution (uniform distribution)

Before

After
Length distribution (normal distribution)

Before http://i66.tinypic.com/2r6l2kh.png , After http://i65.tinypic.com/5oawpw.png

ADD REPLY • link 6.8 years ago by shenwei356 8.7k

0

Entering edit mode

Thank you! I hadn't been much aware of seqkit, but will definitely start implementing it.

ADD REPLY • link 6.8 years ago by namarino ▴ 10

1

Entering edit mode

seems datamash can with perc operation. https://www.gnu.org/software/datamash/manual/datamash.html

ADD REPLY • link 6.8 years ago by shenwei356 8.7k

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting