5.8 years ago
namarino
Hello,
I have a fairly large FASTA file of protein sequences. I would like to filter it by length so that it excludes the sequences at the extremes (i.e., cut the shortest 10% and the longest 10%). I've looked through the forums, but only found threads where people filter by a specific length.
So far, I've tried:
awk '{/>/&&++a||b+=length()}END{print b/a}' uniprot_input.fasta
(from the thread "Mean Length Of Fasta Sequences")
// The average length is 216.817
However, I'm really new at using the command line, awk, and handling files. Thanks in advance for your help!
How To Filter Multi Fasta By Length??
Separate by size sequences in a fasta file
Why don't you try to write it yourself for practice? First, go through the file to get a summary of the lengths. Then calculate the 10th and 90th percentiles of the lengths and feed these into a new awk command with appropriate if statements, selecting only sequences longer than the 10th percentile and shorter than the 90th. I am sure you can get this done, and it will help you solve these things in the future.
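If it helps to see the shape of that approach, here is a minimal sketch using only awk, cut, and sort. The function names (fasta_lengths, length_percentile, filter_by_length) are made up for illustration, the percentile is a simple nearest-rank estimate, and the cutoffs are treated as inclusive, so sequences exactly at the cutoff lengths are kept. Adjust the comparisons if you want them excluded.

```shell
# Hypothetical helper: print "length<TAB>header" for every record,
# summing the lengths of sequences wrapped over several lines.
fasta_lengths() {
    awk '
        /^>/ { if (hdr != "") print len "\t" hdr; hdr = $0; len = 0; next }
        { len += length($0) }
        END  { if (hdr != "") print len "\t" hdr }
    ' "$1"
}

# Hypothetical helper: nearest-rank percentile of the sequence lengths.
# $1 = FASTA file, $2 = integer percentile (e.g. 10 or 90)
length_percentile() {
    fasta_lengths "$1" | cut -f1 | sort -n | awk -v p="$2" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            k = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
            if (k < 1) k = 1
            print v[k]
        }
    '
}

# Keep only records whose length lies between the two cutoffs (inclusive).
# $1 = FASTA file, $2 = minimum length, $3 = maximum length
filter_by_length() {
    awk -v min="$2" -v max="$3" '
        function emit() { if (hdr != "" && len >= min && len <= max) printf "%s\n%s", hdr, seq }
        /^>/ { emit(); hdr = $0; seq = ""; len = 0; next }
        { seq = seq $0 "\n"; len += length($0) }
        END  { emit() }
    ' "$1"
}

# Example usage (file names taken from the question, output name assumed):
#   q10=$(length_percentile uniprot_input.fasta 10)
#   q90=$(length_percentile uniprot_input.fasta 90)
#   filter_by_length uniprot_input.fasta "$q10" "$q90" > filtered.fasta
```

The file is read once per percentile and once more for the filter, which is slower than a single pass but keeps each step easy to check on its own.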
You can use seqkit to filter sequences by minimum and maximum length, namarino.
True, but how do you calculate 10th and 90th percentile of lengths?
It's simple: sort by length, calculate q10 and q90, and retrieve the sequences in this range.
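For readers without seqkit, the same sort-then-trim idea can be roughly sketched with awk and sort alone. This is not the seqkit command used in this reply. The function name trim_length_tails is hypothetical, headers are assumed to contain no tab characters, and the cut is made by rank (drop the first and last N% of records after sorting) rather than by a length threshold.

```shell
# Hypothetical helper: drop the shortest and longest fraction of records by rank.
# $1 = FASTA file, $2 = percent to trim from each end (e.g. 10)
# Assumes headers contain no tab characters.
trim_length_tails() {
    awk '
        # Linearise: one line per record as "length<TAB>header<TAB>sequence".
        function emit() { if (hdr != "") print length(seq) "\t" hdr "\t" seq }
        /^>/ { emit(); hdr = $0; seq = ""; next }
        { seq = seq $0 }
        END  { emit() }
    ' "$1" |
    sort -t "$(printf '\t')" -k1,1n |
    awk -F '\t' -v pct="$2" '
        { hdr[NR] = $2; seq[NR] = $3 }
        END {
            cut = int(NR * pct / 100)   # number of records dropped at each end
            for (i = cut + 1; i <= NR - cut; i++) print hdr[i] "\n" seq[i]
        }
    '
}

# Example usage (output name assumed):
#   trim_length_tails uniprot_input.fasta 10 > filtered.fasta
```

Note that the output comes back sorted by length with each sequence on a single line, and that the last awk step holds all records in memory, so this is only a sketch for files that fit comfortably in RAM.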
Tests:
Generating a test dataset: 100,000 seqs with lengths from 1 to 100 bp
Calculating q10 and q90, then choosing target range:
Sorting and retrieving
Length distribution (uniform distribution)
Before, After
Length distribution (normal distribution)
Before http://i66.tinypic.com/2r6l2kh.png , After http://i65.tinypic.com/5oawpw.png
Thank you! I hadn't been much aware of seqkit, but will definitely start implementing it.
It seems datamash can do this with its perc operation: https://www.gnu.org/software/datamash/manual/datamash.html
Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
Thank you! It was my first post, so I will definitely apply this in the future.