Question

How to find the fasta file with the maximum number of amino acids?

4.0 years ago

shahbaz.ahmed ▴ 40

Hi,

I'm learning to work with Unix for bioinformatics and I have a Word file with some 60 protein sequences pasted into it. Is there a way to find which protein has the maximum number of amino acids and which has the minimum?

Thank you

protein fasta sequence • 2.9k views

ADD COMMENT • link 3.9 years ago by shahbaz.ahmed ▴ 40

What have you tried?

ADD REPLY • link 4.0 years ago by swbarnes2 15k

Sorry for the late reply. I recently installed Bioconductor in R and was trying to do it there instead of in the terminal on my Mac. My file is saved in .fa format.

ADD REPLY • link 3.9 years ago by shahbaz.ahmed ▴ 40

score 1 · Answer 1 · 2021-06-29

If you are asking whether it can be done in the Word application, the answer is no. Most commonly used bioinformatics programs don't read Word files, so it would be difficult to do it in other programs as well using the format you have. However, if you convert your file into plain text, there are tools that can do the task. The exact way of doing it will depend on your comfort in using these tools since most of them are not "point and click" applications. If you tell us what you have tried and what resources are available to you, it would be easier to give advice.

It may help to go through the results of a simple Google search for shortest and longest sequence in fasta file.

score 1 · Answer 2 · 2021-06-29

4.0 years ago

Andrzej Zielezinski 11k

Pretty sure I'm going to be downvoted for this post.

You can paste your protein sequences into Excel and use the LEN() function, which returns the length of a string. You can then sort the sequences by length, or use the MIN() and MAX() functions to get the minimum and maximum sequence lengths.

Finding length of sequences in Excel

ADD COMMENT • link 4.0 years ago by Andrzej Zielezinski 11k

Nah, there is no downvote option here.

Here is my question: is there a simple way of pasting protein sequences, presumably in FASTA format, from Word to Excel? All the ways I can think of involve more work than doing it outside of Word/Excel. However, this is a way of doing it using only Microsoft applications, so it may be preferable to those who like them.

ADD REPLY • link 4.0 years ago by Mensur Dlakic ★ 29k

I can't think of an easy way to put sequences into an Excel sheet. It would probably require multiple uses of the "Find and replace" option. To be clear, my answer was a bit satirical.

ADD REPLY • link 4.0 years ago by Andrzej Zielezinski 11k

OP didn't bother to say if the files were fasta format or not. There might be a way to make a macro to convert a fasta into single-line entries like that, but if one is going to learn how to program to do this, learning how to make macros might not be the best choice.
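
For what it's worth, the single-line layout does not strictly need a macro. Below is a minimal sketch in awk, assuming the sequences have already been saved as a plain-text FASTA file. The file names, the two made-up demo records, and the function name fasta_to_tsv are hypothetical and only there for illustration. It writes one tab-separated line per record (header, sequence, length), which can be opened or pasted into a spreadsheet.

```shell
# Hypothetical demo file with made-up sequences; note the wrapped
# sequence lines and the header that contains spaces.
cat > proteins_demo.fa <<'EOF'
>prot1 example protein
MKTAYIAKQR
QISFVKSHFS
>prot2
MSTNPKPQRK
EOF

# Convert FASTA to one line per record: header<TAB>sequence<TAB>length
fasta_to_tsv() {
  awk '
    # print the record collected so far, if there is one
    function emit() { if (hdr != "") printf "%s\t%s\t%d\n", hdr, seq, length(seq) }
    { sub(/\r$/, "") }                                # strip a trailing carriage return, if any
    /^>/ { emit(); hdr = substr($0, 2); seq = ""; next }  # new header: flush the previous record
         { seq = seq $0 }                             # join wrapped sequence lines
    END  { emit() }                                   # flush the last record
  ' "$1"
}

fasta_to_tsv proteins_demo.fa > proteins_demo.tsv
```

With the length already in the third column, sorting on that column in the spreadsheet puts the shortest and longest proteins at the two ends.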

ADD REPLY • link 4.0 years ago by swbarnes2 15k

score 1 · Answer 3 · 2021-07-05

Following up on your comment that the file is now in fasta format: you can use either R with Biostrings (which you seem to be learning now) or awk.

Example data:

cat test.fa
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG

1) Solution in R:

library(Biostrings)

#/ read fasta (for amino acid sequences, use readAAStringSet() instead):
fa <- readDNAStringSet("test.fa")

#/ get shortest and longest via width()
w <- width(fa)
fa_final <- fa[c(which(w==min(w)), which(w==max(w)))]

#/ save back to disk:
writeXStringSet(fa_final, "test2.fa")

2) Solution with awk (people much better at awk than me can for sure squeeze this into a single command):

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < test.fa \
| awk -F'\t' 'BEGIN {OFS="\t"} {print $1, $2, length($2)}' \
| sort -t "$(printf '\t')" -k3,3n \
| awk -F'\t' 'NR==1 {print $1"\n"$2} END {print $1"\n"$2}'
>chr2
TAGC
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA

First linearize the fasta (two columns tab separated), then print an additional column with the seq length, sort by length so shortest is the first and longest the last entry, then select first and last entry, and write back to fasta format.
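
As for squeezing this into a single command, here is one possible sketch: a single awk pass that remembers the shortest and the longest record as it reads, so no sorting is needed. The file name demo.fa (a copy of the example data above) and the function name fasta_minmax are hypothetical. It assumes a plain-text FASTA file and, on ties, keeps the first record seen.

```shell
# Hypothetical demo file; same records as test.fa above.
cat > demo.fa <<'EOF'
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG
EOF

# Print the shortest record, then the longest record, in FASTA format.
fasta_minmax() {
  awk '
    # compare the record collected so far against the current min and max
    function update(   n) {
      if (hdr == "") return
      n = length(seq)
      if (minhdr == "" || n < minlen) { minlen = n; minhdr = hdr; minseq = seq }
      if (maxhdr == "" || n > maxlen) { maxlen = n; maxhdr = hdr; maxseq = seq }
    }
    { sub(/\r$/, "") }                           # strip a trailing carriage return, if any
    /^>/ { update(); hdr = $0; seq = ""; next }  # new header: score the previous record
         { seq = seq $0 }                        # join wrapped sequence lines
    END  { update(); if (minhdr != "") print minhdr "\n" minseq "\n" maxhdr "\n" maxseq }
  ' "$1"
}

fasta_minmax demo.fa
```

On the example data this prints the same four lines as the pipeline above (chr2 followed by chr4). Because the header line is kept whole, protein headers that contain spaces are not a problem here.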