Hi,
I'm learning to work with Unix for bioinformatics, and I have a Word file with about 60 protein sequences pasted into it. Is there a way to find which protein has the most amino acids and which has the fewest?
Thank you
If you are asking whether it can be done in the Word application, the answer is no. Most commonly used bioinformatics programs don't read Word files, so it would be difficult to do it in other programs as well using the format you have. However, if you convert your file into plain text, there are tools that can do the task. The exact way of doing it will depend on your comfort in using these tools since most of them are not "point and click" applications. If you tell us what you have tried and what resources are available to you, it would be easier to give advice.
It may help to go through the results of a simple Google search for shortest and longest sequence in fasta file.
Pretty sure I'm going to be downvoted for this post.
You can paste your protein sequences into Excel and use the LEN() function, which returns the length of a string. You can then sort the sequences by length, or use the MIN() and MAX() functions to get the minimum and maximum sequence lengths.
Nah, there is no downvote option here.
Here is my question: is there a simple way of pasting protein sequences, presumably in FASTA format, from Word to Excel? All the ways I can think of involve more work than doing it outside of Word/Excel. However, this is a way of doing it using only Microsoft applications, so it may be preferable to those who like them.
Answering your comment that you now have it in FASTA format: you can use either R with Biostrings (which you seem to be learning now) or awk.
Example data:
cat test.fa
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG
1) Solution in R:
library(Biostrings)
#/ read fasta (for amino acid sequences, use readAAStringSet() instead):
fa <- readDNAStringSet("test.fa")
#/ get shortest and longest via width()
w <- width(fa)
fa_final <- fa[c(which(w==min(w)), which(w==max(w)))]
#/ save back to disk:
writeXStringSet(fa_final, "test2.fa")
2) Solution with awk (people much better at awk than me can for sure squeeze this into a single command):
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < test.fa \
| awk 'BEGIN {FS=OFS="\t"} {print $1, $2, length($2)}' \
| sort -t "$(printf '\t')" -k3,3n \
| awk -F'\t' 'NR==1 {print $1"\n"$2} END {print $1"\n"$2}'
>chr2
TAGC
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
First linearize the fasta (two columns tab separated), then print an additional column with the seq length, sort by length so shortest is the first and longest the last entry, then select first and last entry, and write back to fasta format.
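For what it's worth, the same steps can probably be squeezed into one awk pass that skips the sort and just remembers the shortest and longest record as it reads. This is only a minimal sketch, not a tested tool: the function name shortest_longest is made up for illustration, it assumes a plain-text FASTA file like test.fa above, and on ties it keeps the first record it saw.

```shell
# Recreate the example file from above (same five records).
cat > test.fa <<'EOF'
>chr1
ATGCTAGCTAGCATCG
>chr2
TAGC
>chr3
GATCGATCGATCG
>chr4
TGACTGATCGACTAGCTAGCTACGTACGTACGATGCA
>chr5
GATCGATCGTACGATCG
EOF

# shortest_longest is a hypothetical helper name, not an existing program.
shortest_longest() {
  awk '
    # Compare the record just finished against the current extremes.
    function update() {
      if (hdr == "") return
      n = length(seq)
      if (min_hdr == "" || n < min_len) { min_len = n; min_hdr = hdr; min_seq = seq }
      if (max_hdr == "" || n > max_len) { max_len = n; max_hdr = hdr; max_seq = seq }
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { update(); hdr = $0; seq = ""; next }
    # Sequence lines: drop stray spaces, tabs and carriage returns, then append,
    # so sequences wrapped over several lines are joined before measuring.
    { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END {
      update()
      if (min_hdr != "") { print min_hdr; print min_seq; print max_hdr; print max_seq }
    }
  ' "$1"
}

shortest_longest test.fa
```

On test.fa this prints the chr2 record followed by the chr4 record, the same as the pipeline above. Because it never splits lines into fields, headers that contain spaces (as most protein headers do) are kept whole.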
What have you tried?
Sorry for the late reply. I recently installed Bioconductor in R and was trying to do it there instead of in the Terminal on my Mac. My file is saved in .fa format.