Length of specific sequences from a multi-FASTA file
5.2 years ago

I have a list of FASTA headers in a file (list.txt). I want to get the lengths of those sequences from a large multi-FASTA file (main.fasta). How can I do this? Thank you.

sequence • 2.0k views

The awk commands do the job. Thank you.


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.

5.2 years ago
$ grep --no-group-separator -h -A 1 -wf file.txt *.fa | awk -v OFS="\t" 'NR==1 {print "sequence","length"}; BEGIN{RS=">"} NR>1 {print $1, length($2)}'

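If you have a single FASTA file, try the following line. Because RS=">" applies to both input files, the headers in file.txt must each start with ">", one per line. The NF test skips the empty record that precedes the first ">":

$ awk -v OFS="\t" 'NR==FNR {if (NF) a[$1]=$1; next} BEGIN{RS=">"}  ($1 in a) {print a[$1], length($2)}' file.txt a.fa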

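All of these commands assume that the FASTA files are linearized (each sequence on a single line). If you have only a few FASTA files, try the following; here the entries in file.txt must not include the leading ">", because the awk output omits it: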

$ awk -v OFS="\t"  'BEGIN{RS=">"} {print $1, length($2)}' *.fa | grep -f file.txt
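
The one-liners above depend on linearized FASTA. For files with wrapped sequences, a minimal sketch along the same lines could sum the lengths of all lines that belong to a record. It assumes that the ID is the first whitespace-delimited word of the header, that the list holds one ID per line (with or without a leading ">"), and that the list is not empty. The function name seq_lengths is hypothetical.

```shell
# seq_lengths LIST FASTA
# Hypothetical helper: prints "ID<TAB>length" for every FASTA record whose ID
# (first word of the header line) appears in LIST. Works with wrapped sequences.
seq_lengths() {
    awk -v OFS="\t" '
        # First file: store each wanted ID, dropping an optional leading ">".
        NR == FNR { id = $1; sub(/^>/, "", id); if (id != "") want[id]; next }
        # Header line: report the previous record if wanted, then start a new one.
        /^>/ {
            if (keep) print name, len
            name = substr($1, 2); len = 0; keep = (name in want)
            next
        }
        # Sequence line of a wanted record: strip whitespace and add its length.
        keep { gsub(/[ \t\r]/, ""); len += length($0) }
        # Report the last record if it was wanted.
        END { if (keep) print name, len }
    ' "$1" "$2"
}
```

It would be called as: seq_lengths list.txt main.fasta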
5.2 years ago
Mensur Dlakic ★ 28k

I suspect you will get several different solutions. For this one you will need a program called esl-seqstat from the HMMER package.

esl-seqstat -a main.fasta | grep -f list.txt > list-lengths.txt

The sequence length will be in the third column of list-lengths.txt.
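
One caveat with grep -f is that patterns match as substrings anywhere on the line, so an entry such as seq1 would also select seq10. A hedged sketch of an exact-match filter follows. It assumes that the per-sequence lines printed by esl-seqstat -a have the form "= name length", which is consistent with the length being in the third column. The function name filter_seqstat is hypothetical.

```shell
# filter_seqstat LIST
# Hypothetical helper: reads `esl-seqstat -a` output on stdin and prints
# "name<TAB>length" for names listed in LIST (exact match on the name column).
filter_seqstat() {
    awk -v OFS="\t" '
        # First file: store each wanted name, dropping an optional leading ">".
        NR == FNR { id = $1; sub(/^>/, "", id); if (id != "") want[id]; next }
        # Per-sequence lines are assumed to look like "= name length".
        $1 == "=" && ($2 in want) { print $2, $3 }
    ' "$1" -
}
```

It would be used as: esl-seqstat -a main.fasta | filter_seqstat list.txt
Note that this prints two columns, so the length is in the second column of its output.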
