You can use a script like this to find the sequence lengths:
cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
https://www.danielecook.com/generate-fasta-sequence-lengths/
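As a quick illustration (using a made-up three-record test.fa, not a file from the original post):

>seq1
ACGTACGT
ACGT
>seq2
ACGTAC
>seq3
ACGTAC

the command prints one header/length pair per line, plus a blank first line because c is still empty when the first header is seen (that blank line is what the sed '/^$/d' in the next command cleans up):

seq1	12
seq2	6
seq3	6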
And this script should give you length and count:
cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,10) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | awk '{print $2}' | sort | uniq -c | awk '{print $1,$2}'
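With the same made-up test.fa as above, that pipeline prints count followed by length, i.e. one sequence of length 12 and two sequences of length 6:

1 12
2 6

Note that the awk '{print $2}' step assumes the header text printed before the tab contains no whitespace; that is what the substr change discussed next is about.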
I changed the substr($0,2,100) to substr($0,2,10) because in my case the headers contained tabs, which would throw off the field splitting later in the pipeline, so I only kept characters 2-10 of the header (character 1 was the >). You can adjust this range to suit your headers, use $1 instead of substr($0,2,100), or drop the header entirely (see the next command).
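For example (a hypothetical variant combining the two suggestions above, not a command from the original post), substr($1,2) keeps only the first whitespace-delimited word of the header and drops the leading >, so a header like ">seq1 some description" comes out as just "seq1":

cat file.fa | awk '$0 ~ ">" {print c; c=0; printf substr($1,2) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d'

which gives lines like seq1	12.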
Simpler version:
cat file.fa | awk '$0 ~ ">" {print c;} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'
Wandered here from Google, a revision for others who land here: the 'Simpler version' has a bug: it needs to reset c=0 at each sequence header. If you do not add this, the code will print cumulative lengths rather than the length of each read.

Thanks, Nick, if I want to output a .tsv-format table with tab-delimited length and count, how should I modify your command line?
Just found that

cat file.fa | awk '$0 ~ ">" {print c;c=0} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $2 "\t" $1}'

works for me.

Thanks Fatima, this works well for me!
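(For reference: with the made-up test.fa from earlier, that last command prints a two-column, tab-delimited length/count table that can be redirected straight into a .tsv file:

12	1
6	2
)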