How to generate sequence length distribution from Fasta file
2
2
Entering edit mode
4.9 years ago
2822462298 ▴ 120

I got a fasta file of miRNAs with sequence length ranging from 17-32. How can I generate a table of length distribution like:

length      count
17            x
18            y
19            z

Also, any idea if I use fastq files? Do I need to convert fastq to fasta first?

RNA-Seq sRNA-seq fatsa • 7.0k views
ADD COMMENT
3
Entering edit mode
4.9 years ago
Fatima ▴ 1000

You can use a script like this to find the lengths

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

https://www.danielecook.com/generate-fasta-sequence-lengths/

And this script should give you length and count

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,10) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | awk '{print $2}' | sort | uniq -c | awk '{print $1,$2}'

I changed the substr($0,2,100) to substr($0,2,10) because in my case my headers had tabs and that would mess up things, so I only got characters 2-10 of the header (character 1 was > ). You can play with this based on your headers or even use $1 instead of substr($0,2,100) or even get rid of it (see the next command)

Simpler version:

cat file.fa | awk '$0 ~ ">" {print c;} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'
ADD COMMENT
2
Entering edit mode

Wandered here from Google, a revision for others who land here: The 'Simpler version' has a bug, it needs to reset c=0 at each sequence header. If you do not add this, your code will print cumulative lengths rather than the length of each read.

cat file.fa | awk '$0 ~ ">" {print c;c=0} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'
ADD REPLY
0
Entering edit mode

Thanks, Nick, if I want to output a .tsv format table with tab-delimited length and count, how should I modify your command line?

ADD REPLY
0
Entering edit mode

Just found that cat file.fa | awk '$0 ~ ">" {print c;c=0} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $2 "\t" $1}' works for me.

ADD REPLY
0
Entering edit mode

Thanks Fatima, this works well for me!

ADD REPLY
1
Entering edit mode
4.9 years ago
Strand NGS ▴ 40

If you are comfortable in R, you can use the following code. Note: You will have to install the package seqinr before trying this

library(seqinr)
f <- read.fasta("input.fa")
table(getLength(f))
ADD COMMENT
1
Entering edit mode

Thank you! This is useful but it really took a long time to read the fasta file in r

ADD REPLY

Login before adding your answer.

Traffic: 1946 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6