Question

How to generate sequence length distribution from Fasta file

2

Entering edit mode

4.8 years ago

2822462298 ▴ 120

I got a fasta file of miRNAs with sequence length ranging from 17-32. How can I generate a table of length distribution like:

length      count
17            x
18            y
19            z

Also, any idea if I use fastq files? Do I need to convert fastq to fasta first?

RNA-Seq sRNA-seq fatsa • 6.9k views

ADD COMMENT • link updated 10 weeks ago by xiaoleiusc ▴ 140 • written 4.8 years ago by 2822462298 ▴ 120

score 3 · Answer 1 · 2020-02-13

3

Entering edit mode

4.8 years ago

Fatima ▴ 1000

You can use a script like this to find the lengths

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'

https://www.danielecook.com/generate-fasta-sequence-lengths/

And this script should give you length and count

cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,10) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | awk '{print $2}' | sort | uniq -c | awk '{print $1,$2}'

I changed the substr($0,2,100) to substr($0,2,10) because in my case my headers had tabs and that would mess up things, so I only got characters 2-10 of the header (character 1 was > ). You can play with this based on your headers or even use $1 instead of substr($0,2,100) or even get rid of it (see the next command)

Simpler version:

cat file.fa | awk '$0 ~ ">" {print c;} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'

ADD COMMENT • link 4.8 years ago by Fatima ▴ 1000

2

Entering edit mode

Wandered here from Google, a revision for others who land here: The 'Simpler version' has a bug, it needs to reset c=0 at each sequence header. If you do not add this, your code will print cumulative lengths rather than the length of each read.

cat file.fa | awk '$0 ~ ">" {print c;c=0} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $1,$2}'

ADD REPLY • link 2.2 years ago by nick.carleson ▴ 30

0

Entering edit mode

Thanks, Nick, if I want to output a .tsv format table with tab-delimited length and count, how should I modify your command line?

ADD REPLY • link 10 weeks ago by xiaoleiusc ▴ 140

0

Entering edit mode

Just found that cat file.fa | awk '$0 ~ ">" {print c;c=0} $0 !~ ">" {c+=length($0);} END { print c; }' | sed '/^$/d' | sort | uniq -c | awk '{print $2 "\t" $1}' works for me.

ADD REPLY • link 10 weeks ago by xiaoleiusc ▴ 140

0

Entering edit mode

Thanks Fatima, this works well for me!

ADD REPLY • link 4.8 years ago by 2822462298 ▴ 120

Ram · Answer 2 · 2020-02-13

1

Entering edit mode

4.8 years ago

Strand NGS ▴ 40

If you are comfortable in R, you can use the following code. Note: You will have to install the package seqinr before trying this

library(seqinr)
f <- read.fasta("input.fa")
table(getLength(f))

ADD COMMENT • link updated 2.1 years ago by Ram 44k • written 4.8 years ago by Strand NGS ▴ 40

1

Entering edit mode

Thank you! This is useful but it really took a long time to read the fasta file in r

ADD REPLY • link 4.8 years ago by 2822462298 ▴ 120