Question

Script To Calculate Length Distribution Of A Collapsed Fasta File

0

Entering edit mode

11.0 years ago

liran0921 ▴ 150

Dear all,

I have a fasta file with the format as follows:

 >1-698675
TAATACTGCCTGGTAATGATGACT
 >2-69532
TACTGCCTGGTAATGA
 >3-6954
ATACTGCCTGGTAATGATGACT

The header >1-698675 indicates the read ID is 1 and the read count is 698675.

What i want to get is the length distribution of all the reads (eg, how many reads have the length of 17, 18,19, etc.). But as you see, the reads are collapsed so the available script I found online cannot apply to this fasta file. I appreciate if someone can provide the script to calculate the length distribution of this kind of fasta file.

Many thanks.

fasta length • 7.6k views

ADD COMMENT • link updated 5.0 years ago by Antonio R. Franco ★ 5.2k • written 11.0 years ago by liran0921 ▴ 150

1

Entering edit mode

What you mean by collapsed? Do you mean that the entire sequence is in one line ? Also, do you want to get the distribution of read count or the length of the reads. I am confused why you told us about read id and read count.

ADD REPLY • link 11.0 years ago by Ashutosh Pandey 12k

score 3 · Answer 1 · 2013-12-10

3

Entering edit mode

11.0 years ago

Devon Ryan 105k

I'm assuming that these are miRNAs or some other small RNA and you want the distribution to account for the collapsing of identical reads (i.e., the 698675 reads of length 24 in the first entry should be added to however many copies of the next 24-length read comes next and so on). I also assume that the reads don't span lines, etc., etc.:

#!/usr/bin/env python
import sys
f = open(sys.argv[1], "r")

dists={}
while(1) :
    header=f.readline().rstrip()
    seq=f.readline().rstrip()
    if(header == "" or seq == "") :
        break

    offset = header.find("-")+1
    c = int(header[offset:])

    if(("%s" % len(seq)) not in dists) :
        dists.update({("%s" % len(seq)):c})
    else :
        dists[("%s" % len(seq))] += c

f.close()
for x in dists.keys() :
    print("%s: %i" % (x, dists[x]))

Then python foo.py some_file.fa if you saved that as "foo.py".

ADD COMMENT • link 11.0 years ago by Devon Ryan 105k

1

Entering edit mode

Got it. Hasn't worked with miRNA for a while so i was kind of confused. Thanks.

ADD REPLY • link 11.0 years ago by Ashutosh Pandey 12k

0

Entering edit mode

Yeah, the original question wasn't exactly posed in a great way.

ADD REPLY • link 11.0 years ago by Devon Ryan 105k

0

Entering edit mode

Thanks! That's exactly what I want.

ADD REPLY • link 11.0 years ago by liran0921 ▴ 150

score 1 · Answer 2 · 2013-12-10

1

Entering edit mode

11.0 years ago

Gabriel R. ★ 2.9k

cat file.fa |awk '{if(NR%2==0){print length($0)}}' |sort |uniq -c

ADD COMMENT • link 11.0 years ago by Gabriel R. ★ 2.9k