Script To Calculate Length Distribution Of A Collapsed Fasta File
2
0
Entering edit mode
11.0 years ago
liran0921 ▴ 150

Dear all,

I have a fasta file with the format as follows:

 >1-698675
TAATACTGCCTGGTAATGATGACT
 >2-69532
TACTGCCTGGTAATGA
 >3-6954
ATACTGCCTGGTAATGATGACT

The header >1-698675 indicates the read ID is 1 and the read count is 698675.

What i want to get is the length distribution of all the reads (eg, how many reads have the length of 17, 18,19, etc.). But as you see, the reads are collapsed so the available script I found online cannot apply to this fasta file. I appreciate if someone can provide the script to calculate the length distribution of this kind of fasta file.

Many thanks.

fasta length • 7.6k views
ADD COMMENT
1
Entering edit mode

What you mean by collapsed? Do you mean that the entire sequence is in one line ? Also, do you want to get the distribution of read count or the length of the reads. I am confused why you told us about read id and read count.

ADD REPLY
3
Entering edit mode
11.0 years ago

I'm assuming that these are miRNAs or some other small RNA and you want the distribution to account for the collapsing of identical reads (i.e., the 698675 reads of length 24 in the first entry should be added to however many copies of the next 24-length read comes next and so on). I also assume that the reads don't span lines, etc., etc.:

#!/usr/bin/env python
import sys
f = open(sys.argv[1], "r")

dists={}
while(1) :
    header=f.readline().rstrip()
    seq=f.readline().rstrip()
    if(header == "" or seq == "") :
        break

    offset = header.find("-")+1
    c = int(header[offset:])

    if(("%s" % len(seq)) not in dists) :
        dists.update({("%s" % len(seq)):c})
    else :
        dists[("%s" % len(seq))] += c

f.close()
for x in dists.keys() :
    print("%s: %i" % (x, dists[x]))

Then python foo.py some_file.fa if you saved that as "foo.py".

ADD COMMENT
1
Entering edit mode

Got it. Hasn't worked with miRNA for a while so i was kind of confused. Thanks.

ADD REPLY
0
Entering edit mode

Yeah, the original question wasn't exactly posed in a great way.

ADD REPLY
0
Entering edit mode

Thanks! That's exactly what I want.

ADD REPLY
1
Entering edit mode
11.0 years ago
Gabriel R. ★ 2.9k
cat file.fa |awk '{if(NR%2==0){print length($0)}}' |sort |uniq -c
ADD COMMENT

Login before adding your answer.

Traffic: 1888 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6