Writing lengths of fasta sequences to another file
1
1
Entering edit mode
19 months ago
Hau ▴ 10

I currently have a script that can read a Fasta file, and print out the lengths of the constituent sequences. However, I'm having difficulties adapting it to both iteratively process a large quantity of files using os (specifically, it seems to stop at one file for some reason), and write the output to corresponding text files. Could anyone kindly assist with this issue?

header = None
length = 0
with open('x.ffn') as input_file:
    for line in input_file:
        line = line.rstrip()
        if line.startswith('>'):
            if header is not None:
                print(header, length)
                length = 0
            header = line[1:]
        else:
            length += len(line)

if length:
    print(header, length)
sequencing Python Fasta • 620 views
ADD COMMENT
1
Entering edit mode

as Corentin mentioned samtools faidx is the simplest and fastest method to get the information,

you could also use bioawk like so:

cat refs/AF086833.fa | bioawk -c fastx  ' { print $name, length($seq) }'

prints:

AF086833.2  18959

as for your program, you should mention what the error is and what does it mean that it "stops for some reason"?

ADD REPLY
1
Entering edit mode
19 months ago
Corentin ▴ 610

Do you have to use a script to do that? If you generate an index for your fasta files with samtools faidx the second column will contain each sequence length. See here for more details about the ".fai" format: http://www.htslib.org/doc/faidx.html

ADD COMMENT

Login before adding your answer.

Traffic: 2894 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6