Question

Writing lengths of fasta sequences to another file

19 months ago

Hau ▴ 10

I currently have a script that reads a FASTA file and prints the lengths of its sequences. However, I'm having difficulty adapting it to both iterate over a large number of files using the os module (it seems to stop after one file for some reason) and write the output to corresponding text files. Could anyone kindly assist with this issue?

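header = None  # name of the record currently being read
length = 0     # running sequence length for that record
with open('x.ffn') as input_file:
    for line in input_file:
        line = line.rstrip()
        if line.startswith('>'):
            # A new record starts: report the previous one, if any.
            if header is not None:
                print(header, length)
            length = 0
            header = line[1:]
        else:
            length += len(line)

# Report the last record, even if its sequence is empty.
if header is not None:
    print(header, length)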

sequencing Python Fasta • 625 views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 19 months ago by Hau ▴ 10

As Corentin mentioned, samtools faidx is the simplest and fastest way to get this information.

You could also use bioawk like so:

cat refs/AF086833.fa | bioawk -c fastx  ' { print $name, length($seq) }'

prints:

AF086833.2  18959

As for your program, you should say what the error is, and what you mean when you say it "stops for some reason".
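
Without seeing the looping code it is hard to say why it stops after one file, but here is a minimal sketch of one way to wrap the script from the question in a loop over many files. It uses pathlib rather than os, which is a swapped-in choice and not what the question uses. The function names, the '*.ffn' pattern (taken from the 'x.ffn' file name in the question) and the '.lengths.txt' output suffix are all assumptions made for illustration.

```python
from pathlib import Path


def fasta_lengths(lines):
    """Yield (header, length) for each record in an iterable of FASTA lines."""
    header = None
    length = 0
    for line in lines:
        line = line.rstrip()
        if line.startswith('>'):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, length
            header = line[1:]
            length = 0
        else:
            length += len(line)
    # Emit the final record, even if its sequence is empty.
    if header is not None:
        yield header, length


def write_lengths_for_directory(input_dir, pattern='*.ffn', suffix='.lengths.txt'):
    """Write one tab-separated lengths file next to each matching FASTA file."""
    written = []
    for fasta_path in sorted(Path(input_dir).glob(pattern)):
        # 'x.ffn' -> 'x.lengths.txt' (the output suffix is an assumption).
        out_path = fasta_path.with_suffix(suffix)
        with open(fasta_path) as fasta, open(out_path, 'w') as out:
            for header, length in fasta_lengths(fasta):
                out.write(f'{header}\t{length}\n')
        written.append(out_path)
    return written
```

Calling write_lengths_for_directory('genomes') (where 'genomes' is a hypothetical directory name) would then create one output file per input file. Opening each output inside the loop, with a name derived from the input, keeps one file from overwriting another.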

ADD REPLY • link 19 months ago by Istvan Albert 102k

score 1 · Answer 1 · 2023-04-19

19 months ago

Corentin ▴ 610

Do you have to use a script to do that? If you generate an index for your FASTA files with samtools faidx, the second column of the index will contain each sequence's length. See here for more details about the ".fai" format: http://www.htslib.org/doc/faidx.html
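
If you still want the lengths inside Python afterwards, a minimal sketch for reading them back from the index could look like this. It relies only on the ".fai" file being tab-separated with the sequence name in the first column and the length in the second; the function name fai_lengths is an assumption made for illustration.

```python
def fai_lengths(lines):
    """Yield (name, length) from the lines of a samtools .fai index."""
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Column 1 is the sequence name, column 2 its length in bases.
        yield fields[0], int(fields[1])
```

For example, after running samtools faidx on x.ffn, you could open the resulting x.ffn.fai file and pass the file object straight to fai_lengths.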

ADD COMMENT • link 19 months ago by Corentin ▴ 610