Binning fasta sequences by size?
7.0 years ago
stacy734 ▴ 40

Hi everyone,

I have a large fasta file and need to bin it by size: everything 200 nt and up in one file, and everything 199 or smaller in another. I have found some useful scripts that will remove the smaller sequences, but they are discarded and I need to keep them in a separate file.

Can anyone suggest a perl script (or anything similar) that will sort out sequences into two files by size cutoff?

Thanks in advance for any advice.

Stacy

fasta • 3.4k views

Thanks to you both!


If an answer was helpful, you should upvote it; if it resolved your question, you should mark it as accepted.

7.0 years ago
st.ph.n ★ 2.7k

Linearize the FASTA first if sequences span multiple lines between headers; the script below assumes each sequence sits on a single line.

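#!/usr/bin/env python

import sys

# Output files must be opened for writing; open() defaults to read mode.
lt200 = open(sys.argv[2], 'w')  # sequences shorter than 200 nt
gt200 = open(sys.argv[3], 'w')  # sequences of 200 nt or longer

with open(sys.argv[1], 'r') as f:
    for line in f:
        if line.startswith(">"):
            header = line.strip()
            # Assumes linearized FASTA: the whole sequence is on the next line.
            seq = next(f).strip()
            # End each record with a newline so records do not run together.
            if len(seq) >= 200:
                gt200.write(header + '\n' + seq + '\n')
            else:
                lt200.write(header + '\n' + seq + '\n')

lt200.close()
gt200.close()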

Save as bin_by_len.py and run as: python bin_by_len.py input.fasta lt200.fasta gt200.fasta
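If linearizing first is inconvenient, the same binning can be done on wrapped FASTA by joining the sequence lines of each record while reading. This is only a minimal sketch, not part of the script above: the names read_fasta and bin_by_length are made up for illustration, and the 200 nt cutoff is taken from the question.

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None and line:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


def bin_by_length(in_path, short_path, long_path, cutoff=200):
    """Write records shorter than cutoff to short_path, all others to long_path."""
    with open(in_path) as src, \
            open(short_path, "w") as short_out, \
            open(long_path, "w") as long_out:
        for header, seq in read_fasta(src):
            out = long_out if len(seq) >= cutoff else short_out
            out.write(header + "\n" + seq + "\n")


if __name__ == "__main__" and len(sys.argv) == 4:
    # Same argument order as above: input, short output, long output.
    bin_by_length(sys.argv[1], sys.argv[2], sys.argv[3])
```

The output is written one sequence per line, so the result is also linearized.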

7.0 years ago
GenoMax 148k

Use reformat.sh from the BBMap suite.

reformat.sh in=your_file.fa minlen=200 out=more_than_200.fa
reformat.sh in=your_file.fa maxlen=199 out=less_than_200.fa
7.0 years ago

Linearize and dispatch with awk:

rm -f file1.fa file2.fa
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa   |\
awk -F '\t' '{out=length($2)<200?"file1.fa":"file2.fa";printf("%s\n%s\n",$1,$2) >> out;}'
7.0 years ago

The faidx utility in pyfaidx has this capability built-in:

$ pip install pyfaidx
$ faidx --size-range 1,199 file.fa > smalls.fa
$ faidx --size-range 200,200000000 file.fa > bigs.fa

An advantage of this approach is that the sequence sizes are read from the .fai index file instead of being recomputed from the sequences on every run, so for lots of bins this will be much faster.
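To illustrate why the index helps, here is a rough sketch that assigns sequence names to length bins by reading only the .fai file. It assumes the usual tab-separated .fai layout where column 1 is the sequence name and column 2 is its length; it does not use the pyfaidx API, and the function name bin_names_from_fai and the bin edges are hypothetical.

```python
import bisect
from collections import defaultdict


def bin_names_from_fai(fai_path, edges):
    """Group sequence names by length bin using only the .fai index.

    edges is a sorted list of bin lower bounds, e.g. [0, 200, 1000].
    A sequence of length n lands in the bin with the largest lower bound <= n.
    """
    bins = defaultdict(list)
    with open(fai_path) as fai:
        for line in fai:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            name, length = fields[0], int(fields[1])
            idx = bisect.bisect_right(edges, length) - 1
            if idx >= 0:
                bins[edges[idx]].append(name)
    return dict(bins)
```

With edges [0, 200], the names under key 0 are the sequences of 199 nt or less and the names under key 200 are everything else; no sequence data is read at any point.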
