Split multi fasta based on line header
1
0
Entering edit mode
5.4 years ago

Hello,

I have a multifasta containing several gene segments from multiple organisms. The fasta is formatted as so:

>OrganismA_GeneSegment1

ATG..

>OrganismA_GeneSegment2

ATG...

>OrganismB_GeneSegment1

ATG...

>OrganismB_GeneSegment2

ATG....

I would like to split this into several sub-multi fasta files, each containing all the segments from each organism and named after that organism. Also, preferably sorted in length descending...

Any suggestions appreciated!

Thanks.

sequence split fasta edit fasta order fasta • 3.6k views
ADD COMMENT
2
Entering edit mode

This awk script could work but does not sort by length.

awk '{if(substr($0,1,1) == ">"){split(substr($0,2,length($0)),a,/_/);filename=a[1]};print $0 > filename }' your_input_file.fa

ADD REPLY
0
Entering edit mode

@microfuge I find your suggestion very useful, but the files are saved without format specification. I´m new to awk and don´t get where to place the .fa in the script?

ADD REPLY
0
Entering edit mode

One idea would be to recover the headers (grep "^>" yourfile > headers). Split the headers into organism specific sub-files. Then use faSomeRecords utility from Jim Kent (UCSC) to get the data separated. Add execute permissions after downloading the utility (chmod a+x faSomeRecords).

ADD REPLY
0
Entering edit mode

USEARCH has couple of options to sort fasta sequences by length.

ADD REPLY
1
Entering edit mode
5.4 years ago

If you can code in Python, then your easiest shot is to use Biopython. Something like this:

#!/usr/bin/env python

from Bio import SeqIO
from operator import itemgetter

Separated_reads = {}

input_file = "<put file="" here="">"
INPUT = open(input_file, "r")

for line in SeqIO.parse(INPUT, "fasta"):

    read_name = strrecord.id)
    organism = read_name.split("_")[0]
    read_seq = str(record.seq)

    try:
        Separated_reads[organism].append((len(read_seq), ">" + read_name + "\n" + read_seq + "\n"))
    except:
        Separated_reads[organism] = [(len(read_seq), ">" + read_name + "\n" + read_seq + "\n")]

INPUT.close()

Sorted_reads = { organism : sorted(Separated_reads[organism], key=itemgetter(0), reverse=True) for organism in Separated_reads.keys() }

for organism in Sorted_reads.keys():
    OUTPUT = open("{0}.fa".format(organism), "w")

    for x in Sorted_reads[organism]:
        gene_segment = x[1]
        OUTPUT.write(gene_segment)

    OUTPUT.close()
ADD COMMENT

Login before adding your answer.

Traffic: 1032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6