Question

Split fasta to multi fasta

0

Entering edit mode

7.8 years ago

felipelira3 • 0

I have a scaffold with N`s inside but I want to split it into separated contigs. The first reason is because I have N's and the other is because I have non-IUPAC characters into my sequence. Just trying I split by N's and eliminated the sequences smaller than 1.

Any suggestion using SeqIO or any other tool.

myfile.fasta

>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatcgatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca

my temporary output.fasta using this:

sed -i.bak 's/N/\n>N\n/g' myfile.fasta

>N
agtagatgatgatagatgatgatga
>N

>N

>N

>N

>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca

>N

>N

>N
tcgatcgatgtagctagctgacaNctagtcgatgca

Further I eliminate the NULL sequences or filter >500 to obtain a reasonable set of sequences.

>N
agtagatgatgatagatgatgatga
>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>N
tcgatcgatgtagctagctgacaNctagtcgatgca

The problem you can imagine. All sequences have the same names.

How to enumerate them in this order?

sequence Assembly • 3.5k views

ADD COMMENT • link updated 7.8 years ago by shenwei356 8.7k • written 7.8 years ago by felipelira3 • 0

0

Entering edit mode

's/N/\n>N\n/g' takes effect on the FASTA headers. So you have to discard seq names first. And you should use regular expression [Nn]+' instead ofN` for splitting.

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

score 1 · Answer 1 · 2017-04-03

1

Entering edit mode

7.8 years ago

Pierre Lindenbaum 164k

linearize and split with awk:

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
 awk -F '\t' '{out=($2 ~ /[nN]/?"file1.fa":"file2.fa"); printf("%s\n%s\n",$1,$2) >> out;}'

ADD COMMENT • link 7.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

When you wrote the lines your intention was:

... {printf("\n");}' input.fasta |\ awk -F '\t' ...

no? Thank you

ADD REPLY • link 7.8 years ago by felipelira3 • 0

0

Entering edit mode

no .

ADD REPLY • link 7.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

So there are two separate lines?

ADD REPLY • link 7.8 years ago by felipelira3 • 0

0

Entering edit mode

no .

ADD REPLY • link 7.8 years ago by Pierre Lindenbaum 164k

score 0 · Answer 2 · 2017-04-03

Since you've updated, and asked for any tool, here's another python script.

#!/usr/bin/env python
import sys

with open(sys.argv[1], 'r') as f:
        with open(sys.argv[2], 'w') as out:
                header = next(f).strip()
                myseq = ''.join([line.strip() for line in f]).split('N')
                for n in range(len(myseq)):
                        if 'N' not in myseq[n]:
                                if len(myseq[n]) > 1:
                                        out.write(header + '_' + str(n), '\n', myseq[n]

This takes the input file with a single sequence, grabs the header and sequence. The sequence is split into a list, by N chars. For each item in that list, N is not in that portion of the sequence, and the length of that portion is > 1, it prints out to the new fasta file, the header plus '_n', where n is the index of the portion in the list.

Save as split_fasta.py, run as python split_fasta.py infile.fasta outfile.fasta

score 0 · Answer 3 · 2017-04-04

0

Entering edit mode

7.8 years ago

5heikki 11k

Original (fails with multifasta):

awk 'BEGIN{RS="N"; OFS="\n"; i=1}{if(length($NF)>1){print ">seq_"i,$NF; i++}}' seq.fa

New version (should work with everything):

sed 's/^>.*/N/' seqs.fa | awk 'BEGIN{RS="N";OFS="\n";i=1}{if(length($0)>1){print ">seq_"i,$0; i++}}' | grep .

Doesn't matter if there are linebreaks in sequences..

ADD COMMENT • link 7.8 years ago by 5heikki 11k

0

Entering edit mode

This is cool, but there's a little bug?

$ cat seqs.fa 
>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatc
gatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
>another
aaaaaccccccNNNggggggggNttt

$ cat seqs.fa| awk 'BEGIN{RS="N"; OFS="\n"; i=1}{if(length($NF)>1){print ">seq_"i,$NF; i++}}'
>seq_1
agtagatgatgatagatgatgatga
>seq_2
gatcgtagctagca
>seq_3
tcgatcgatgtagctagctgaca
>seq_4
aaaaacccccc
>seq_5
gggggggg
>seq_6
ttt

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

1

Entering edit mode

This could work, but is not elegant at all

sed 's/^>.*/N/' seqs.fa | awk 'BEGIN{RS="N";OFS="\n";i=1}{if(length($0)>1){print ">seq_"i,$0; i++}}' | grep .

ADD REPLY • link 7.8 years ago by 5heikki 11k

0

Entering edit mode

Not just a little bug. It doesn't work at all. Lesson of the day: don't post answers while tired.

ADD REPLY • link 7.8 years ago by 5heikki 11k

1

Entering edit mode

Or don't post answers that were not tested.

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

0

Entering edit mode

It worked fine with OP's example data though

ADD REPLY • link 7.8 years ago by 5heikki 11k

0

Entering edit mode

oh, I see I see.

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

score 0 · Answer 4 · 2017-04-04

shell + seqkit:

Steps:

format the FASTA in single line: seqkit seq -s -w 0
replace Ns with \n: sed -r 's/[Nn]+/\n/g'
add number for every line, and convert tab-delimited format to FASTA: cat -n | seqkit tab2fx
rename the FASTA header in format of contig_n: seqkit replace -p '.+' -r 'contig_{nr}'

Sample data:

$ cat seqs.fa 
>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatc
gatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
>another
aaaaaccccccNNNggggggggNttt

Commands:

$ cat seqs.fa | seqkit seq -s -w 0 \
    | sed -r 's/[Nn]+/\n/g' \
    | cat -n | seqkit tab2fx \
    | seqkit replace -p '.+' -r 'contig_{nr}'

>contig_1
agtagatgatgatagatgatgatga
>contig_2
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>contig_3
tcgatcgatgtagctagctgaca
>contig_4
ctagtcgatgca
>contig_5
aaaaacccccc
>contig_6
gggggggg
>contig_7
ttt

Ways to filter by length:

seqkit seq --min-len 100 --max-len 1000