Question

Uncollapsing FASTA file

0

7.8 years ago

manekineko ▴ 150

Hi, I have a FASTA file in which each sequence name carries a copy number after _x:

>name_x999999

The sequences were collapsed with a tool, probably FASTX, but I cannot find a tool or script for uncollapsing such files.

uncollapsing • 2.5k views

ADD COMMENT • link updated 7.8 years ago by cpad0112 21k • written 7.8 years ago by manekineko ▴ 150

0

What do you mean by 'uncollapsing'? Can you show the input and the expected output?

ADD REPLY • link 7.8 years ago by Pierre Lindenbaum 166k

0

The sequences were collapsed with a tool similar to FASTX. That means similar sequences are represented as one record, with the copy number after _x in the name:

>seq1_x2929
GAGATAGAGATAGAAGAGT
>seq2_x34
GAGAGAAAA
>seq3_x100
GAAAAATAAAAATAA

ADD REPLY • link 7.8 years ago by manekineko ▴ 150

0

I'm sorry, you're describing your input, but I still don't understand what the desired output is.

ADD REPLY • link 7.8 years ago by Pierre Lindenbaum 166k

0

If the sequences have been collapsed, there is no way to regenerate the original data (unless you mean regenerating 100 identical copies of seq3 in the example above).

ADD REPLY • link 7.8 years ago by GenoMax 153k

0

Yes, that's exactly what I want :)

ADD REPLY • link 7.8 years ago by manekineko ▴ 150

0

7.8 years ago

Pierre Lindenbaum 166k

using awk

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fa |\
awk -F '[_\t]' '{T= int(substr($2,2));for(i=1;i<=T;i++) {printf("%s\n%s\n",$1,$3);}}'

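The first awk linearizes each record (header, a tab, then the sequence on one line); the second extracts the copy number and prints the record that many times. Note that the second awk splits on every underscore, so it assumes the name itself contains none.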
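The two steps above can also be folded into one awk program. The following is only a sketch under stated assumptions: every header ends in `_x<count>`, and `uncollapse` and `emit` are hypothetical names chosen for illustration. Unlike splitting on every underscore, it takes the count from the last `_x` in the header, so names containing underscores survive, and wrapped sequence lines are joined before printing.

```shell
# Hypothetical helper: expand each collapsed record into <count> identical copies.
# Assumes every header ends in "_x<count>"; a header without it is printed 0 times.
uncollapse() {
  awk '
    function emit(   i) {               # print the buffered record count times
      for (i = 1; i <= count; i++) printf("%s\n%s\n", name, seq)
    }
    /^>/ {
      if (name != "") emit()            # flush the previous record first
      name = $0
      sub(/[ \t]+$/, "", name)          # tolerate trailing whitespace in headers
      count = name
      sub(/^.*_x/, "", count)           # keep what follows the last "_x"
      count = int(count)
      sub(/_x[0-9]+$/, "", name)        # drop the copy-number suffix from the name
      seq = ""
      next
    }
    { seq = seq $0 }                    # join wrapped sequence lines
    END { if (name != "") emit() }
  ' "$@"
}

# Demo on a tiny inline FASTA; record "b" is wrapped over two lines.
printf '>a_1_x2\nACGT\n>b_x1\nAC\nGT\n' | uncollapse
```

With a real file the call would be `uncollapse input.fa > uncollapsed.fa`. Large copy numbers produce correspondingly large output, so it may be worth checking disk space first.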

ADD COMMENT • link 7.8 years ago by Pierre Lindenbaum 166k

0

7.8 years ago

cpad0112 21k

using seqkit and bash:

$ for i in $(grep ">" test.fa); do  grep $i test.fa -A 1 | seqkit replace -p "_.+" | seqkit dup -n ${i#*s*x}; done > test.out.fa

output:

$ grep seq1 test.out.fa | wc -l
2929
$ grep seq2 test.out.fa | wc -l
34
$ grep seq3 test.out.fa | wc -l
100
$ grep 'GAAAAATAAAAATAA' test.out.fa  | wc -l
100

input test.fa:

$ cat test.fa
>seq1_x2929 
GAGATAGAGATAGAAGAGT
>seq2_x34 
GAGAGAAAA
>seq3_x100 
GAAAAATAAAAATAA

Assumptions: 1) sequences are linearized (one line per sequence); 2) every copy number (e.g. 2929 in seq1) is preceded by _x, and headers start with "s".
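The "s" requirement comes from the `${i#*s*x}` expansion, which needs an "s" somewhere before the "x". One possible way to relax it, sketched here on the assumption that the count always follows the last `_x` in the header, is to strip the longest prefix ending in `_x` instead. `copy_number` is a hypothetical helper name, not part of seqkit.

```shell
# Hypothetical helper: read the copy number from a header such as ">seq1_x2929".
# Assumes the count is whatever follows the last "_x" in the header.
copy_number() {
  printf '%s\n' "${1##*_x}"    # "##" removes the longest prefix matching *_x
}

copy_number ">seq1_x2929"      # prints 2929
copy_number ">my_sample_x34"   # prints 34
```

In the loop above, `${i##*_x}` could then stand in for `${i#*s*x}` as the value passed to `seqkit dup -n`.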

ADD COMMENT • link 7.8 years ago by cpad0112 21k

score 1 · Accepted Answer · 2017-11-02

1

7.8 years ago

5heikki 11k

Another awk solution (assuming no line breaks within sequences):

awk -F'_x' '/^>/{name=$1; n=$2; next} {for(i=1;i<=n;i++) print name"\n"$0}' file
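Whichever approach is used, a quick sanity check is to compare the sum of the copy numbers in the collapsed file with the number of records in the uncollapsed one. A minimal sketch, assuming the `_x<count>` header convention from the question; `expected_records` and `actual_records` are hypothetical helper names.

```shell
# Hypothetical helpers for checking an uncollapsed file against its source.
expected_records() {   # sum of the _x copy numbers in the collapsed FASTA
  awk -F'_x' '/^>/ { total += $2 } END { print total + 0 }' "$@"
}
actual_records() {     # number of header lines in the uncollapsed FASTA
  awk '/^>/ { n++ } END { print n + 0 }' "$@"
}

# Demo with the example from the question, read from standard input.
printf '>seq1_x2929\nGAGATAGAGATAGAAGAGT\n>seq2_x34\nGAGAGAAAA\n>seq3_x100\nGAAAAATAAAAATAA\n' \
  | expected_records   # prints 3063
```

For the example in this thread, `expected_records test.fa` and `actual_records test.out.fa` should both report 3063 (2929 + 34 + 100).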