Uncollapsing FASTA file
7.1 years ago
manekineko ▴ 150

Hi, I have a FASTA file in which each sequence name ends with a copy number after _x:

>name_x999999

The sequences were collapsed, probably with a tool such as FASTX, but I cannot find a tool or script for uncollapsing such files.

uncollapsing • 2.1k views

What do you mean by 'uncollapsing'? What are the input and the expected output?


The sequences were collapsed with a tool similar to FASTX, meaning that identical sequences are represented once, with the copy number appended to the name after _x:

>seq1_x2929
GAGATAGAGATAGAAGAGT
>seq2_x34
GAGAGAAAA
>seq3_x100
GAAAAATAAAAATAA


I'm sorry, you're describing your input, but I still don't understand what the desired output is.


If the sequences have been collapsed, there is no way to regenerate the original data (unless you mean regenerating 100 identical copies of seq3 in the example above).


Yes, that's exactly what I want :)

7.1 years ago
5heikki 11k

Another awk solution (assuming no linebreaks in sequences):

awk -F '_x' '/^>/ {name=$1; n=$2+0; getline seq; for(i=1;i<=n;i++) print name "\n" seq}' file
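
If the sequences are wrapped over several lines, the one-liner above no longer applies directly. A minimal sketch along the same lines, assuming the copy number is the only thing following _x in the header, could accumulate the sequence lines of each record before printing it. The sample input and the file names below are made up for illustration:

```shell
# Hypothetical sample: seq1 is wrapped over two lines (file names and data are made up)
tmpdir=$(mktemp -d)
printf '>seq1_x2\nGAGA\nTAGA\n>seq2_x3\nGAGAG\n' > "$tmpdir/collapsed.fa"

awk -F '_x' '
  # print the pending record n times
  function emit(   i) { for (i = 1; i <= n; i++) print name "\n" seq }
  # header line: emit the previous record, then store the new name and copy number
  /^>/ { if (name != "") emit(); name = $1; n = $2 + 0; seq = ""; next }
  # sequence line: append it to the current record
  { seq = seq $0 }
  # emit the last record
  END { if (name != "") emit() }
' "$tmpdir/collapsed.fa" > "$tmpdir/uncollapsed.fa"

uncollapsed=$(cat "$tmpdir/uncollapsed.fa")
rm -r "$tmpdir"
printf '%s\n' "$uncollapsed"
```

With this sample, seq1 should come out twice as a single joined line and seq2 three times.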

Many thanks! I'm testing the awk command and it seems to produce the files I want :)

7.1 years ago

Using awk:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}'  input.fa |\
awk -F '[_\t]' '{T= int(substr($2,2));for(i=1;i<=T;i++) {printf("%s\n%s\n",$1,$3);}}'

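The first awk linearizes each record into a single "header<TAB>sequence" line; the second extracts the copy number and prints the record that many times. Note that the second awk splits on every underscore, so it assumes the sequence name contains no underscore other than the one in _x.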
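
If the names themselves may contain underscores, one possible variation (a sketch, not tested on real data) keeps the same two-step pipeline but splits only on the tab and reads the count from the trailing _x<number> with match(). The sample file and the name my_seq are hypothetical:

```shell
# Hypothetical sample: a name with an extra underscore and a wrapped sequence
tmpdir=$(mktemp -d)
printf '>my_seq_x2\nGAGA\nTAGA\n>seq2_x3\nGAGAG\n' > "$tmpdir/input.fa"

# Step 1: linearize, one "header<TAB>sequence" line per record (same as the first awk above)
awk '/^>/ {printf("%s%s\t", (N>0 ? "\n" : ""), $0); N++; next} {printf("%s", $0)} END {printf("\n")}' \
  "$tmpdir/input.fa" > "$tmpdir/linear.tsv"

# Step 2: split on the tab only and take the count from the trailing _x<number>
awk -F '\t' 'match($1, /_x[0-9]+[ \t]*$/) {
  name = substr($1, 1, RSTART - 1)   # header without the _x<number> suffix
  n = substr($1, RSTART + 2) + 0     # copy number as a number
  for (i = 1; i <= n; i++) printf("%s\n%s\n", name, $2)
}' "$tmpdir/linear.tsv" > "$tmpdir/expanded.fa"

expanded=$(cat "$tmpdir/expanded.fa")
rm -r "$tmpdir"
printf '%s\n' "$expanded"
```

Headers without a trailing _x<number> are silently skipped in this sketch, which may or may not be what you want.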

7.1 years ago

using seqkit and bash:

$ for i in $(grep ">" test.fa); do  grep $i test.fa -A 1 | seqkit replace -p "_.+" | seqkit dup -n ${i#*s*x}; done > test.out.fa

output:

$ grep seq1 test.out.fa | wc -l
2929
$ grep seq2 test.out.fa | wc -l
34
$ grep seq3 test.out.fa | wc -l
100
$ grep 'GAAAAATAAAAATAA' test.out.fa  | wc -l
100

input test.fa:

$ cat test.fa
>seq1_x2929 
GAGATAGAGATAGAAGAGT
>seq2_x34 
GAGAGAAAA
>seq3_x100 
GAAAAATAAAAATAA

Assumptions: 1) sequences are linearized (one line per sequence); 2) every copy number (e.g. 2929 in seq1) is preceded by _x, and headers start with "s".
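
The per-sequence counts checked with grep above can also be verified in one go by comparing the sum of all copy numbers in the collapsed file with the number of records in the uncollapsed file. A rough sketch, using tiny made-up stand-ins for test.fa and test.out.fa:

```shell
# Tiny made-up stand-ins for the collapsed input and the uncollapsed output
tmpdir=$(mktemp -d)
printf '>seq1_x2\nGAGA\n>seq2_x3\nGAGAG\n' > "$tmpdir/test.fa"
printf '>seq1\nGAGA\n>seq1\nGAGA\n>seq2\nGAGAG\n>seq2\nGAGAG\n>seq2\nGAGAG\n' > "$tmpdir/test.out.fa"

# Sum the trailing _x<number> of every header in the collapsed file
expected=$(awk '/^>/ && match($0, /_x[0-9]+[ \t]*$/) {total += substr($0, RSTART + 2)} END {print total + 0}' "$tmpdir/test.fa")

# Count the records in the uncollapsed file
observed=$(grep -c '^>' "$tmpdir/test.out.fa")
rm -r "$tmpdir"

if [ "$expected" -eq "$observed" ]; then
  check="OK"
else
  check="MISMATCH"
fi
echo "$check: expected $expected records, found $observed"
```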
