Uncollapsing FASTA file
Hi,
I have a FASTA file whose sequence names carry a copy number after _x:
>name_x999999
The sequences were collapsed with a tool, probably FASTX, but I cannot find a tool or script for uncollapsing such files.
Another awk solution (assuming no linebreaks in sequences):
awk -F'_x' '/^>/{name=$1; n=$2; getline seq; for(i=1;i<=n;i++) print name"\n"seq}' file
using awk
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |\
awk -F '[_\t]' '{T= int(substr($2,2));for(i=1;i<=T;i++) {printf("%s\n%s\n",$1,$3);}}'
The first awk linearizes each sequence onto one line; the second extracts the copy number after 'x' and prints the record that many times.
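The two steps can also be folded into a single awk pass. The following is only a sketch under a few assumptions: the copy number is the run of digits after the last "_x" at the very end of the header, a header without that suffix counts as one copy, and the function name uncollapse is made up for illustration (it is not part of FASTX or any other tool).

```shell
# uncollapse: hypothetical helper name, not part of any existing tool.
# Prints every record of a collapsed FASTA file N times, where N is the
# number after the last "_x" at the end of the header line.
uncollapse() {
    awk '
        # Print the buffered record n times (i is a local variable).
        function flush_record(   i) {
            if (name != "")
                for (i = 1; i <= n; i++) print name "\n" seq
        }
        /^>/ {
            flush_record()
            # Split the header at the trailing "_x<digits>" suffix.
            if (match($0, /_x[0-9]+$/)) {
                name = substr($0, 1, RSTART - 1)
                n = substr($0, RSTART + 2) + 0
            } else {
                name = $0; n = 1   # assumption: no suffix means one copy
            }
            seq = ""
            next
        }
        { seq = seq $0 }           # join wrapped sequence lines
        END { flush_record() }
    ' "$1"
}
```

Usage would be something like: uncollapse input.fa > output.fa. Because the suffix is matched only at the end of the header, names that contain other underscores should not confuse it, and wrapped sequences are written back on a single line.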
using seqkit and bash:
$ for i in $(grep ">" test.fa); do grep $i test.fa -A 1 | seqkit replace -p "_.+" | seqkit dup -n ${i#*s*x}; done > test.out.fa
output:
$ grep seq1 test.out.fa | wc -l
2929
$ grep seq2 test.out.fa | wc -l
34
$ grep seq3 test.out.fa | wc -l
100
$ grep 'GAAAAATAAAAATAA' test.out.fa | wc -l
100
input test.fa:
$ cat test.fa
>seq1_x2929
GAGATAGAGATAGAAGAGT
>seq2_x34
GAGAGAAAA
>seq3_x100
GAAAAATAAAAATAA
Assumptions:
1) sequences are linearized
2) Copy numbers (e.g. 2929 in seq1) are always preceded by _x, and headers start with "s"
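The per-sequence grep counts above can be generalized into one check: the number of records in the uncollapsed file should equal the sum of all copy numbers in the collapsed file. A rough sketch, assuming the copy number is the digits after a trailing "_x" in each header; the function name check_uncollapsed is hypothetical.

```shell
# check_uncollapsed: hypothetical helper, the name is illustrative only.
# $1 = collapsed FASTA, $2 = uncollapsed FASTA.
check_uncollapsed() {
    # Sum the copy numbers found after the trailing "_x" in each header.
    expected=$(awk '/^>/ && match($0, /_x[0-9]+$/) { s += substr($0, RSTART + 2) }
                    END { printf "%d\n", s }' "$1")
    # Count the header lines in the uncollapsed file.
    actual=$(grep -c '^>' "$2" || true)
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $actual records"
    else
        echo "MISMATCH: expected $expected, found $actual"
        return 1
    fi
}
```

For the example above, check_uncollapsed test.fa test.out.fa should report 3063 records (2929 + 34 + 100) if the uncollapsing worked.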
What do you mean by 'uncollapsing'? What are the input and the output?
The sequences were collapsed with a tool similar to FASTX: identical sequences are represented as one record, with the copy number after a _ symbol or _x in the name.
>seq1_x2929
GAGATAGAGATAGAAGAGT
>seq2_x34
GAGAGAAAA
>seq3_x100
GAAAAATAAAAATAA
I'm sorry, you're describing your input, but I still don't understand what the desired output is.
If the sequences have been collapsed, there is no way to regenerate the original data (unless you are referring to regenerating 100 identical copies of seq3 in the example above).
Yes, that's exactly what I want :)