That's not that hard by matching the SNPs and flanking sequences using a regular expression.
To make it easy to process, let's linearize the FASTA sequences,
i.e. converting to tabular format. For example:
$ seqkit fx2tab seqs.fa
2|28804|28804|rs544393009|dbSNP|C/T TGCAT_TCAGC
Hereby, we can follow the original answer with little modification.
Note that the 1 or 2 before allels are used to keeping order, which will be
removed in the end.
# FASTA -> tabular
# replace 'C/T\tTGCAT_TCAGC' with '1C\tTGCATCTCAGC'
# tabular -> FASTA
$ seqkit fx2tab seqs.fa | \
perl -pe 's/(\w+)\/(\w+)\t(\w+)_(\w+)/1$1\t$3$1$4/' | \
seqkit tab2fx > seqs.allel1.fa
# FASTA -> tabular
# replace 'C/T\tTGCAT_TCAGC' with '2T TGCATTTCAGC'
# tabular -> FASTA
$ seqkit fx2tab seqs.fa | \
perl -pe 's/(\w+)\/(\w+)\t(\w+)_(\w+)/2$2\t$3$2$4/' | \
seqkit tab2fx > seqs.allel2.fa
Then sort records in the two files by sequence ID in alphabetical order
# sort by ID
# remove the 1/2 before alleles
$ seqkit sort seqs.allel*.fa |\
perl -pe 's/\|\d(\w+)$/\|$1/'
>2|28396|28396|rs781239227|dbSNP|T
AATGCTCTTGG
>2|28396|28396|rs781239227|dbSNP|C
AATGCCCTTGG
>2|28804|28804|rs544393009|dbSNP|C
TGCATCTCAGC
>2|28804|28804|rs544393009|dbSNP|T
TGCATTTCAGC
>2|29358|29358|rs943169482|dbSNP|G
CCCAAGAAGCC
>2|29358|29358|rs943169482|dbSNP|GAAAA
CCCAAGAAAAAAGCC
Original answer:
That's not that hard by matching the SNPs and flanking sequences using a regular expression.
# replace 'C/T TGCAT_TCAGC' with 'C TGCATCTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/$1 $3$1$4/' seqs.fa > seqs.allel1.fa
# replace 'C/T TGCAT_TCAGC' with 'T TGCATTTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/$2 $3$2$4/' seqs.fa > seqs.allel2.fa
If the order between two allels is not important, just sort them by sequence ID in alphabetical order:
$ seqkit sort seqs.allel*.fa
>2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
ACTGN
>2|28804|28804|rs544393009|dbSNP|T TGCATTTCAGC
ACTGN
>2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
actgn
>2|29358|29358|rs943169482|dbSNP|GAAAA CCCAAGAAAAAAGCC
actgn
If the order is important, it can be done with a trick.
# replace 'C/T TGCAT_TCAGC' with '1C TGCATCTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/1$1 $3$1$4/' seqs.fa > seqs.allel1.fa
# replace 'C/T TGCAT_TCAGC' with '2T TGCATTTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/2$2 $3$2$4/' seqs.fa > seqs.allel2.fa
# remove the 1/2 before allels after sort
$ seqkit sort seqs.allel*.fa | perl -pe 's/\|\d(\w+) /\|$1 /'
>2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
ACTGN
>2|28804|28804|rs544393009|dbSNP|T TGCATTTCAGC
ACTGN
>2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
actgn
>2|29358|29358|rs943169482|dbSNP|GAAAA CCCAAGAAAAAAGCC
actgn
Please reformat the sequences using
101010
button. It confused others!!!Are these the original sequences?
Unformated FASTA will be shown as bellow:
Or, I guess you find they were mixed in one line, so you manually press ENTER between sequences, so it looks like this:
Thank you shenwei. I already edit my questions. Sorry about my careless.
Thanks Wei Shen; Thanks affaid_tyj. I will add a new answer to my current answer shortly.