Question

Replace _ in sequences with variant allele in fasta header

1

Entering edit mode

7.7 years ago

Yijun Tian ▴ 20

Hello everyone,

I got a few flanking sequences around a bunch of SNPs. Since they were export from biomart, the variant allele was like a _:

>2|28804|28804|rs544393009|dbSNP|C/T
  TGCAT_TCAGC
>2|29358|29358|rs943169482|dbSNP|G/GAAAA
  CCCAA_AAGCC
>2|28396|28396|rs781239227|dbSNP|T/C
  AATGC_CTTGG

Can some experts help me to replace the _ in each fasta sequence with its header description? That is each fasta input will be turned in 2 or 3 flanking sequence:

>2|28804|28804|rs544393009|dbSNP|C
  TGCATCTCAGC
>2|28804|28804|rs544393009|dbSNP|T
  TGCATTTCAGC

And that is what I am desperately seeking for!!!

Thanks in advance!

genome • 2.7k views

ADD COMMENT • link 7.7 years ago by Yijun Tian ▴ 20

1

Entering edit mode

Please reformat the sequences using 101010 button. It confused others!!!

Are these the original sequences?

>2|28804|28804|rs544393009|dbSNP|C/T
TGCAT_TCAGC
>2|29358|29358|rs943169482|dbSNP|G/GAAAA
CCCAA_AAGCC
>2|28396|28396|rs781239227|dbSNP|T/C
AATGC_CTTGG

Unformated FASTA will be shown as bellow:

2|28804|28804|rs544393009|dbSNP|C/T TGCAT_TCAGC 2|29358|29358|rs943169482|dbSNP|G/GAAAA CCCAA_AAGCC 2|28396|28396|rs781239227|dbSNP|T/C AATGC_CTTGG

Or, I guess you find they were mixed in one line, so you manually press ENTER between sequences, so it looks like this:

2|28804|28804|rs544393009|dbSNP|C/T TGCAT_TCAGC

2|29358|29358|rs943169482|dbSNP|G/GAAAA CCCAA_AAGCC

2|28396|28396|rs781239227|dbSNP|T/C AATGC_CTTGG

ADD REPLY • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

Thank you shenwei. I already edit my questions. Sorry about my careless.

ADD REPLY • link 7.7 years ago by Yijun Tian ▴ 20

0

Entering edit mode

Thanks Wei Shen; Thanks affaid_tyj. I will add a new answer to my current answer shortly.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

score 2 · Accepted Answer · 2017-09-15

2

Entering edit mode

7.7 years ago

Kevin Blighe 89k

This was somewhat difficult and posed a challenge. Originally I started with just sed and awk but it got very complex with them.

Nevertheless, if the format of your input file is VERY consistent, the following should work (assuming your input is in snps.txt):

paste <(cut -f1 -d_ snps.txt  | sed 's/\/[ATGC]* / /g') <(sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.txt | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.txt) -d ""

2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
2|28396|28396|rs781239227|dbSNP|T AATGCTCTTGG

...or, to break it down a bit:

sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.txt | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g' > RefAlleles.list
paste <(cut -f1 -d_ snps.txt  | sed 's/\/[ATGC]* / /g') <(cat RefAlleles.list) <(cut -f2 -d_ snps.txt) -d ""

ADD COMMENT • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

How is the code going to give for the right side of the / ? Since for some indels, just like sequence 2, the 2 alleles are like G vs GAAA .

ADD REPLY • link 7.7 years ago by Yijun Tian ▴ 20

0

Entering edit mode

Apologies, I have just considered the left side of the forward slash ('/'). Maybe this is a better answer:

Assuming your data is in snps.txt:

2|28804|28804|rs544393009|dbSNP|C/T TGCAT_TCAGC
2|29358|29358|rs943169482|dbSNP|G/GAAAA CCCAA_AAGCC
2|28396|28396|rs781239227|dbSNP|T/C AATGC_CTTGG


paste <(cut -f1 -d_ snps.txt | sed 's/\/[ATGC]* / /g') <(sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.txt | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.txt) -d "" | sed 's/^/>/g' | sed 's/ /\n/g'
>2|28804|28804|rs544393009|dbSNP|C
TGCATCTCAGC
>2|29358|29358|rs943169482|dbSNP|G
CCCAAGAAGCC
>2|28396|28396|rs781239227|dbSNP|T
AATGCTCTTGG


paste <(cut -f1 -d"_" snps.txt | sed 's/[ATGC]*\///g') <(sed 's/[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|[ATGC]*\///g' snps.txt | sed 's/ [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.txt) -d "" | sed 's/^/>/g' | sed 's/ /\n/g'
>2|28804|28804|rs544393009|dbSNP|T
TGCATTTCAGC
>2|29358|29358|rs943169482|dbSNP|GAAAA
CCCAAGAAAAAAGCC
>2|28396|28396|rs781239227|dbSNP|C
AATGCCCTTGG

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

If the input is definitely FASTA, like this:

>2|28804|28804|rs544393009|dbSNP|C/T
  TGCAT_TCAGC
>2|29358|29358|rs943169482|dbSNP|G/GAAAA
  CCCAA_AAGCC
>2|28396|28396|rs781239227|dbSNP|T/C
  AATGC_CTTGG

Here is a fix to my original code (note that it assumes that that the sequences are short and fit on a single line:

First I crete a temporary file where I get the sequences and headers on 1 line again:

paste -s -d' \n' snps.fasta > snps.fasta.tmp

Then:

paste <(cut -f1 -d_ snps.fasta.tmp | sed 's/\/[ATGC]* / /g') <(sed 's/>[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|//g' snps.fasta.tmp | sed 's/\/[ATGC]* [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.fasta.tmp) -d "" | sed 's/^//g' | sed 's/ /\n/g' > snps.allele1.fasta

cat snps.allele1.fasta

>2|28804|28804|rs544393009|dbSNP|C
TGCATCTCAGC
>2|29358|29358|rs943169482|dbSNP|G
CCCAAGAAGCC
>2|28396|28396|rs781239227|dbSNP|T
AATGCTCTTGG

paste <(cut -f1 -d"_" snps.fasta.tmp | sed 's/[ATGC]*\///g') <(sed 's/>[0-9XY]*|[0-9]*|[0-9]*|rs[0-9]*|dbSNP|[ATGC]*\///g' snps.fasta.tmp | sed 's/ [ATGC]*_[ATGC]*//g') <(cut -f2 -d_ snps.fasta.tmp) -d "" | sed 's/ /\n/g' > snps.allele2.fasta

cat snps.allele2.fasta

>2|28804|28804|rs544393009|dbSNP|T
TGCATTTCAGC
>2|29358|29358|rs943169482|dbSNP|GAAAA
CCCAAGAAAAAAGCC
>2|28396|28396|rs781239227|dbSNP|C
AATGCCCTTGG

Wei Sheng may have an easier solution with his seqkit program.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

score 2 · Accepted Answer · 2017-09-15

2

Entering edit mode

7.7 years ago

shenwei356 8.7k

That's not that hard by matching the SNPs and flanking sequences using a regular expression.

To make it easy to process, let's linearize the FASTA sequences, i.e. converting to tabular format. For example:

$ seqkit fx2tab seqs.fa 
2|28804|28804|rs544393009|dbSNP|C/T     TGCAT_TCAGC

Hereby, we can follow the original answer with little modification. Note that the 1 or 2 before allels are used to keeping order, which will be removed in the end.

# FASTA -> tabular
#     replace 'C/T\tTGCAT_TCAGC' with  '1C\tTGCATCTCAGC'
#     tabular -> FASTA
$ seqkit fx2tab seqs.fa | \
    perl -pe 's/(\w+)\/(\w+)\t(\w+)_(\w+)/1$1\t$3$1$4/' | \
    seqkit tab2fx > seqs.allel1.fa

# FASTA -> tabular
#     replace 'C/T\tTGCAT_TCAGC' with  '2T TGCATTTCAGC'
#     tabular -> FASTA
$ seqkit fx2tab seqs.fa | \
    perl -pe 's/(\w+)\/(\w+)\t(\w+)_(\w+)/2$2\t$3$2$4/' | \
    seqkit tab2fx > seqs.allel2.fa

Then sort records in the two files by sequence ID in alphabetical order

# sort by ID
# remove the 1/2 before alleles
$ seqkit sort seqs.allel*.fa |\
    perl -pe 's/\|\d(\w+)$/\|$1/'
>2|28396|28396|rs781239227|dbSNP|T
AATGCTCTTGG
>2|28396|28396|rs781239227|dbSNP|C
AATGCCCTTGG
>2|28804|28804|rs544393009|dbSNP|C
TGCATCTCAGC
>2|28804|28804|rs544393009|dbSNP|T
TGCATTTCAGC
>2|29358|29358|rs943169482|dbSNP|G
CCCAAGAAGCC
>2|29358|29358|rs943169482|dbSNP|GAAAA
CCCAAGAAAAAAGCC

Original answer:

That's not that hard by matching the SNPs and flanking sequences using a regular expression.

# replace 'C/T TGCAT_TCAGC' with  'C TGCATCTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/$1 $3$1$4/' seqs.fa > seqs.allel1.fa 
# replace 'C/T TGCAT_TCAGC' with  'T TGCATTTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/$2 $3$2$4/' seqs.fa > seqs.allel2.fa

If the order between two allels is not important, just sort them by sequence ID in alphabetical order:

$ seqkit sort seqs.allel*.fa
>2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
ACTGN
>2|28804|28804|rs544393009|dbSNP|T TGCATTTCAGC
ACTGN
>2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
actgn
>2|29358|29358|rs943169482|dbSNP|GAAAA CCCAAGAAAAAAGCC
actgn

If the order is important, it can be done with a trick.

# replace 'C/T TGCAT_TCAGC' with  '1C TGCATCTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/1$1 $3$1$4/' seqs.fa > seqs.allel1.fa 
# replace 'C/T TGCAT_TCAGC' with  '2T TGCATTTCAGC'
$ perl -pe 's/(\w+)\/(\w+) (\w+)_(\w+)/2$2 $3$2$4/' seqs.fa > seqs.allel2.fa

# remove the 1/2 before allels after sort
$ seqkit sort seqs.allel*.fa | perl -pe 's/\|\d(\w+) /\|$1 /'
>2|28804|28804|rs544393009|dbSNP|C TGCATCTCAGC
ACTGN
>2|28804|28804|rs544393009|dbSNP|T TGCATTTCAGC
ACTGN
>2|29358|29358|rs943169482|dbSNP|G CCCAAGAAGCC
actgn
>2|29358|29358|rs943169482|dbSNP|GAAAA CCCAAGAAAAAAGCC
actgn

ADD COMMENT • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

It is so wise to take my sequence as fasta format and considered the order between each flanking sequences. Thank you so much for your help!!!

ADD REPLY • link 7.7 years ago by Yijun Tian ▴ 20

0

Entering edit mode

If it helps, please upvote and accept this answer.

ADD REPLY • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

I don't believe that your result is what was requested.

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

1

Entering edit mode

Thanks for pointing out, I just notice that OP did not rightly format the code. Answer is updated.

ADD REPLY • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

Great to know you Wei Shen!

ADD REPLY • link 7.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Shenwei,

What if I want to replace _ with some other codes like - %? Can I still use w+ to represent them?

ADD REPLY • link 7.7 years ago by Yijun Tian ▴ 20

2

Entering edit mode

no, use .. you can "learn regular expression in 30 minutes".

by the way, use single back quote to quote inline code or special symbols.

ADD REPLY • link 7.7 years ago by shenwei356 8.7k

0

Entering edit mode

Hi Mr Shenwei,

I just want to say that seqkit is totally a miracle.

So powerful!

Thank you so much!

ADD REPLY • link 7.7 years ago by Yijun Tian ▴ 20