Extract codon between two flanking sequences from FASTA

4.5 years ago

renyulb • 0

Hi all, given two 10 bp flanking sequences, I would like to extract the codon between them for every sample in a FASTA file. For example:

Flank1:
CAGGCATGCC
Flank2:
TCATCGCTGG

FASTA

>sample1
GCGCACCATGGTCAGGCATGCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
>sample2
GGCAGAACCCGCGCACCATGGTCAGGCATGCCACCTCATCGCTGGGCACAGCCCAGA
>sample3
GGCAGATTCCCCGCACCATGGTCAGGCATGCCACTTCATCGCTGGGCACA

Output

>sample1
TCC
>sample2
ACC
>sample3
ACT

I have already done the opposite, extracting flanking sequences based on coordinates, and I have extracted the sequence between two coordinates using bedtools getfasta, but I am struggling to extract based on flanking nucleotide sequences. Thanks for any help!


ADD COMMENT • link 4.5 years ago by renyulb • 0

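In Perl, with the sequence in $seq:

if ($seq =~ /CAGGCATGCC ([ACGT]{3}) TCATCGCTGG/x) { print "$1\n" }

The /x flag makes Perl ignore the spaces in the pattern, so $1 holds the three bases between the two flanks. The group is matched exactly once; with a trailing + it would be repeated and $1 would keep only the last triplet.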
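For comparison, the same pattern can be checked against the three example sequences from the question. This is a minimal Python sketch rather than part of the Perl answer: the function name `extract_codon` is made up for illustration, and it assumes exactly one codon (three bases) sits between the flanks.

```python
import re

FLANK1 = "CAGGCATGCC"
FLANK2 = "TCATCGCTGG"

# Flank 1, then exactly three bases (captured), then flank 2.
PATTERN = re.compile(re.escape(FLANK1) + r"([ACGT]{3})" + re.escape(FLANK2))


def extract_codon(seq):
    """Return the codon between the flanks, or None if the flanks are not found."""
    match = PATTERN.search(seq.upper())
    return match.group(1) if match else None


# Example sequences copied from the question.
samples = {
    "sample1": "GCGCACCATGGTCAGGCATGCCTCCTCATCGCTGGGCACAGCCCAGAGGGT",
    "sample2": "GGCAGAACCCGCGCACCATGGTCAGGCATGCCACCTCATCGCTGGGCACAGCCCAGA",
    "sample3": "GGCAGATTCCCCGCACCATGGTCAGGCATGCCACTTCATCGCTGGGCACA",
}

for name, seq in samples.items():
    print(f">{name}")
    print(extract_codon(seq))
```

On these inputs the sketch should reproduce the output requested in the question (TCC, ACC, ACT).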

ADD REPLY • link 4.5 years ago by Sishuo Wang ▴ 230

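This works as long as the sequences are not wrapped, i.e. each sequence sits on a single line directly below its header. The one-liner remembers the previous line in $x and prints it as the header when the current line matches: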

perl -n -e '{if($_ =~ /CAGGCATGCC(.+)TCATCGCTGG/){print $x, $1, "\n" };$x=$_}' Test.fa
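If the FASTA is wrapped, a flank or the codon itself can be split across lines, so each record has to be joined before matching. A minimal Python sketch of that idea follows. The function names `read_fasta` and `codons_between` are hypothetical, the parser is hand-written for illustration rather than taken from a library, and the wrapped record is sample1 from the question split by hand.

```python
import re


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def codons_between(lines, flank1, flank2):
    """Yield (header, codon) for records containing flank1 + 3 bases + flank2."""
    pattern = re.compile(re.escape(flank1) + r"([ACGT]{3})" + re.escape(flank2))
    for header, seq in read_fasta(lines):
        match = pattern.search(seq.upper())
        if match:
            yield header, match.group(1)


# sample1 from the question, wrapped so that the first flank spans two lines.
wrapped = [
    ">sample1",
    "GCGCACCATGGTCAGGCAT",
    "GCCTCCTCATCGCTGGGCAC",
    "AGCCCAGAGGGT",
]

for header, codon in codons_between(wrapped, "CAGGCATGCC", "TCATCGCTGG"):
    print(f">{header}")
    print(codon)
```

To run it on a file, an open file handle can be passed in place of the list, for example `codons_between(open("Test.fa"), "CAGGCATGCC", "TCATCGCTGG")`. Records in which the flanks are not found are skipped silently.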

ADD REPLY • link 4.5 years ago by microfuge ★ 2.0k
