4.5 years ago
renyulb
Hi all, given two 10 bp flanking sequences, I would like to extract the codon that sits between them in every sample of a FASTA file. For example:
Flank1:
CAGGCATGCC
Flank2:
TCATCGCTGG
FASTA:
>sample1
GCGCACCATGGTCAGGCATGCCTCCTCATCGCTGGGCACAGCCCAGAGGGT
>sample2
GGCAGAACCCGCGCACCATGGTCAGGCATGCCACCTCATCGCTGGGCACAGCCCAGA
>sample3
GGCAGATTCCCCGCACCATGGTCAGGCATGCCACTTCATCGCTGGGCACA
Output:
>sample1
TCC
>sample2
ACC
>sample3
ACT
I have done the opposite of this, extracting flanking sequences based on coordinates, and I have also extracted the sequence between two coordinates using bedtools getfasta, but I am struggling to extract a sequence based on its flanking nucleotide sequences. Thanks for any help!
In Perl, this works as long as each sequence sits on a single line (not wrapped), because the header is taken to be the line just before the match:
perl -ne 'print $x, $1, "\n" if /CAGGCATGCC(.+?)TCATCGCTGG/; $x = $_' Test.fa
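If the FASTA is wrapped across several lines, the one-liner above will miss any match that spans a line break. A minimal Python sketch of one way around that is below: it joins each record's sequence lines before searching. The function names `read_fasta` and `extract_between` are made up for this illustration, and the sketch assumes uppercase flanks and reports only the first match per record.

```python
import re

def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def extract_between(lines, flank1, flank2):
    """Yield (header, insert) for the first flank1...flank2 match per record."""
    # Non-greedy capture so the match stops at the nearest flank2.
    pattern = re.compile(re.escape(flank1) + "(.+?)" + re.escape(flank2))
    for header, seq in read_fasta(lines):
        match = pattern.search(seq.upper())
        if match:
            yield header, match.group(1)

# Example data from the question; sample1 is wrapped here on purpose.
example = """>sample1
GCGCACCATGGTCAGGCATGCCTCC
TCATCGCTGGGCACAGCCCAGAGGGT
>sample2
GGCAGAACCCGCGCACCATGGTCAGGCATGCCACCTCATCGCTGGGCACAGCCCAGA
>sample3
GGCAGATTCCCCGCACCATGGTCAGGCATGCCACTTCATCGCTGGGCACA
""".splitlines()

results = list(extract_between(example, "CAGGCATGCC", "TCATCGCTGG"))
for header, insert in results:
    print(f">{header}\n{insert}")
```

To run it on a real file, pass an open file handle instead of `example`, e.g. `extract_between(open("Test.fa"), flank1, flank2)`. Records without both flanks are skipped silently, so compare the number of results with the number of samples if every sample is expected to match.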