Question

Extract sequences from a fasta file with specific nucleotide repetition

0

Entering edit mode

3.1 years ago

shabbas12 ▴ 10

I have a fasta file name seqs.fa with multiple sequences i.e.,

>Seq1
GATAGAT**ATC**GAATG**ATC**

>Seq2
GATGATAG**ATC**GATGC

I want grep/extract only those sequences having ATC repeated exactly 2 times like in Seq1. How we can use grep/sed or {} method for this purpose?

linux command • 1.8k views

ADD COMMENT • link updated 3.1 years ago by Michael 55k • written 3.1 years ago by shabbas12 ▴ 10

score 0 · Answer 1 · 2021-10-15

In principle, regular expressions do not count occurrences exactly (at least easily) internally, you might think of /(ATC.*ATC){1,1}/ but that does not work.

So it is easier to use a perl "one-liner", like so:

 perl -e ' $x =()= "ATCGGGGGGATC" =~ /ATC/ig; print "pattern found\n" if $x == 2 '

This code is just for demonstration, you need to add file parsing and output behavior depending on your requirements.

In case the pattern could overlap, like in find exactly 2x ATCATC and you want to include [ATC[ATC]ATC] as a valid hit, an additional check is required.

An alternative for making a regex that works in perl is to specify which pattern not to match in parts of the string, like so:

perl -e 'print "found\n" if  "ATCATATC" =~ /^(?:(?!ATC).)*ATC(?:(?!ATC).)*ATC(?:(?!ATC).)*$/;'

where (?:(?!ATC).)* matches everything except ATC.

score 0 · Answer 2 · 2021-10-15

You may need to take the reverse strand into consideration.

Both strands:

$ seqkit locate -i -p atc seqs.fa | column -t 
seqID  patternName  pattern  strand  start  end  matched
Seq1   atc          atc      +       8      10   atc
Seq1   atc          atc      +       16     18   atc
Seq1   atc          atc      -       15     17   atc
Seq1   atc          atc      -       5      7    atc
Seq1   atc          atc      -       1      3    atc
Seq2   atc          atc      +       9      11   atc
Seq2   atc          atc      -       12     14   atc
Seq2   atc          atc      -       8      10   atc
Seq2   atc          atc      -       4      6    atc
Seq2   atc          atc      -       1      3    atc

$ seqkit locate -i -p atc seqs.fa \
    | csvtk freq -t -f seqID -nr
seqID   frequency
Seq1    5
Seq2    5

# no desired sequences

Only forward strand:

$ seqkit locate -i -p atc --only-positive-strand  seqs.fa | column -t 
seqID  patternName  pattern  strand  start  end  matched
Seq1   atc          atc      +       8      10   atc
Seq1   atc          atc      +       16     18   atc
Seq2   atc          atc      +       9      11   atc

$ seqkit locate -i -p atc --only-positive-strand  seqs.fa \
    | csvtk freq -t -f seqID -nr
seqID   frequency
Seq1    2
Seq2    1

$ seqkit locate -i -p atc -P  seqs.fa \
    | csvtk freq -t -f seqID -nr \
    | awk '$2 == 2' \
    | cut -f 1 \
    | tee list.txt
Seq1

$ seqkit grep -f list.txt seqs.fa 
[INFO] 1 patterns loaded from file
>Seq1
GATAGATATCGAATGATC