Extracting the coordinates for a specific base from multifasta
3
0
Entering edit mode
6.1 years ago
Louis Kok ▴ 30

Hi, I would like to extract coordinates for a specific base, let say all "A" base from multi-fasta. The way I I used to extract from a single sequence (without header) is as below:

grep -aob "A"  input.seq | sed 's/:/ /g' | awk '{print $1+1}'

Is there an easier way to deal with large and multi fasta? Thanks.

sequence fasta • 1.7k views
ADD COMMENT
3
Entering edit mode
6.1 years ago

linearize, read each pair of name/sequence; print the name ; echo the sequence, convert each base to a new line with grep -o ., get the line number with nl, grep the lines ending with 'A'.

 awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fasta  |\
while IFS=$'\t' read A B; do echo $A && echo $B | grep -o . | nl | grep 'A$' ;done

>seq1
     5  A
    12  A
    14  A
    18  A
    23  A
    27  A
    30  A
    32  A
    44  A
>seq2
     7  A
     8  A
    10  A
    14  A
    18  A
    19  A
    23  A
    26  A
    28  A
    34  A
    40  A
    43  A
ADD COMMENT
3
Entering edit mode
6.1 years ago

using seqkit:

$ seqkit locate -ip "a" test.fa 

seqID   patternName pattern strand  start   end matched
seq1    a   a   +   1   1   a
seq1    a   a   +   7   7   a
seq1    a   a   +   11  11  a
seq1    a   a   -   10  10  a
seq1    a   a   -   4   4   a
seq2    a   a   +   1   1   a
seq2    a   a   +   7   7   a
seq2    a   a   +   11  11  a
seq2    a   a   -   10  10  a
seq2    a   a   -   4   4   a

$ cat test.fa 
>seq1
agctggagctacc
>seq2
agctggagctacc
ADD COMMENT
1
Entering edit mode
6.1 years ago
Joe 22k

A Python option:

# Define a test sequence
>>> string = 'ATAGCTACGTACGTACGTACGTACGTACGA'

# A function for finding all substrings
>>> find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# Gives the following indices for 'A' characters
>>> find_all(string, 'A')
[0, 2, 6, 10, 14, 18, 22, 26, 29]

# To apply it to all sequences in a multifasta:
>>> from Bio import SeqIO
>>> seqs = list(SeqIO.parse('seqs.fa','fasta'))
>>> [find_all(str(s.seq), 'A') for s in seqs]
[[80, 85, 92, 97, 98, 101, 103, 119, 121, 122, 125, 127, 129, 130, 148],    # First seq
 [86, 98, 103, 104, 107, 125, 127, 128, 131, 133, 135, 136, 154],           # Second seq, etc.
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 121, 123, 124, 142], 
 [74, 79, 86, 91, 92, 95, 113, 115, 116, 119, 120, 121, 123, 124, 142]]

(Input data used)

>mutant
GTTGGGAGGCTATGTGTTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>gorrila
GTTGGGAGGCTATGTGTGACTGGAAGGACATCCTGTCGGGTGGCGAGAAGCAGAGAATC
>chimpanze
GTTGGGAGGCTGTGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>human
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAGAGAATC
>olive
GTTGGGAGGCTATGTGTGACTGGAAGGACGTCCTGTCGGGTGGCGAGAAGCAAAGAATC
ADD COMMENT

Login before adding your answer.

Traffic: 2997 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6