Question

Match, count and extract the positions of exact strings of nucleotides from multiple fasta files

2

Entering edit mode

7.8 years ago

ghataksnd ▴ 20

Hi All Members This is my first post. I am new to bioinformatics. Please help. I have a couple of bacterial genomes in fasta files. I want to search, count and extract the start and end positions of homopolymer repeats (strings of A's, G's, C's, T's) in them. The output should preferably in a tab delimited text file. For example, My fasta files are

F1.fasta

>X complete genome
ATGCCGATCTTTTCCCCCCTTAAAAAAAGGAAAAGGGAAAGGGGCCCTCAAAAAAAAAGGTACCTACGATCGA

F2.fasta

>Y complete genome
ATGCCGATCTTTTCCCCCCGCAAAAAGGAAAAGGGAAAGGGAAAGCCCAAAAAAAAAAAGGTGCCTTCGATCGATT

Now, if I search for 'AAA', 'AAAA' and 'TT' the output should ideally be

AAA(tab)F1.fasta or X(tab)1(tab)38(tab)40(newline)
AAA(tab)F2.fasta or Y(tab)2(tab)36(tab)38(newline)
AAA(tab)F2.fasta or Y(tab)2(tab)42(tab)44(newline)
AAAA(tab)F1.fasta or X(tab)1(tab)31(tab)34(newline)
AAAA(tab)F2.fasta or Y(tab)1(tab)29(tab)32(newline)
TT(tab)(F1.fasta or X(tab)1(tab)20(tab)21(newline)
TT(tab)(F2.fasta or Y(tab)2(tab)66(tab)67(newline)
TT(tab)(F2.fasta or Y(tab)2(tab)75(tab)75(newline)

*Other strings of A's or T's should not be counted.

Could anyone please help me with a solution? Any help will be appreciated.

fasta match position count pattern-matching • 6.6k views

ADD COMMENT • link updated 7.8 years ago by Petr Ponomarenko ★ 2.8k • written 7.8 years ago by ghataksnd ▴ 20

1

Entering edit mode

Is this a homework? Have you tried something??

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

0

Entering edit mode

Not at all. The question might be very novice. But I am just new in the area of bioinformatics.

ADD REPLY • link 7.8 years ago by ghataksnd ▴ 20

0

Entering edit mode

What's your original purpose? It does looks like a homework. If not, you can write a tiny script in Perl/Python to fulfill your special out format.

seqkit locate can do similar job but the output format is a little different.

$ seqkit locate -f motifs.fa *.fasta
seqID   patternName   pattern   strand   start   end   matched
X       AAA           AAA       +        22      24    AAA
X       AAAA          AAAA      +        22      25    AAAA
X       TT            TT        +        10      11    TT
Y       AAAA          AAAA      +        22      25    AAAA
Y       TT            TT        +        10      11    TT
Y       AAA           AAA       +        22      24    AAA

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

0

Entering edit mode

Hi shenwei356

Thank you so much for your kind reply. Yes your seqkit does a fine job in motif search. However, if you look into the result the 'AAA' (strictly the triplet) appears in the F1.fasta between 38 and 40. At position 22 to 24 it is part of a longer string of A's. My question is how can I get the exact matches and their positions and counts?

I tried grep as below

grep -Eo 'AAA|AAAA|AAAAA|AAAAAA|AAAAAAA|AAAAAAAA|TT|TTT|TTTT|TTTTT|TTTTTT' *.fsa | sort | uniq -c

Results are as below without the positions Count Name String 1 F1.fsa:AAA 1 F1.fsa:AAAA 1 F1.fsa:AAAAAAA 1 F1.fsa:AAAAAAAA 1 F1.fsa:TT 1 F1.fsa:TTTT 3 F2.fsa:AAA 1 F2.fsa:AAAA 1 F2.fsa:AAAAA 1 F2.fsa:AAAAAAAA 2 F2.fsa:TT

1 F2.fsa:TTTT

Finally, no it is not homework, nor class assignment, though work indeed it is. I am looking into the bacterial genomes which are AT rich. Yet we have many mononucleotide repeats of G and C. Since mononucleotide repeats are known to have some regulatory function in gene expression, we want to explore this. Location is important because I want to know whether these repeats are within the CDS. If so which protein do these CDS code for? I hope I could make myself clear that it is not homework! Any help will be appreciated.

ADD REPLY • link 7.8 years ago by ghataksnd ▴ 20

1

Entering edit mode

You may change the motifs (regular expression), e.g.

$ cat motifs.fa 
>As
A{3,}
>Ts
T{3,}
>ATs
[AT]{6,}

$ seqkit locate -P -f motifs.fa *.fasta \
    | csvtk -t sort -k patternName  -k seqID -k start \
    | csvtk -t pretty

seqID   patternName   pattern    strand   start   end   matched
X       ATs           [AT]{6,}   +        20      28    TTAAAAAAA
X       ATs           [AT]{6,}   +        50      58    AAAAAAAAA
Y       ATs           [AT]{6,}   +        49      59    AAAAAAAAAAA
X       As            A{3,}      +        22      28    AAAAAAA
X       As            A{3,}      +        31      34    AAAA
X       As            A{3,}      +        38      40    AAA
X       As            A{3,}      +        50      58    AAAAAAAAA
Y       As            A{3,}      +        22      26    AAAAA
Y       As            A{3,}      +        29      32    AAAA
Y       As            A{3,}      +        36      38    AAA
Y       As            A{3,}      +        42      44    AAA
Y       As            A{3,}      +        49      59    AAAAAAAAAAA
X       Ts            T{3,}      +        10      13    TTTT
Y       Ts            T{3,}      +        10      13    TTTT

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

0

Entering edit mode

Hi shenwei356 Thank you for your reply. Yes seqkit tool is also working nice with your suggestion. Thanks.

ADD REPLY • link 7.8 years ago by ghataksnd ▴ 20

score 1 · Answer 1 · 2017-02-18

1

Entering edit mode

7.8 years ago

Petr Ponomarenko ★ 2.8k

Here is an awk one-liner for your task. It works with fasta files containing multiple sequences. I do not understand what 1 and 2 after X and Y means in your example output. Hope you can change my code a bit to get formatting you want. This looks for all single nucleotide repeats and encodes them as An so for AAA it is A3. This is just to make your work easier with super big files where long strings of Ns are possible.

awk '$0~">"{name=$1;pos=1}$0!~">"{pos_in=1;while(nuc=substr($0,pos_in,1)){pos_out=pos_in+1;while(nuc==substr($0,pos_out,1)){pos++;pos_out++};len=pos_out-pos_in;if(len>1){print name,nuc"*"len,pos-len+1,pos};pos_in=pos_out;pos++}}' F1.fasta

awk reads file line by line. If the line contains > it considers it the name of the sequence and sets position counter to 1. Otherwise, it goes in line to read the sequence and first sets local position counter pos_in to 1. Then it uses while loop to read every character if the next character is the same it opens a new while loop that continues to read characters one by one until character changes. Right after that loop awk prints out information about the repeat and then continue to the main while loop. Fasta file can have sequence spread across multiple lines so it is also pos for a global position in your sequence and pos_in that is local for the current line.

You can add FILENAME variable in awk print statement so you can work efficiently with multiple files.

ADD COMMENT • link 7.8 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Hi Petr Ponomarenko Thank you so much for the one liner. It is working perfectly. As you suggested I have added a header row and FILENAME variable and your code is working nice. Now it looks like awk 'BEGIN{print "File Name\tGenome\tRepeat\tStart\tEnd"}; $0~">"{name=$1;pos=1}$0!~">"{pos_in=1;while(nuc=substr($0,pos_in,1)){pos_out=pos_in+1;while(nuc==substr($0,pos_out,1)){pos++;pos_out++};len=pos_out-pos_in;if(len>1){print FILENAME "\t",name "\t",nuc"*"len "\t",pos-len+1 "\t",pos};pos_in=pos_out;pos++}}' *.fsa Thank you. YOU ARE A STAR :)

ADD REPLY • link 7.8 years ago by ghataksnd ▴ 20

1

Entering edit mode

You are welcome.

As a side note task for all of us who learn bioinformatics here: Can anyone come up with a way to count all potential repeats up to N long, so AGAGAGAG, ATTATTATT, etc can be detected as well? Can it be done in one-liner with awk or it is more efficient with functions and thus python, ruby, c++, perl, R, etc?

ADD REPLY • link 7.8 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

just use regular expression I mentioned above, e.g., [AG]{6,}, [AT]{6,}.

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

1

Entering edit mode

Edited. I forgot that [at]===a or t while (at) is for grouping with order and matches only to at.

I understand that concept, but what the topic starter was asking is exact match so regexp is [^A][A]{3}[^A]. This is why in your output

X       As            A{3,}      +        50      58    AAAAAAAAA

is not what ghataksnd wants. And my question is to write an efficient code that can find all possible exact repeats of repeated sequence length up to N. So the code looks at AA, AAA, ATAT, ATTATT, ATCATC, ATCATCATC and so on automatically. So if the sequence is TAAAGAGG the answer should be:

AAA 1 4
AGAG 4 7
GG 7 8

or

A*3 1 4
AG*2 4 7
G*2 7 8

ADD REPLY • link 7.8 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Yes that is exactly what I was looking for. And the side question you asked opened a new idea of finding exact repeats.

ADD REPLY • link 7.8 years ago by ghataksnd ▴ 20

0

Entering edit mode

I'm a little confused, we just need change the regular expressions.

Your AG*2 in regular expression is (AG){2}, A*3 is A{3}.

[^AG] at the start and end of [^AG][AG]{6}[^AG] is redundant. You may want (AG){6}.

ADD REPLY • link 7.8 years ago by shenwei356 8.7k

1

Entering edit mode

Now I am a bit confused. How can you use regexp that does not say anything about boundaries of the string if boundaries have to be very specific and thus included in the search? When topic starter search for AA in string TAAAAAGG in the output he/she wants to see nothing at all! If you make regexp (A){2} it will match to positions 2, 3, 4 and 5. Topic starter wants regexp for AA to match nothing at all in that string and the only way I can think of is to include information about the boundaries and use [^A][A]{2}[^A] so before and after AA there should be no A.

ADD REPLY • link 7.8 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

That is why in this particular case awk oneliner suggested by Petr Ponomarenko is more suitable than the seqkit. Boundaries are necessary. Because match(es) has/have to be 'exact'