Hi all members, this is my first post. I am new to bioinformatics. Please help. I have a couple of bacterial genomes in FASTA files. I want to search for, count, and extract the start and end positions of homopolymer repeats (strings of A's, G's, C's, T's) in them. The output should preferably be in a tab-delimited text file. For example, my FASTA files are:
F1.fasta
>X complete genome
ATGCCGATCTTTTCCCCCCTTAAAAAAAGGAAAAGGGAAAGGGGCCCTCAAAAAAAAAGGTACCTACGATCGA
F2.fasta
>Y complete genome
ATGCCGATCTTTTCCCCCCGCAAAAAGGAAAAGGGAAAGGGAAAGCCCAAAAAAAAAAAGGTGCCTTCGATCGATT
Now, if I search for 'AAA', 'AAAA' and 'TT' the output should ideally be
- AAA(tab)F1.fasta or X(tab)1(tab)38(tab)40(newline)
- AAA(tab)F2.fasta or Y(tab)2(tab)36(tab)38(newline)
- AAA(tab)F2.fasta or Y(tab)2(tab)42(tab)44(newline)
- AAAA(tab)F1.fasta or X(tab)1(tab)31(tab)34(newline)
- AAAA(tab)F2.fasta or Y(tab)1(tab)29(tab)32(newline)
- TT(tab)F1.fasta or X(tab)1(tab)20(tab)21(newline)
- TT(tab)F2.fasta or Y(tab)2(tab)66(tab)67(newline)
- TT(tab)F2.fasta or Y(tab)2(tab)75(tab)76(newline)
*Other strings of A's or T's should not be counted.
Could anyone please help me with a solution? Any help will be appreciated.
Is this homework? Have you tried anything?
Not at all. The question may be very basic, but I am just new to the area of bioinformatics.
What's your original purpose? It does look like homework. If not, you can write a tiny script in Perl/Python to produce your special output format.
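A tiny Python script of the kind suggested here could look roughly like the sketch below. It is only an illustration under some assumptions: the function names (`read_fasta`, `exact_runs`, `report`) are made up for this example, the sequence name is taken to be the first word of the FASTA header, and the count column is interpreted as the number of exact hits per motif per sequence. A run is reported only when its full length equals the motif's length, so an AAA inside a longer string of A's is skipped.

```python
import sys
from collections import Counter
from itertools import groupby


def read_fasta(path):
    """Yield (record_id, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Assumption: the record ID is the first word of the header.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def exact_runs(seq, motifs):
    """Return (motif, start, end) for every homopolymer run whose whole
    length equals one of the motifs. Positions are 1-based, inclusive."""
    wanted = set(motifs)
    hits, pos = [], 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        run = base * length
        if run in wanted:
            hits.append((run, pos + 1, pos + length))
        pos += length
    return hits


def report(paths, motifs, out=sys.stdout):
    """Write tab-delimited rows: motif, sequence ID, count, start, end."""
    rows = []
    for path in paths:
        for name, seq in read_fasta(path):
            hits = exact_runs(seq, motifs)
            counts = Counter(motif for motif, _, _ in hits)
            rows.extend((motif, name, counts[motif], start, end)
                        for motif, start, end in hits)
    for row in sorted(rows):
        out.write("\t".join(map(str, row)) + "\n")
```

Calling `report(["F1.fasta", "F2.fasta"], ["AAA", "AAAA", "TT"])` would print one row per exact run, sorted by motif and then by sequence ID. For the F1 sequence in the question, `exact_runs` finds TT at 20 to 21, AAAA at 31 to 34 and AAA at 38 to 40.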
seqkit locate can do a similar job, but the output format is a little different.
Hi shenwei356
Thank you so much for your kind reply. Yes, your seqkit does a fine job in motif search. However, if you look at the result, 'AAA' (strictly the triplet) appears in F1.fasta between positions 38 and 40, whereas at positions 22 to 24 it is part of a longer string of A's. My question is: how can I get only the exact matches, with their positions and counts?
I tried grep as below
grep -Eo 'AAA|AAAA|AAAAA|AAAAAA|AAAAAAA|AAAAAAAA|TT|TTT|TTTT|TTTTT|TTTTTT' *.fsa | sort | uniq -c
Results are as below, without the positions:
Count Name String
1 F1.fsa:AAA
1 F1.fsa:AAAA
1 F1.fsa:AAAAAAA
1 F1.fsa:AAAAAAAA
1 F1.fsa:TT
1 F1.fsa:TTTT
3 F2.fsa:AAA
1 F2.fsa:AAAA
1 F2.fsa:AAAAA
1 F2.fsa:AAAAAAAA
2 F2.fsa:TT
1 F2.fsa:TTTT
Finally, no it is not homework, nor class assignment, though work indeed it is. I am looking into the bacterial genomes which are AT rich. Yet we have many mononucleotide repeats of G and C. Since mononucleotide repeats are known to have some regulatory function in gene expression, we want to explore this. Location is important because I want to know whether these repeats are within the CDS. If so which protein do these CDS code for? I hope I could make myself clear that it is not homework! Any help will be appreciated.
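Once repeat positions are known, checking whether a repeat lies within a CDS comes down to an interval comparison. The sketch below assumes CDS coordinates are already available as (name, start, end) tuples in the same 1-based, inclusive convention, for instance taken from an annotation file. The function name `cds_containing` and the gene names and coordinates are hypothetical, invented purely for illustration.

```python
def cds_containing(start, end, cds_features):
    """Return the names of CDS features that fully contain the
    1-based, inclusive interval [start, end]."""
    return [name for name, cds_start, cds_end in cds_features
            if cds_start <= start and end <= cds_end]


# Hypothetical CDS coordinates, NOT taken from any real annotation.
example_cds = [("geneA", 1, 30), ("geneB", 36, 60)]

print(cds_containing(38, 40, example_cds))  # ['geneB']
print(cds_containing(31, 34, example_cds))  # []
```

This only reports repeats that sit entirely inside a CDS. A repeat that straddles a CDS boundary would need an overlap test instead of a containment test.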
You may change the motifs (regular expression), e.g.
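To illustrate the general idea of tightening the regular expression, here is a sketch in Python rather than in seqkit syntax. It is an assumption about one way to do it: lookarounds require that the motif is not flanked by its own base, so only runs of exactly that length match. The helper name `exact_motif_pattern` and the toy sequence are made up for this example.

```python
import re


def exact_motif_pattern(motif):
    """Build a regex that matches a homopolymer motif only when it is
    not preceded or followed by the same base."""
    if not motif or set(motif) != {motif[0]}:
        raise ValueError("motif must be a non-empty homopolymer, e.g. 'AAA'")
    base = re.escape(motif[0])
    return re.compile(f"(?<!{base}){re.escape(motif)}(?!{base})")


# Toy sequence (not from the genomes above): one AAA run and one AAAAA run.
seq = "GGAAAGGAAAAAGG"
for match in exact_motif_pattern("AAA").finditer(seq):
    # Convert to 1-based, inclusive coordinates.
    print(match.start() + 1, match.end())  # 3 5
```

Only the isolated AAA at positions 3 to 5 is reported. The AAA substrings inside the run of five A's are rejected by the lookarounds.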
Hi shenwei356, thank you for your reply. Yes, the seqkit tool also works nicely with your suggestion. Thanks.