Hello, I would like some suggestions for the following problems. I have many multiple alignments from mafft, in fasta format. Eahc of these alignment has a specific sequence for which I am interested to know 1 thing: 1) is there at least a gapless alignment block length of at least 180 base pairs?
Here is a sample, let's say I am interested to query the sequence "Mitra_TRINITY_DN40971_c28_g1_i1" and check for the length of alignment block in hope to find one of at least 180 bp (I don't know if the example I give here satisfies this condition it's just to display the file format).
Any existing parser or quick awk hack (in fact I feel there should be a rather easy awk hack but can't figure it out)? Thank you for any guidance :)
>Mitra_TRINITY_DN40971_c28_g1_i1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------t--------------------------------------ctc
aattattactgctt------tccaaaatcatgctgcac------------atcaattaat
gaggtgat-------ctgtcatggaattcctgacatgagaccattagaa--agtggtgac
attatcaacattgacat-----cactgcctaccatcggaattttcatggagacttgaacg
agaccatgtttgtcggggaggtggatgaggccagcaaagagctagtgaagacgacgcatg
aatgtctgatgatggccatcgatgaagtgaaaccaggggtg-------------------
------------------------------------------------------------
------cggtacagagagatgggaaacatcatccagaaa---cacgcccagtctcatggc
tactctgttgtccgcagctactgcggtcatgggattcaccaattgttccacactgctcca
agtgtgcctcattatgggaagaataaa------------gccatcggtgtgatgaaacca
ggtcatactttcaccattgagccaatgatatctcaaggcacctggcgtgatgagatgtgg
ccagacaactggactgcagtgacacaggacggcaaacgctcagcgcagtttgagcacaca
ctgctggtgact------------------------------------------------
-------------------------gacacaggatgcgaagtcttgacccgtcgtcgggc
aaatgatggacagccccactttatggactccaaatga------------------tggac
agcca-------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
----------------------------------------------------
>Mitra_TRINITY_DN40971_c28_g2_i1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
---------------------------------------------------------gtg
ctcttccgatctgccaaagaaacac-----------------------------------
---------------------aggaggagagcaagaatggttacaacccattccccggct
ttcgttacacgggatcattgagagcattccctgtgacaccaaagagag------------
-agatgcctgacacaatccctcgtcctgattatgctgaccattcagaaggtatccccctg
tcagagcgccagatgagaggttcgacaaacatcaaacagctggacgatgaagagcaggaa
ggcatgcgagttgtgtgcaagttggctcgggaaatcctggatgaagcagccagtgctgtt
ggcattggggtaactacagatgaaattgacagactggtccatgagg-caacaatagacag
agagtgctatccatctcct--------------------------------------ctg
aactactactgctt------tcccaaatcatgctgcac------------gtctatcaat
gaagtgat-------ctgtcatggaattcctgacacgaggccattggaa--aatggtgac
attgtcaacattgacat-----tactgcgtaccatcggagctttcatggggacttaaacg
agaccatgtttgtcggagaggtagacgaggccagcaaggagctggtgaagaccacacacg
aatgcctcatgatggccattgaagaagtgaaacccggtgtg-------------------
------------------------------------------------------------
------cggtacagagagatgggtaacatcatccagaag---catgcccaggcccatgga
tactctgttgttcgcagctactgtggccatggt--------------------------
see if this helps:
seqkit can be downloaded from here
with awk:
First part of the code linearizes the fasta (from web/biostars) and second part does the job.