I have used MUSCLE to align a newly assembled genome to a reference genome. The seqs.afa file looks like this:
>reference
--------------------------ACTGAC
ACTGACTGACTGACTGACTGACTGACTGACTG
... Lots of bases here .........
ACTGACTGACTGACTGACTGACTGACTG----
-----------------
>my_assembly
AAAAAAAAAAAAAAAAAAAAAAAAAAACTGAC
ACTGACTGACTGACTGACTGACTGACTGACTG
... Lots of bases here .........
ACTGACTGACTGACTGACTGACTGACTGAAAA
AAAAAAAAAAAAAAAAA
As you can see, my program tends to leave dangling bases upstream and downstream of the reference genome, and I need to get rid of them in post-processing. Is there a program or a simple Python script I can use to trim the bases that overhang the reference? How can I do this efficiently? (I have a considerable number of data sets.)
I adapted the script and it doesn't work. Sometimes, if one of the aligned sequences has a relatively large gap in the middle, everything downstream of that gap is eliminated.
Is your input data in the same format as the OP's in that post? No line breaks in the sequences?
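If line breaks or internal gaps are the issue, here is a minimal sketch that might help. It joins wrapped sequence lines before doing anything else, and it only looks at the leading and trailing gap run of the reference, so a gap in the middle of either sequence should not affect where the cut happens. The function names read_fasta and trim_to_reference are made up for this example, and it assumes the reference record is literally named "reference" and that gaps are written as "-", as in the file shown in the question.

```python
def read_fasta(lines):
    """Parse aligned FASTA from an iterable of lines, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # dicts keep insertion order, so records come back in file order
    return {n: "".join(parts) for n, parts in records.items()}


def trim_to_reference(records, ref_name="reference", gap="-"):
    """Cut every aligned sequence to the columns spanned by the reference.

    Only the gap runs at the very start and very end of the reference
    decide the cut points; gaps inside the alignment are left alone.
    """
    ref = records[ref_name]
    start = len(ref) - len(ref.lstrip(gap))  # number of leading gap columns
    end = len(ref.rstrip(gap))               # index just past the last reference base
    return {n: seq[start:end] for n, seq in records.items()}


if __name__ == "__main__":
    # Tiny wrapped alignment with an internal gap in the reference
    example = [
        ">reference", "---ACTG", "-ACTG--",
        ">my_assembly", "AAAACTG", "AACTGAA",
    ]
    for name, seq in trim_to_reference(read_fasta(example)).items():
        print(f">{name}\n{seq}")
```

For a real file you would pass an open file handle instead of the list, e.g. `with open("seqs.afa") as fh: records = read_fasta(fh)`. Since it is a single pass over the file plus one slice per sequence, it should be fast enough to loop over many data sets.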