Question

Most efficient way to trim overhanging bases after alignment

8.6 years ago

joreamayarom ▴ 140

I have used muscle to align a newly assembled genome to a reference genome. The seqs.afa file looks like this:

>reference
--------------------------ACTGAC
ACTGACTGACTGACTGACTGACTGACTGACTG
... Lots of bases here .........
ACTGACTGACTGACTGACTGACTGACTG----
-----------------

>my_assembly
AAAAAAAAAAAAAAAAAAAAAAAAAAACTGAC
ACTGACTGACTGACTGACTGACTGACTGACTG
... Lots of bases here .........
ACTGACTGACTGACTGACTGACTGACTGAAAA
AAAAAAAAAAAAAAAAA

As you can see, my program tends to leave dangling bases upstream and downstream of the reference genome, and I need to remove them in post-processing. Is there a program or a simple Python script I can use to trim the bases that overhang the reference? How can I do this efficiently? (I have a considerable number of data sets.)

muscle fasta • 3.5k views

ADD COMMENT • link updated 8.5 years ago by Suzanne ▴ 100 • written 8.6 years ago by joreamayarom ▴ 140

score 1 · Answer 1 · 2016-09-19

8.6 years ago

5heikki 11k

Get inspired by this


I adapted the script and it doesn't work. Sometimes, if one of the aligned sequences has a relatively big gap in the middle, everything downstream is eliminated.

ADD REPLY • link 8.6 years ago by joreamayarom ▴ 140

Is your input data in the same format as OP's of that post? No linebreaks in sequences?

ADD REPLY • link 8.6 years ago by 5heikki 11k
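
Both problems raised in this thread (a large internal gap causing everything downstream to be dropped, and line-wrapped sequences) can be avoided by joining each record's lines first and then cutting every sequence to the span between the first and last non-gap column of the reference. The following is only a minimal sketch, not the linked script: the function names `read_fasta` and `trim_to_reference` are made up for illustration, and it assumes the reference record is named `reference` and that `-` is the only gap character, as in the question's example.

```python
def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # record name is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def trim_to_reference(records, ref_name="reference", gap="-"):
    """Cut all aligned sequences to the columns spanned by the reference.

    Only leading and trailing gaps of the reference define the cut points,
    so gaps in the middle of any sequence are left untouched.
    """
    ref = records[ref_name]
    if any(len(seq) != len(ref) for seq in records.values()):
        raise ValueError("sequences are not all the same aligned length")
    start = len(ref) - len(ref.lstrip(gap))  # first column with a reference base
    end = len(ref.rstrip(gap))               # one past the last reference base
    if start >= end:
        raise ValueError("reference contains no bases")
    return {name: seq[start:end] for name, seq in records.items()}


# Small inline example; the reference has an internal gap and overhangs on both sides.
example = """>reference
--AC----GT--
>my_assembly
TTACGG
GGGTTT
"""
trimmed = trim_to_reference(read_fasta(example.splitlines()))
for name, seq in trimmed.items():
    print(f">{name}\n{seq}")
```

For a real file, passing an open file handle to `read_fasta` (for example `open("seqs.afa")`) should work the same way, since it only iterates over lines. Each alignment is processed in a single pass with plain string slicing, so running it over many data sets should be cheap.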

score 0 · Answer 2 · 2016-09-19

8.6 years ago

shenwei356 8.7k

trimAl (http://trimal.cgenomics.org/publications) can trim multiple sequence alignments.


score 0 · Answer 3 · 2016-10-20

Jalview (www.jalview.org) is a good visualisation workbench for multiple sequence alignments. Along with several useful editing features (check out the YouTube video), it has a "pad gaps" feature that can be toggled on and off. When it is selected, the alignment is kept at minimal width (so there are no empty columns before the first or after the last aligned residue), and all sequences are padded with gap characters before and after their terminal residues. The pad gaps feature is demonstrated at 3:20 min in this video. The alignment can then be exported in a variety of file formats.