Most efficient way to trim overhanging bases after alignment
8.3 years ago
joreamayarom ▴ 140

I have used muscle to align a newly assembled genome to a reference genome. The seqs.afa file looks like this:

>reference
--------------------------ACTGAC
ACTGACTGACTGACTGACTGACTGACTGACTG
... Lots of bases here .........
ACTGACTGACTGACTGACTGACTGACTG----
-----------------

>my_assembly
AAAAAAAAAAAAAAAAAAAAAAAAAAACTGAC
ACTGACTGACTGACTGACTGACTGACTGACTG
... Lots of bases here .........
ACTGACTGACTGACTGACTGACTGACTGAAAA
AAAAAAAAAAAAAAAAA

As you can see, my program tends to leave dangling bases upstream and downstream of the reference genome, and I need to get rid of them in post-processing. Is there a program or a simple Python script I can use to trim the bases that overhang the reference? How can I do this efficiently? (I have a considerable number of data sets.)

muscle fasta • 3.3k views
8.3 years ago
5heikki 11k

Get inspired by this


I adapted the script, but it doesn't work. Sometimes, if one of the aligned sequences has a relatively large gap in the middle, everything downstream of that gap is removed.
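The symptom described here would be consistent with a script that cuts at the first gap it meets instead of looking only at the leading and trailing gap runs of the reference. A minimal sketch of the alternative, assuming a pairwise alignment where both sequences are already single, equal-length strings and gaps are written as `-`: find the first and last non-gap column of the reference and slice both sequences to that window, so internal gaps are left alone. The function names `reference_span` and `trim_to_reference` are made up for illustration and are not from the linked script.

```python
def reference_span(aligned_ref, gap="-"):
    """Return (start, end) column indices spanning the first to last non-gap reference base."""
    # Number of leading gap characters in the reference.
    start = len(aligned_ref) - len(aligned_ref.lstrip(gap))
    # Index just past the last non-gap character in the reference.
    end = len(aligned_ref.rstrip(gap))
    return start, end


def trim_to_reference(aligned_ref, aligned_query, gap="-"):
    """Cut both aligned sequences to the columns covered by the reference.

    Only leading and trailing reference gaps define the window,
    so a large internal gap does not truncate the result.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    start, end = reference_span(aligned_ref, gap)
    return aligned_ref[start:end], aligned_query[start:end]


# Toy alignment with overhangs on both ends and an internal reference gap.
ref = "---ACTG----ACTG--"
asm = "AAAACTGACTGACTGAA"
trimmed_ref, trimmed_asm = trim_to_reference(ref, asm)
print(trimmed_ref)
print(trimmed_asm)
```

Because the work is two string strips and two slices per alignment, it should scale comfortably to many data sets; whether the trimmed assembly should then have its remaining gap characters removed depends on what is done with it next.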


Is your input data in the same format as OP's of that post? No linebreaks in sequences?
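If line breaks are the issue, one option is to join the wrapped lines of each record before trimming. The following is a rough sketch, assuming a plain FASTA-style alignment like the seqs.afa shown in the question, with header lines starting with `>` and sequence lines wrapped at arbitrary width; `read_fasta` is a hypothetical helper written for this illustration, not part of MUSCLE or of the linked script.

```python
def read_fasta(lines):
    """Collect FASTA records into a dict of name -> single unwrapped sequence string."""
    records = {}
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            chunks.append(line)  # wrapped sequence line, joined later
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Toy input with sequences wrapped over two lines each.
example = """>reference
--AC
TG--
>my_assembly
AAAC
TGAA
""".splitlines()

records = read_fasta(example)
print(records["reference"])
print(records["my_assembly"])
```

The function accepts any iterable of lines, so an open file handle for seqs.afa could be passed in place of the toy list.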

8.3 years ago

trimAl (http://trimal.cgenomics.org/publications) can trim multiple sequence alignments.

8.2 years ago
Suzanne ▴ 100

Jalview (www.jalview.org) is a good visualisation workbench for multiple sequence alignments. Along with several useful editing features (check out the YouTube video), it has a pad gaps feature that can be toggled on and off. When selected, the alignment is kept at a minimal width (so there are no empty columns before the first or after the last aligned residue) and all sequences are padded with gap characters before and after their terminating residues. The pad gaps feature is demonstrated at 3.20 min in this video. The alignment can then be exported in a variety of file formats.
