How To Reformat Msa Output To Show Only 50 Nucleotide Sequences In Each Line
2
1
Entering edit mode
14.5 years ago
Thaman ★ 3.3k

I have Multiple sequence alignment (clustal) file and I want to read this file and arrange nucleotide sequences in such a way that it looks more clear and precise in order.

I am doing this from biopython using AlignIO object. My codes goes like this:

alignment = AlignIO.read("opuntia.aln", "clustal")
print "Number of rows: %i" % len(align)
for record in alignment:
    print "%s - %s" % record.id, record.seq)

I have been more fascinated by the output of http://www.ebi.ac.uk/Tools/clustalw2/ project because it looks clear.

My Output -- , it looks messy and long scrolling. What i want to do is print only 50 sequences (nucleotide) in each line and continue till the end of alignment file.

I wish to have output like this from http://www.ebi.ac.uk/Tools/clustalw2/.

Any suggestions, algorithm and sample code is appreciated

Here is the link to the input file-- http://pastebin.com/EaeKsyvg

Thanks in advance

Br,

clustalw multiple • 4.6k views
ADD COMMENT
1
Entering edit mode

By '50 sequences', do you mean 50 nucleotides? @Thaman: I have seen a few of your questions and they are all great, in the sens that they have the potential to be very useful to others. However, it would be much better if you took the time to formulate them more clearly, giving both more details about the context and what you desire, and also making it more English. That way, you will get much better replies and won't have to ask the same question twice because the first version was confusing, as happened recently. Thanks for your contribution! Cheers.

ADD REPLY
0
Entering edit mode

Sequence ID and 50 sequences? That's not clear.

ADD REPLY
0
Entering edit mode
gi|6273285|gb|AF191659.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA 50
ADD REPLY
0
Entering edit mode

I mean like this...........

ID                             Sequences
gi|6273285-----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
ADD REPLY
0
Entering edit mode
ID--------------------Sequence----------------------------------------
gi|6273285|gb|------TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
ADD REPLY
0
Entering edit mode
ID------Sequence----------------------------------------
gi|6273285|gb|----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAA
gi|6273284|gb|----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAA
gi|6273287|gb-----TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAA
ADD REPLY
0
Entering edit mode

@Eric, thanks for your kind suggestion, i have edited my query and make it more precise. I am just trying to make alignment output similar to http://www.ebi.ac.uk/Tools/clustalw2/ project. According to my output its long nucleotide record, but i want to split it into 50 nucleotide each in individual line. Thanks for listening

ADD REPLY
0
Entering edit mode

Could you please put the input data on the web. That way we can write a code and test it withouth having to generate the input data manually. Just paste it into http://pastebin.com/ and link it.

ADD REPLY
0
Entering edit mode

Could you please put the input data (alignment) on the web. That way we can write a code and test it without having to generate the input data manually. Just paste it into pastebin.com and link it

ADD REPLY
0
Entering edit mode

How can i link it to Istan Albert?

ADD REPLY
0
Entering edit mode

just edit the post and add the link into the main post

ADD REPLY
0
Entering edit mode

Ok now finally alignment file is here.

ADD REPLY
0
Entering edit mode

okei i provided the input file link in the main post.

ADD REPLY
0
Entering edit mode

The input file is the file you want to transform, the one that has long lines. The file that you provide seems like the result of the transformation.

ADD REPLY
0
Entering edit mode

I tried with readlines() and seems working, seems likes i couldn't make my problem clear to you.

ADD REPLY
5
Entering edit mode
14.5 years ago

Here is my solution with Python, using the formatter by nuin. It transforms data from the original format shown here to the a visually more pleasing format shown here

import string
from itertools import islice

# read the file
lines = file('input.txt').readlines()

# remove empty lines if present
lines = map(string.strip, lines)
lines = filter(None, lines)

# break each line into id and sequence pairs
pairs = map(string.split, lines)

# this breaks each sequence into the small pieces
# generates each individual line in the file
def chunk(pair, length=50):
    seqid, sequence = pair
    temp = []
    for j in range(0,len(sequence),length):
        temp.append( (seqid + ' ' + sequence[j:j+length]) )
    return temp

# here we apply the chunking function
data = map(chunk, pairs)

# this is really an idiomatic expression to perform 
# the inverse of what zip usually does
transposed = zip( *data )

# generate the output
for elems in transposed:
    print '\n'.join(elems)
    print

The only tricky part is the zip(*data), to best understand what it does see:

a = [ (1,2,3), (4,5,6) ]
b = zip(*a)
c = zip(*b)

print a
print b
print c

it will print:

[(1, 2, 3), (4, 5, 6)]
[(1, 4), (2, 5), (3, 6)]
[(1, 2, 3), (4, 5, 6)]
ADD COMMENT
0
Entering edit mode

@ Istvan, how can i get clustal multiple sequence alignment scores in each line like on the right side column of

ADD REPLY
0
Entering edit mode

it that alignment score or simply the step count, 50, 100, 150 etc. You see this is why it is important to give people access to the actual files for both the input and output, otherwise we are left guessing

ADD REPLY
0
Entering edit mode

I am trying to make project using Clustalw2(multiple sequence alignment) biopython wrapper. Its somehow similar to http://www.ebi.ac.uk/Tools/ -> sequence analysis-> clustalw. Fasta Input file for example is (http://pastebin.com/JA76QTJG) and output alignment file is (http://pastebin.com/M4KQX46u). Now i want to read the alignment file and show (id, sequence and scores) records. I'm not sure whether its count or alignment score.

ADD REPLY
2
Entering edit mode
14.5 years ago
Paulo Nuin ★ 3.7k

This is a function that you can use to do that

def format_output(sequence, length):
    temp = []
    for j in range(0,len(sequence),length):
        temp.append(sequence[j:j+length])
    return '\n'.join(temp)

just loop your sequences through it.

ADD COMMENT
0
Entering edit mode

Really nice approach! Thx :)

ADD REPLY
0
Entering edit mode

Not as Pythonic as it could be, but simple to understand.

ADD REPLY

Login before adding your answer.

Traffic: 1655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6