I want to split a concatenated alignment in nexus-format into the individual gene alignments. The gene boundaries are indicated below the alignment. Any programs that can do this? My goal is to create phylogenies for each individual gene.
Non-interleaved, each taxon on a long line. It seems that the file is written by Mesquite. I tried to open it in Mesquite, but it didn't recognize the partitions. The partitions are described after the alignment like this (not sure what the DEFTYPE line means though...):
Hmm, no I don't have any scripts on hand to deal with that situation sorry. Was going to try and write a little python script that pulls out the matrix block, then subsets into new nexus files based on the gene positions, but because of the number of genes I would also have to parse the assumption block. Thought it may have been a very fast script to write, but I should really study for these finals. Sorry I couldn't be any help, hopefully someone else finds a solution for you.
The only suggestion I can give is to play around with the Nexus parser in biopython, and try and twist it to solve this problem.
from Bio import SeqIO
import re
charset=[]
with open ("alignment.nexus", "rU") as f:
for line in f:
line=line.strip ()
if "CHARSET" in line:
line = line.split (" = ")
charset.append (line [1])
for c in charset:
with open (c+"_aligned.fas","w") as out:
for record in SeqIO.parse ("alignment.nexus", "nexus"):
m=re.search("(.*) - (.*) ;",c)
if m:
start = int(m.groups()[0])
end = int(m.groups()[1])
print >> out, ">"+record.id
print >> out, record.seq [start:end]
I get some errors though. Does the nexus parsers assume interleaved format?
Traceback (most recent call last):
File "nexus.py", line 19, in <module>
for record in SeqIO.parse ("est.holozoa.nex", "nexus"):
File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 586, in parse
for r in i:
File "//anaconda/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 580, in <genexpr>
I = (r for alignment in AlignIO.parse(fp, format,
File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/__init__.py", line 370, in parse
for a in i:
File "//anaconda/lib/python2.7/site-packages/Bio/AlignIO/NexusIO.py", line 40, in NexusIterator
n = Nexus.Nexus(handle)
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 587, in __init__
self.read(input)
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 638, in read
self._parse_nexus_block(title, contents)
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 679, in _parse_nexus_block
getattr(self, '_' + line.command)(line.options)
File "//anaconda/lib/python2.7/site-packages/Bio/Nexus/Nexus.py", line 899, in _matrix
"(check dimensions/interleaving)" % (id, c, iupac_seq))
I edited the file. it worked on a test nexus file for me. Are you sure that your nexus format is ok? If you try:
from Bio import SeqIO
for record in SeqIO.parse("alignment.nexus","nexus"):
print record.id
does that throw an error?
ADD REPLY
• link
updated 2.8 years ago by
Ram
44k
•
written 10.1 years ago by
Whetting
★
1.6k
0
Entering edit mode
Thanks, really appreciate this.
Yes this gives the same error. It says that the first taxon contains an illegal character and prints the whole alignment-line for that taxon, ending with (check dimensions/interleaving).
I believe my nexus file is correct, starts like this:
#NEXUS
BEGIN TAXA;
TITLE EST_Holozoa_Taxa;
DIMENSIONS NTAX=64;
END;
BEGIN DATA;
TITLE EST_Holozoa_Matrix;
DIMENSIONS NCHAR=88384;
FORMAT DATATYPE = Protein GAP = - MISSING = ?;
MATRIX
Oscarella_carmela LFYSFFKSLVGKEVMVELKNDLSICGTLHSVDQFLNIKLTDVSVPD....
....
;
END;
BEGIN ASSUMPTIONS;
CHARSET gene1 = 1 - 93 ;
CHARSET gene2 = 94 - 313 ;
...
OPTIONS DEFTYPE = unord PolyTcount = MINSTEPS ;
END;
In the above example, start and end are only defined in an if statement:
if m:
start = int(m.groups()[0])
end = int(m.groups()[1])
If your regular expression doesn't match, they won't be defined. This can be handled in a couple of ways depending on what you want to do (move the writing into the if statement, for example), but the problem lies here.
Do you have an interleaved nexus file or does each sample just have one long line with the sample name and then sequence?
Non-interleaved, each taxon on a long line. It seems that the file is written by Mesquite. I tried to open it in Mesquite, but it didn't recognize the partitions. The partitions are described after the alignment like this (not sure what the DEFTYPE line means though...):
An individual nexus file for each phylogeny? and also how many genes in that ASSUMPTIONS block?
The format of the individual files is not that important. I will run RAxML so phylip is the best. But nexus or fasta is also fine.
There are 406 genes in ASSUMPTIONS.
Hmm, no I don't have any scripts on hand to deal with that situation sorry. Was going to try and write a little python script that pulls out the matrix block, then subsets into new nexus files based on the gene positions, but because of the number of genes I would also have to parse the assumption block. Thought it may have been a very fast script to write, but I should really study for these finals. Sorry I couldn't be any help, hopefully someone else finds a solution for you.
The only suggestion I can give is to play around with the Nexus parser in biopython, and try and twist it to solve this problem.
Ok, but thanks a lot anyway,