Hi,
I am trying to assess the evolutionary conservation of putative small protein coding regions. I am following this paper which does a similar thing. This is the relevant bit from the paper:
We downloaded previously generated multiple-genome alignments for 120 mammals (Hecker and Hiller, 2020) and we extracted aligned regions for each of the human Ribo-seq ORFs. Only species where the ORF region could be fully aligned were included in each alignment. As a result, 99.5% and 97.23% of the ORFs were aligned to at least one primate species and a non-primate mammalian species respectively.
I am trying to implement this on an alignment of 20 mammal species first. I am using the following python code to obtain the alignment for a putative ORF regions:
from Bio import AlignIO
from Bio.AlignIO import MafIO
from Bio.AlignIO.MafIO import MafIndex
idx = MafIO.MafIndex("chr1.mafindex", "chr1.maf", "hg38.chr1")
results = idx.search([108692617], [108692757])
AlignIO.write(results, "chr1.fa", "fasta")
This gives me 3 alignments that overlap with the genomic positions that I specified. It looks like this:
Alignment with 18 rows and 323 columns
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt hg38.chr1
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt panTro4.chr1
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...Agt panPan1.JH650206
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt gorGor3.chr1
TTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...AGT ponAbe2.chr1
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt nomLeu3.chr12
CTTCTGCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTTTGCT...AGT rheMac3.chr1
CTTCTGCGT--GGAGTTCTCGGGGACTGGGTTTCGCTGTTTGCT...agt macFas5.chr1
CTTCTGCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTTTGCT...agt papAnu2.chr1
CTTCTGCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTCTGCT...agt chlSab2.chr20
CTTCTCCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTCTGCT...agt nasLar1.chr1
CTTCTCCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTCTGCT...agt rhiRox1.KN294410v1
CTTCTGCGT--GGAGTTCTCGCGGTCTGCGTTTCGCTGTGTGCT...agt calJac3.chr7
CTTCTGCGT--GGAGTTCTCGCGGTCTGCGTTTCGCTGTCTGCT...AGT saiBol1.JH378161
CTGCTGCCT--GGAGTTCTCGTGGGTCGCTTCTCGCCGTCTGCT...agt micMur1.scaffold_8310
CTTCTTCCT--GGAGATCTCGCGGGCCGCTTCTCAGCGTCTGCT...AGT otoGar3.GL873529
--TTTGCGTCCGGAGTTCTCCCCGCCTCGTTTTTGGCGTCCGTC...AAT mm10.chr3
CTTCTGCCG--GGAGTTCTCGCGGCCCGCGCTTCGGCGTCTACG...agt canFam3.chr6
Alignment with 19 rows and 22 columns
---gcggcggcggcggcgCTAC hg38.chr1
---gcggcggcggcggcGCCAC panTro4.chr1
---gcggcggcggcggcgccaC panPan1.JH650206
---gcggcggcggcggcGCTAC gorGor3.chr1
---GCGGCGGCGGCGGCGCTAC ponAbe2.chr1
---gcggcggcggcggcGCTAC nomLeu3.chr12
---GCGGCGGCGGCGGCGCTAC rheMac3.chr1
---gcggcggcggcggcgCTAC macFas5.chr1
---gcggcggcggcggcgctaC papAnu2.chr1
---gcggcggcggcggcgCTAC chlSab2.chr20
---gcggcggcggcggcgCTAC nasLar1.chr1
---gcggcggcggcggcgCTAC rhiRox1.KN294410v1
---gcggcggcggcggcGCCAC calJac3.chr7
---GCGGCGGCGGCGGCGCCAC saiBol1.JH378161
---gcggcggcggcggcgCCAC tarSyr2.ABRT02207888v1
---gcggcggcggcggcGCCAC micMur1.scaffold_8310
---gcggcggcggcggcgCCAC otoGar3.GL873529
---GCGGCGGCGGCGGCGCCAC mm10.chr3
gcggcggcggcggcggcgCCAC canFam3.chr6
Alignment with 20 rows and 289 columns
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG hg38.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG panTro4.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG panPan1.JH650206
CAAGCCGGCGGTCTCCGGAAAGCAGGGCAA-TGTGCTCCCGCTC...TTG gorGor3.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG ponAbe2.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCTCTC...TTG nomLeu3.chr12
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG rheMac3.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG macFas5.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG papAnu2.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG chlSab2.chr20
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG nasLar1.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG rhiRox1.KN294410v1
CAAGCCGGCGGTCTCGGGGAAGCAGGGCAA-TGTGCTCCCGCTG...TTG calJac3.chr7
CAAGCCAGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTG...TTG saiBol1.JH378161
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTGCCGCTG...TTG tarSyr2.ABRT02207888v1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTCCTGCCGCTG...TTG micMur1.scaffold_8310
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTCCTGCCACTG...CTG otoGar3.GL873529
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAATTGTGCTGCCGCTT...TTG tupBel1.scaffold_143498.1-152902
CAAGCCGGCGGTGTCGGGCAAGCAGGGCAA-TGTGCTGCCGCTG...TTG mm10.chr3
CAAGCCGGCGGTCTCGGGCAAGCAGGGGAA-TGTGCTGCCGCTG...CCG canFam3.chr6
How do I manipulate the alignment so that I am able to obtain ORF regions where there is full alignment? Using the above as an example, the multiple alignment is actually in 3 parts. Do I stitch the 3 alignments together and basically remove species which have a '-' in the alignment?