How to extract MAF block regions that have full alignment?
0
0
Entering edit mode
2.9 years ago
nattzy94 ▴ 60

Hi,

I am trying to assess the evolutionary conservation of putative small protein coding regions. I am following this paper which does a similar thing. This is the relevant bit from the paper:

We downloaded previously generated multiple-genome alignments for 120 mammals (Hecker and Hiller, 2020) and we extracted aligned regions for each of the human Ribo-seq ORFs. Only species where the ORF region could be fully aligned were included in each alignment. As a result, 99.5% and 97.23% of the ORFs were aligned to at least one primate species and a non-primate mammalian species respectively.

I am trying to implement this on an alignment of 20 mammal species first. I am using the following python code to obtain the alignment for a putative ORF regions:

from Bio import AlignIO
from Bio.AlignIO import MafIO
from Bio.AlignIO.MafIO import MafIndex

idx = MafIO.MafIndex("chr1.mafindex", "chr1.maf", "hg38.chr1")
results = idx.search([108692617], [108692757])
AlignIO.write(results, "chr1.fa", "fasta")

This gives me 3 alignments that overlap with the genomic positions that I specified. It looks like this:

Alignment with 18 rows and 323 columns
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt hg38.chr1
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt panTro4.chr1
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...Agt panPan1.JH650206
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt gorGor3.chr1
TTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...AGT ponAbe2.chr1
CTTCTGCGT--GGAGTTCTCGCGGTCTGGGTTTCGCTGTCTGCT...agt nomLeu3.chr12
CTTCTGCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTTTGCT...AGT rheMac3.chr1
CTTCTGCGT--GGAGTTCTCGGGGACTGGGTTTCGCTGTTTGCT...agt macFas5.chr1
CTTCTGCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTTTGCT...agt papAnu2.chr1
CTTCTGCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTCTGCT...agt chlSab2.chr20
CTTCTCCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTCTGCT...agt nasLar1.chr1
CTTCTCCGT--GGAGTTCTCGCGGACTGGGTTTCGCTGTCTGCT...agt rhiRox1.KN294410v1
CTTCTGCGT--GGAGTTCTCGCGGTCTGCGTTTCGCTGTGTGCT...agt calJac3.chr7
CTTCTGCGT--GGAGTTCTCGCGGTCTGCGTTTCGCTGTCTGCT...AGT saiBol1.JH378161
CTGCTGCCT--GGAGTTCTCGTGGGTCGCTTCTCGCCGTCTGCT...agt micMur1.scaffold_8310
CTTCTTCCT--GGAGATCTCGCGGGCCGCTTCTCAGCGTCTGCT...AGT otoGar3.GL873529
--TTTGCGTCCGGAGTTCTCCCCGCCTCGTTTTTGGCGTCCGTC...AAT mm10.chr3
CTTCTGCCG--GGAGTTCTCGCGGCCCGCGCTTCGGCGTCTACG...agt canFam3.chr6
Alignment with 19 rows and 22 columns
---gcggcggcggcggcgCTAC hg38.chr1
---gcggcggcggcggcGCCAC panTro4.chr1
---gcggcggcggcggcgccaC panPan1.JH650206
---gcggcggcggcggcGCTAC gorGor3.chr1
---GCGGCGGCGGCGGCGCTAC ponAbe2.chr1
---gcggcggcggcggcGCTAC nomLeu3.chr12
---GCGGCGGCGGCGGCGCTAC rheMac3.chr1
---gcggcggcggcggcgCTAC macFas5.chr1
---gcggcggcggcggcgctaC papAnu2.chr1
---gcggcggcggcggcgCTAC chlSab2.chr20
---gcggcggcggcggcgCTAC nasLar1.chr1
---gcggcggcggcggcgCTAC rhiRox1.KN294410v1
---gcggcggcggcggcGCCAC calJac3.chr7
---GCGGCGGCGGCGGCGCCAC saiBol1.JH378161
---gcggcggcggcggcgCCAC tarSyr2.ABRT02207888v1
---gcggcggcggcggcGCCAC micMur1.scaffold_8310
---gcggcggcggcggcgCCAC otoGar3.GL873529
---GCGGCGGCGGCGGCGCCAC mm10.chr3
gcggcggcggcggcggcgCCAC canFam3.chr6
Alignment with 20 rows and 289 columns
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG hg38.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG panTro4.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG panPan1.JH650206
CAAGCCGGCGGTCTCCGGAAAGCAGGGCAA-TGTGCTCCCGCTC...TTG gorGor3.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG ponAbe2.chr1
CAAGCCGGCGGTCTCCGGCAAGCAGGGCAA-TGTGCTCCCTCTC...TTG nomLeu3.chr12
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG rheMac3.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG macFas5.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG papAnu2.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG chlSab2.chr20
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG nasLar1.chr1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTC...TTG rhiRox1.KN294410v1
CAAGCCGGCGGTCTCGGGGAAGCAGGGCAA-TGTGCTCCCGCTG...TTG calJac3.chr7
CAAGCCAGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTCCCGCTG...TTG saiBol1.JH378161
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTGCTGCCGCTG...TTG tarSyr2.ABRT02207888v1
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTCCTGCCGCTG...TTG micMur1.scaffold_8310
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAA-TGTCCTGCCACTG...CTG otoGar3.GL873529
CAAGCCGGCGGTCTCGGGCAAGCAGGGCAATTGTGCTGCCGCTT...TTG tupBel1.scaffold_143498.1-152902
CAAGCCGGCGGTGTCGGGCAAGCAGGGCAA-TGTGCTGCCGCTG...TTG mm10.chr3
CAAGCCGGCGGTCTCGGGCAAGCAGGGGAA-TGTGCTGCCGCTG...CCG canFam3.chr6

How do I manipulate the alignment so that I am able to obtain ORF regions where there is full alignment? Using the above as an example, the multiple alignment is actually in 3 parts. Do I stitch the 3 alignments together and basically remove species which have a '-' in the alignment?

species multiple alignment python • 484 views
ADD COMMENT

Login before adding your answer.

Traffic: 1818 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6