Question

How does Ensembl decide what to mask?


10.1 years ago

dbweissman ▴ 10

Please excuse how basic this question is; I'm a bioinformatics newbie. I'm looking at the 7-way primate genome alignment in Ensembl release 76 (ftp://ftp.ensembl.org/pub/release-76/emf/ensembl-compara/epo_7_primate), and I don't understand how it decides which bases should be (soft-)masked. For example, in the first file (chr1_1.emf), around line 250 there is a run of T's followed by some other bases that is consistently masked for some genomes and not others, even though the sequences are identical. What's going on here?

masking Ensembl sequence genome alignment • 2.2k views

Updated 2.8 years ago by Ram 44k • written 10.1 years ago by dbweissman ▴ 10


Odds are good that the masking comes originally from RepeatMasker, in which case differences in the repeat databases used for each organism could cause what you're seeing (obviously, it would take someone from Ensembl to give you the real answer).

Updated 2.8 years ago by Ram 44k • written 10.1 years ago by Devon Ryan 104k

Ram · Accepted Answer · 2014-11-03


10.1 years ago

Denise CS ★ 5.2k

The columns you are referring to are from the predicted ancestral sequences, and we (Ensembl) don't repeat-mask ancestral sequences. In the EMF file, each column corresponds to one sequence, in the order given by the SEQ lines in the header. You can see at the beginning of the file that extant species are mixed with ancestral ones.
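To check which columns are ancestral and how much of each is soft-masked, something like the following rough Python sketch may help. It assumes the usual EMF block layout (SEQ header lines, a DATA line, one row per alignment position, and a closing //) and that ancestral sequences carry the organism name ancestral_sequences in their SEQ line. The function name summarize_emf_block and the toy block below are made up for illustration and are not taken from chr1_1.emf.

```python
from collections import namedtuple

# One record per SEQ line, i.e. per column of the DATA block.
Column = namedtuple("Column", "organism region lowercase total")


def summarize_emf_block(lines):
    """Count soft-masked (lowercase) bases per sequence in one EMF block."""
    seqs = []    # (organism, region) for each SEQ line, in column order
    lower = []   # lowercase base count per column
    total = []   # non-gap base count per column
    in_data = False
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "SEQ":
            seqs.append((parts[1], parts[2]))
            lower.append(0)
            total.append(0)
        elif parts[0] == "DATA":
            in_data = True
        elif parts[0].startswith("//"):
            break  # end of this alignment block
        elif in_data:
            # First field holds one character per sequence; any trailing
            # fields (e.g. scores) are ignored.
            for i, base in enumerate(parts[0][:len(seqs)]):
                if base.isalpha():  # skip gap characters such as '-'
                    total[i] += 1
                    if base.islower():
                        lower[i] += 1
    return [Column(o, r, l, t) for (o, r), l, t in zip(seqs, lower, total)]


# Toy block (invented data): two extant species and one ancestral sequence.
example = """\
SEQ homo_sapiens 1 1 4 1 (chr_length=4)
SEQ pan_troglodytes 1 1 3 1 (chr_length=3)
SEQ ancestral_sequences Ancestor_1_1 1 4 1 (chr_length=4)
DATA
ttT
ttT
t-T
AAA
//
""".splitlines()

for col in summarize_emf_block(example):
    kind = "ancestral" if col.organism == "ancestral_sequences" else "extant"
    print(col.organism, kind, f"{col.lowercase}/{col.total} soft-masked")
```

On a real file you would feed it the lines of a single alignment block; columns that report zero lowercase bases while their neighbours are masked should turn out to be the ancestral ones.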

Updated 2.8 years ago by Ram 44k • written 10.1 years ago by Denise CS ★ 5.2k


Ha, I should have noticed that the unmasked ones were all the ancestral sequences... Thanks!

Updated 2.8 years ago by Ram 44k • written 10.1 years ago by dbweissman ▴ 10