I downloaded the RepeatMasker track for the mm9 genome from the Tables section of the UCSC genome browser. I get entries like these:
#bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id
607 687 174 0 0 chr1 3000001 3000156 -194195276 - L1_Mur2 LINE L1 -4311 1567 1413 1
My question is: how can I find the reference sequence that this particular genomic location was aligned to? I understand that the Smith-Waterman alignment score is the result of aligning this piece of the genome to a reference, but it is the actual reference sequence of this particular repeat that I'm trying to find. How can it be accessed?
EDIT: To elaborate on this, consider the above example where an "L1_Mur2" type repetitive element is described. I got the most recent RepBase I could find (release 20110920), and this yields:
$ grep "L1_Mur2" RepeatMaskerLib.embl
ID L1_Mur2_5end repeatmasker; DNA; ???; 970 BP.
CC L1_Mur2_5end DNA
DE RepbaseID: L1_Mur2_5end
ID L1_Mur2_orf2 repeatmasker; DNA; ???; 4675 BP.
CC L1_Mur2_orf2 DNA
DE RepbaseID: L1_Mur2_orf2
ID L1_Mur2_3end repeatmasker; DNA; ???; 1463 BP.
CC L1_Mur2_3end DNA
DE RepbaseID: L1_Mur2_3end
Therefore, there are three entries for this type of element: a consensus for the 5' end, the 3' end, and for ORF2 of the element. This makes sense, but how can I tell from the UCSC repeatmasker line which of those was the consensus that the element in question was aligned to? I.e., how can I tell if it was a match to the 5' end, the 3' end or the ORF? These would be very different and I don't know how to tell that. Any ideas on this?
Also, are the coordinates repStart, repEnd, repLeft in the coordinate space of the reference or of the genome? It sounds to me from googling that it is the former, but in that case it seems impossible to interpret without having the reference sequence -- we don't know how long it is, for example, just by looking at this table, right?
Finally, I was hoping someone could explain what the milliDiv, milliDel, and milliIns fields are and what their units mean.
Thank you.
Thanks so much for the great explanation. I added a follow-up question on this, giving an example where it's not clear to me how to map the UCSC table to RepBase. Any ideas on this?
Awesome reply. Everything makes sense now :)
Hi Casey, thanks for the explanation. One thing I found is that the 2nd, 3rd, and 4th columns of the output are not multiplied by 1000; the factor is actually 10. Please refer to Greg's comment at http://genome.soe.ucsc.narkive.com/1AQHfEqU/question-about-the-repeatmasker-track.
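To make the factor-of-10 point concrete, here is a minimal Python sketch. It assumes the UCSC schema description of the rmsk table: milliDiv, milliDel and milliIns are parts per thousand, and on the minus strand repStart and repLeft swap roles (repStart holds minus the bases remaining in the consensus, repLeft holds the start). The function names are made up for illustration, and the example values assume the last fields of the row in the question read repStart=-4311, repEnd=1567, repLeft=1413.

```python
def milli_to_percent(milli):
    """Convert a parts-per-thousand value (e.g. milliDiv) to a percentage."""
    return milli / 10.0


def consensus_interval(strand, rep_start, rep_end, rep_left):
    """Return (start, end, implied_consensus_length) in consensus coordinates.

    Assumed convention (from the UCSC rmsk schema description):
      '+' strand: repStart = start, repLeft  = -(bases after the match)
      '-' strand: repLeft  = start, repStart = -(bases after the match)
    """
    if strand == "+":
        start, remaining = rep_start, -rep_left
    else:
        start, remaining = rep_left, -rep_start
    # The match ends at rep_end, so the consensus is that plus what remains.
    return start, rep_end, rep_end + remaining


if __name__ == "__main__":
    # milliDiv from the example row in the question
    print(milli_to_percent(174))
    # minus-strand repeat coordinates, as assumed above
    print(consensus_interval("-", -4311, 1567, 1413))
```

Under these assumptions, the consensus length implied by a row is repEnd plus the bases remaining, so it can be estimated from the table alone and compared against the lengths of the candidate RepBase entries.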