10.5 years ago
andrew.j.skelton73
I've run the Tuxedo pipeline on some RNA-seq data and I'm confused about what the Locus is meant to represent. As you can see in my example below, all the entries have the same locus (a region of 19,241 bases), but everything else about them is different: Tracking ID, Transcript ID, TSS ID, etc.
I thought it might have been linked to the XLOC ID, but if there are multiple XLOC IDs per Locus then that doesn't make sense. Does anyone know how the "Locus" field is determined in the Tuxedo package?
The manual states: "Genomic coordinates for easy browsing to the object"
Hint: Look at that region in a genome browser. Note how there are multiple overlapping genes...
Yes, those genes overlap, and the locus is there so that you can see that region in a genome browser; that part I get. My question is how the locus is determined. The above example shows two different gene names within the same locus, under two different XLOC codes.
When you visualise that in a genome browser you see this:
There are no overlapping transcripts between.
The only real answer would be to look through the cufflinks source code, since this isn't documented anywhere. I would guess that these are merged into a single locus for processing because the annotation file you gave to cufflinks, likely combined with the modifications it made to the annotated transcripts given your alignments, produced possibly overlapping features (genes in this case) that might need to be processed as a single unit. If you used an unstranded library where WASH7P was expressed, then cufflinks might have just merged that, DDX11L1, and MIR1302-10 into a single transcript, in which case treating the whole region as a single locus would make more sense. I suspect that cufflinks pre-bins the genome according to possible cases like this and then processes them separately, often producing multiple final loci. That's a slightly educated guess, at least.
Welcome to the wonderful world of completely undocumented features :P
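To make the guess above concrete, here is a minimal sketch of how chaining overlapping features into one bundle could give two non-overlapping genes the same locus. This is not taken from the Cufflinks source; the function name `bundle_overlapping`, the gene names, and the coordinates are all made up for illustration.

```python
# Hypothetical illustration only: NOT the actual Cufflinks algorithm.
# Features are (name, chrom, start, end); strand is deliberately ignored,
# mimicking what might happen with an unstranded library.
FEATURES = [
    ("geneA", "chr1", 100, 500),
    ("geneB", "chr1", 400, 900),    # overlaps geneA and geneC
    ("geneC", "chr1", 850, 1200),   # does not overlap geneA directly
    ("geneD", "chr1", 5000, 6000),  # isolated
]


def bundle_overlapping(features):
    """Chain features that overlap (transitively) into shared bundles."""
    bundles = []
    # Sort by chromosome, then start, so overlaps are always with the last bundle.
    for name, chrom, start, end in sorted(features, key=lambda f: (f[1], f[2], f[3])):
        last = bundles[-1] if bundles else None
        if last and last["chrom"] == chrom and start <= last["end"]:
            # Extend the current bundle to cover this feature.
            last["end"] = max(last["end"], end)
            last["names"].append(name)
        else:
            bundles.append({"chrom": chrom, "start": start, "end": end, "names": [name]})
    return bundles


bundles = bundle_overlapping(FEATURES)
for b in bundles:
    print(f'{b["chrom"]}:{b["start"]}-{b["end"]}', b["names"])
```

In this toy input, geneA and geneC never overlap each other, yet both end up in the bundle spanning chr1:100-1200 because geneB bridges them. If the reported locus were the bundle's coordinates rather than each gene's own, several XLOC IDs would share it.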
Hi, have you found the reason why multiple XLOC IDs have the same locus? I recently ran cuffnorm and the output has the same locus for multiple XLOC IDs.
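A quick way to see how widespread this is in your own output is to group gene IDs by locus. The sketch below assumes a tab-delimited table with `gene_id` and `locus` columns, as in Cufflinks-style tracking files; the rows in `SAMPLE` are invented, and `gene_ids_per_locus` is a hypothetical helper, not part of any Tuxedo tool.

```python
import csv
import io
from collections import defaultdict

# Invented rows standing in for a real tracking/attribute table.
SAMPLE = "\n".join([
    "tracking_id\tgene_id\tgene_short_name\tlocus",
    "XLOC_000001\tXLOC_000001\tgeneA\tchr1:100-1200",
    "XLOC_000002\tXLOC_000002\tgeneB\tchr1:100-1200",
    "XLOC_000003\tXLOC_000003\tgeneD\tchr1:5000-6000",
]) + "\n"


def gene_ids_per_locus(handle):
    """Map each locus string to the set of gene IDs reported for it."""
    by_locus = defaultdict(set)
    for row in csv.DictReader(handle, delimiter="\t"):
        by_locus[row["locus"]].add(row["gene_id"])
    return by_locus


# Keep only loci that are shared by more than one gene ID.
shared = {
    locus: sorted(ids)
    for locus, ids in gene_ids_per_locus(io.StringIO(SAMPLE)).items()
    if len(ids) > 1
}
print(shared)
```

To run it on real output, replace `io.StringIO(SAMPLE)` with an open file handle for your own table, adjusting the column names if they differ.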
I never did find out why, but I suspect Devon's answer above is on the money about binning chunks. I'd honestly suggest you stay away from the Tuxedo pipeline and try DESeq2's workflow, or even Kallisto+Sleuth for isoform-level events.