Hi!
I have some bam files aligned to mouse reference genome. I want to analyze some motif enrichment in the 3'UTR of this samples but I realize that some mRNA have longer 3'UTR than the reference genome. I double checked this an is something reported in platelets that is what I'm working.
Is there any way to get the full 3'UTR using, for example coverage and expand the preexistent 3'UTR bed file? I was thinking on a kind of loop that check coverage every 100bp and, if coverage is > threshold, append the sequence... Any other ideas? Actually I am not quite sure that the previous idea would work...
Thanks!
I think what you mean is that the actual 3'UTR is longer than the annotated in the reference file (GTF). Better use a dedicated tool, such as
stringtie
to perform reference-guided transcriptome assembly and then check if you indeed find transcripts that extend beyond the annotated 3'UTRs. Still, keep in mind that for motif enrichment your query sequences should not be excessively long to avoid motif matches by chance. Homer for example uses 200bp as a default window.Or use a tool that does comparative analysis: DREAM will allow you to put in a set of negative sequences and find things that are enriched in your postive set vs the negative set, free of length confounding.
I tried
stringtie
with gencode gtf for mouse. The file I got does not have any line with UTR feature. The gtf contains this information and also when I check the bam files in IGV I see reads in the UTRs. Any idea what is going on?stringtie
has no way of distinguishing between coding and non-coding exons. Thus is can identify where the exons are, but not which bit of them are coding/non-coding. One solution would be to select transcripts from your new assembly that are identical to reference transcripts, except for the extended first/last exon, and copy across the CDS annotation from the reference transcript.Uhm, the feature column in the output should. At least is what it claims in the documentation, right?
Anyway, I am not sure what you are referring to. Is something like comparing the two gtf files (the mouse reference and the stringtie output) and get the lines from the second that are not in the first?
Thanks
I think you are miss-reading the manual. To me the bit I think you are reffering to is just a cut and paste job of the definition of the GTF format. I've never seen anything come out of
stringtie
other than transcript and exon lines. How would it know where the CDS/UTRs were?So, yes I was talking about comparing the GTF files, but you can't do it on a line by line format, you'd need to consider a whole transcript at once, which is multiple lines and compare transcripts by checking their intron chains. You might take a look at
cuffcompare
which will classify novel transcripts according to how they match known transcripts.