Hi,
I performed RNA sequencing using a poly(A) 3' tagging/sequencing approach. I therefore expect the sensible reads to map only to the 3' end of the transcripts in my sample. I want to subset the GENCODE gene definitions to include only the 3' UTR plus 1 kb of the adjacent exon sequence. What is the best way to do this? I had the following ideas:
1) "grep" the GENCODE definition files for "UTR" lines, then find exons whose coordinates are immediately adjacent, and keep going backwards (to get more neighboring exons) until I get my 1kb.
2) "grep" the GENCODE files for "stop_codon" lines, then keep getting exons whose coordinates are immediately adjacent to the "stop_codon" coordinates, until I get my 1kb.
3) find the "transcript" lines of the GENCODE file, try to match them to the "UTR" lines, then select the last 1kb of the 'transcript' definitions (and add on the coordinates for the UTR).
Besides figuring out the best way to get these 3' end coordinates, I also have the following questions:
1) Should all transcript definitions have a "UTR" line in the GENCODE definition files? 2) Should all "UTR" definitions have adjacent "exons"?
Thanks!
My experience is that UTR annotation is not as good as you would hope. Are you using Lexogen Quantseq by any chance?
Nope, it's a new technology. What would you suggest?
Nothing conclusive yet, but I'm trying things like extending my UTR regions by 1kb, starting from the stop codon (my sequencing is stranded, so that's quite safe).
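In case it helps, here is a rough Python sketch of the two coordinate calculations discussed in this thread. It is only an illustration, not a tested pipeline: the function names are made up, the coordinates are assumed to be 1-based and inclusive as in a GTF file, and you would still need to parse the GENCODE file and group the exon and stop_codon lines by transcript_id yourself. The first function walks back from the 3' end of a transcript across exons until it has collected a given number of exonic bases; the second returns a plain genomic window downstream of the stop codon, which ignores introns and may run past the annotated transcript end.

```python
def last_exonic_bases(exons, strand, length=1000):
    """Exon pieces covering the final `length` exonic bases at the 3' end.

    exons: list of (start, end) tuples for ONE transcript,
           assumed 1-based inclusive (GTF convention).
    """
    # Start from the 3' end: highest coordinates on '+', lowest on '-'.
    ordered = sorted(exons, reverse=(strand == "+"))
    remaining = length
    pieces = []
    for start, end in ordered:
        if remaining <= 0:
            break
        take = min(end - start + 1, remaining)
        if strand == "+":
            pieces.append((end - take + 1, end))   # keep the right-hand part
        else:
            pieces.append((start, start + take - 1))  # keep the left-hand part
        remaining -= take
    # Transcripts shorter than `length` are simply returned whole.
    return sorted(pieces)


def extend_from_stop(stop_start, stop_end, strand, length=1000):
    """Genomic window of up to `length` bases just downstream of the stop codon.

    This ignores splicing, so it is only reasonable when the 3' UTR is
    assumed to be unspliced; the window is clipped at position 1.
    """
    if strand == "+":
        return (stop_end + 1, stop_end + length)
    return (max(1, stop_start - length), stop_start - 1)


if __name__ == "__main__":
    # Hypothetical transcript with exons of 200, 600 and 400 bases.
    exons = [(100, 299), (500, 1099), (2000, 2399)]
    print(last_exonic_bases(exons, "+", 700))
    print(last_exonic_bases(exons, "-", 700))
    print(extend_from_stop(5000, 5002, "+"))
    print(extend_from_stop(5000, 5002, "-"))
```

The resulting intervals could then be written out as BED or GTF lines for counting. With an unstranded library the downstream window would need to be clipped against neighbouring genes; with a stranded library that matters mainly for genes on the same strand.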